Introduction

Recent advancements in the field of deep learning (DL) have catalyzed innovative approaches to materials science1,2,3,4,5,6,7,8,9,10,11,12. In this regard, there have been a number of attempts to achieve DL-driven materials analyses and characterizations using one-dimensional (1D) powder X-ray diffraction (XRD) patterns. The 1D powder XRD pattern has long been regarded as an archetypal form of crystal structure data, and DL approaches for inorganic materials have accordingly focused on it, owing to its popularity and familiarity in the materials science community13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38, although several successful DL approaches to 2D (X-ray or electron) diffraction patterns have also been reported39,40,41,42,43. Among the 1D powder XRD-based approaches, symmetry classification13,14,15,16,17,18,19,20,21,44, phase identification (demixing)22,23,24,25,26,27,28,45,46, property regression21,29,30, and clustering and XRD generation21,31,32,33 have been successful for inorganic materials, along with some earlier pioneering machine learning (ML)-based XRD analyses34,35,36,37,38. Although the XRD-driven ML/DL approach does not yet outperform the traditional knowledge-driven approaches based on well-established, rule-based software packages47,48,49,50,51,52,53, it is expected to improve substantially along with the rapid progress in DL techniques.

The majority of current XRD-driven ML/DL approaches employ convolutional neural network (CNN)-based methodologies, treating the powder XRD pattern as nothing but a conventional 1D image, with characteristic feature extraction occurring across the CNN layers. While 1D powder XRD data has proven instrumental for symmetry classification and phase identification, it has shown limited efficacy in property regression20. In contrast, the crystal graph convolutional neural network (CGCNN) and its subsequent variants have demonstrated aptitude in the property regression domain11,54,55,56. Namely, the former excels in symmetry classification but falters in property regression, while the latter exhibits the opposite trend20. This performance trade-off between the XRD-based CNN and the descriptor-based CGCNN does not appear easy to eliminate.

The CGCNN integrates local structural data, represented through nodes and edges. These nodes and edges encapsulate knowledge-driven descriptors, detailing constituent elements and bond characteristics54,55. Furthermore, an improved graph-based model, the Connectivity Optimized Nested Graph Network (coGN)57, was introduced very recently. It incorporates all crystal symmetry operations for an asymmetric unit and substantially improves property regression for inorganic crystals, setting the state-of-the-art (SOTA) record in benchmarks58, as presented in Supplementary Fig. 1. Nonetheless, coGN’s proficiency in symmetry classification remains notably inferior to that of the XRD-based CNN.

In this study, our primary emphasis is on exploring alternative DL models for symmetry classification and property regression. We deliberately eschew the use of XRD-based CNN and CGCNN-related approaches, implying that the 1D powder XRD data and graph-type data, typically employed in the previous models, are not the focal points of our investigation. Instead, we adopt the 3D electron density (charge density) in the real lattice as training data for our DL models. To adeptly handle this 3D electron density data, we incorporate DL architectures such as PointNet59 and Sparse 3D CNN60,61,62. Although conventional 3D CNNs have been utilized for property regression within specific material systems, including several face-centered cubic (fcc) materials63, Sparse 3D CNNs, which are capable of handling large-scale datasets, have not been a focal point in materials science. In contrast, there have been PointNet-based DL approaches driven by 3D crystal structural data. For example, PointNet has been applied in crystal structure-related studies, particularly in the context of molecular dynamics (MD)64,65. In recent developments like DeepCrysTet66, crystal structures have been innovatively modeled using 3D tetrahedral meshes, a process facilitated by Delaunay tetrahedralization. Following this, a new methodology known as Neural Structure Field (NeSF) emerged, addressing the limitations associated with voxelization67. However, these previous 3D methodologies considered each atom as a point, forming a point cloud with atomic coordinates and features. The concept of treating electron density at arbitrary lattice locations as a point cloud has not been previously explored.
Our electron-density-driven approach marks a significant breakthrough in the realm of DL-driven 3D crystal structure analysis by emphasizing electron density as a pseudo continuous entity, rather than merely considering it as discrete atomic positions within the lattice.

Focusing on 3D electron density data may be more advantageous than depending solely on atomic arrangement data, despite both being derived from theoretical crystal structure solutions (such as atomic coordinates, occupancy, thermal factors, and lattice parameters), originally obtained from experimental data diffracted in reciprocal space. Utilizing 3D electron density data would be particularly practical, especially in light of experimental techniques like electron tomography68,69,70. Our primary goal extends beyond merely predicting material properties from known crystal structure solutions. Instead, we aim to identify unknown materials through their experimental electron density patterns. While the techniques for measuring 3D electron density directly are still in progress and not yet fully established, the DL approach utilizing 3D electron density data remains promising due to the ongoing progress in developing these measurement methodologies.

We compiled two unique datasets of 3D electron density: the first, derived from CHGCAR files based on Density Functional Theory (DFT) calculations from the Materials Project (MP)71, and the second, obtained from crystal structure solution data from the Inorganic Crystal Structure Database (ICSD)72, predominantly validated experimentally. These datasets were transformed into point cloud and sparse tensor formats for use with PointNet and Sparse 3D convolutional networks, respectively. Our research conducted a thorough comparative analysis of various DL models including Fully Convolved Neural Network (FCN)73, CGCNN, coGN, PointNet, and Sparse 3D CNN, with diverse training data types such as 1D powder XRD, graph-type data, and 3D electron density data. This comparative study was systematically detailed, with particular emphasis on the finding that the Sparse 3D CNN model, especially when utilizing ICSD-derived 3D electron density data, demonstrated superior performance compared to other models.

Results

Previous XRD-driven DL approaches

Symmetry classification in crystallography involves identifying the crystal system (seven classes), extinction group (101 classes), and space group (230 classes) by analyzing powder XRD patterns using CNN-based deep learning models. Schuetzke et al.19 expanded the scope of symmetry classification to encompass structure type classification, incorporating 1D spectral data from XRD, Nuclear Magnetic Resonance (NMR), and Raman scattering. Structure types, or prototype structures, are categorized by the ICSD following guidelines by Allmann and Hinek74. Our study, however, focuses on classifying the crystal system, extinction group, and space group. The crystal system is typically identified from powder XRD patterns using commercial indexing tools like ITO47, TREOR48, DICVOL49, McMaille50, EXPO51, and X-CELL52. Space group determination is more complex than simple indexing, as systematic absences in powder XRD patterns may suggest multiple space group candidates within an extinction group. While powder XRD patterns do not guarantee space group identification for inorganic compounds, single crystal diffraction data usually enable this. Powder XRD data, therefore, are mainly used to ascertain the crystal system, extinction group, and lattice parameter. In parallel to traditional XRD analysis, DL approaches face similar challenges for space group determination. While test accuracy for crystal system classification exceeds 90% across various DL-driven approaches13,14,15,16,17,18,19,20,21,44, accuracy for space group identification often falls far below 90%. Despite the difficulty in making direct comparisons due to varying training and testing datasets, the state-of-the-art (SOTA) test accuracy for powder-XRD-based crystal system classification was reported at 94.9%13. However, this high accuracy was achieved using a researcher-curated dataset comprising only 150k ICSD entries (the number of inorganic compounds in the training dataset). 
Since this initial foray into DL for powder XRD analysis, numerous attempts have been made using larger datasets14,15,16,17,18,19,20,21,44.
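The role of systematic absences mentioned above can be illustrated with the textbook reflection conditions for lattice centering. The sketch below is only an illustration, not the indexing procedure of any of the cited tools: the function name `is_allowed` is hypothetical, and only the P, I, and F centering conditions are covered (glide planes and screw axes, which add further conditions, are omitted).

```python
def is_allowed(h, k, l, centering):
    """Return True if reflection (h, k, l) is allowed by lattice centering alone.

    Only the standard centering conditions are checked (assumption: no glide
    planes or screw axes, which would impose additional absences).
    """
    if centering == "P":   # primitive: no centering condition
        return True
    if centering == "I":   # body-centered: h + k + l must be even
        return (h + k + l) % 2 == 0
    if centering == "F":   # face-centered: h, k, l all even or all odd
        return len({h % 2, k % 2, l % 2}) == 1
    raise ValueError(f"unsupported centering: {centering}")


reflections = [(1, 0, 0), (1, 1, 0), (1, 1, 1), (2, 0, 0), (2, 1, 1)]
for centering in "PIF":
    allowed = [hkl for hkl in reflections if is_allowed(*hkl, centering)]
    print(centering, allowed)
```

Because several space groups can share exactly the same set of absent reflections, a pattern that satisfies one such rule narrows the choice to an extinction group rather than to a single space group.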

Suzuki et al.16 made a significant observation that most misclassified cases were in lower symmetry groups such as orthorhombic, monoclinic, and triclinic, a finding later reaffirmed by Lee et al.20. Interestingly, Park et al.13 had excluded low symmetry entries from their dataset, unaware of Suzuki et al.’s16 discovery, which likely inflated the test accuracy for crystal system classification. Therefore, the more realistic SOTA test accuracy for crystal system classification is around 93.06%72, as determined using a more comprehensive dataset that includes almost all entries (197,131 inorganic compounds) from the ICSD as of 2023. This dataset has been utilized in two subsequent studies20,21, where FCNs were applied for symmetry classification. FCNs differ from the conventional CNNs used in previous DL-driven XRD analyses, which commonly include several fully connected layers. The FCN method proved more effective than the conventional CNN approach, achieving higher performance with even fewer parameters20. In these studies, the 1D powder XRD-driven FCN approach achieved a SOTA test accuracy of 93.06% for crystal system classification, considered the current upper limit21. Interestingly, despite the transformer’s known superiority over CNN in certain image-related DL tasks75,76, it did not surpass FCN in XRD-driven symmetry classification20.

Table 1 summarizes recent DL-driven results in symmetry classification and phase identification. Generalizability refers to whether the training dataset encompasses the entirety of inorganic compounds or just a small portion. Most phase identification studies have used a narrow range of materials within a limited composition space, leading to low generalizability. For example, Lee et al.’s22 study focused on the Li–Al–Si–O–N composition system, Oviedo et al.’s14 on thin-film metal halides limited to seven space-group categories, Szymanski et al.’s26 on the Li–Mn–Ti–O–F composition space, Massuyeau et al.’s24 on chlorides, bromides, and iodides in perovskite or non-perovskite structures, and Maffettone et al.’s27 on specific systems like BaTiO3 phase transition, ADTA crystal structure prediction, and phase mapping of the Ni–Co–Al alloy system. In contrast, DL models for symmetry classification tend to cover a broader range of inorganic materials registered in the ICSD. Studies by Park et al.13, Vecsei et al.15, Suzuki et al.16, Lee et al.20,21, and Salgado et al.44 have almost fully encompassed inorganic compounds from the ICSD for training and testing their DL models in symmetry classification.

Table 1 The recent advances in DL-driven crystallographic symmetry classification and phase identification

The perturbation column in Table 1 lists the various perturbations applied to make simulated XRD patterns more realistic. Oviedo et al.14 considered physics-informed data perturbations common in thin film samples, like texture and epitaxial strain. Maffettone et al.27 used an ensemble of 50 CNN-based classifiers to efficiently handle perturbations without combinatorial explosion. Wang et al.28 synthesized new training spectra by merging perturbations extracted from experimental data with theoretical spectra. Szymanski et al.26 employed physics-informed data augmentation, incorporating perturbations into synthetic XRD patterns considering domain size, lattice parameters, and random preferred orientation. In their method, removing identified phases from blended XRD aided in determining all constituent phases26. Schuetzke et al.18 aimed for well-perturbed XRD patterns resembling real-world samples and systematically studied the impact of various perturbation parameters on the F1 score. Additionally, some studies have directly incorporated perturbations into DL models rather than relying on perturbed data augmentation. For instance, Chen et al.25 introduced a deep reasoning network (DRN) that merges DL with constraint reasoning to include prior scientific knowledge. This approach required only a minimal amount of unlabeled data to address perturbations, as these are integrated into the DRN’s latent embedding layer, reducing the need for a large number of synthetic XRD patterns with perturbations. The data size column in Table 1 shows the total number of XRD patterns used in the training dataset, and the remaining columns give the number of classes and the corresponding hold-out test accuracies.
Most high-accuracy results in phase identification and symmetry classification were achieved using conventional CNN models, including inception nets77 and residual nets78, while Vecsei et al.15 argued for the superiority of a basic ANN with a multi-layer perceptron (MLP) architecture over CNNs. However, Lee et al.20,21 reported superior performances with FCNs compared to the others.

We must acknowledge that the accuracy levels reported for methods utilizing the ICSD and MP databases might be somewhat overstated, with each approach purportedly achieving test accuracies above 90%. This notably high performance in symmetry classification could stem from the extensive redundancy within the materials dataset. Despite this, addressing this issue is practically challenging, even with the adoption of advanced validation strategies like leave-one-out cross-validation79. The only feasible approach appears to be manually identifying and removing highly redundant (duplicated) entries on a subjective basis when partitioning the dataset into training, validation, and test sets. A clear and objective criterion for measuring similarity between entries is absent. Moreover, it is challenging to identify a representative structure among those deemed similar, even if such a criterion were available. However, this overestimation due to redundancy might not significantly impact the evaluation of ML model performance, especially since our primary focus was on comparing the relative effectiveness of different ML models.

3D electron density data-driven DL approach

Traditionally, 1D or 2D diffraction patterns are represented in reciprocal space, an essential approach given that experimental methods such as X-ray, neutron, and electron diffraction rely on interpretation in reciprocal space. While reciprocal-space interpretation is practical, understanding crystal structures in real space, which avoids the need for forward and backward Fourier transformation, is often more intuitive. The historical preference for reciprocal space in crystal structure analysis tools and methodologies stems from a lack of high-resolution real-space imaging techniques. However, recent advances in experimental methods are enhancing the feasibility of real-space imaging. For example, Scanning Transmission Electron Microscopy (STEM) provides high-resolution real-space imaging, which is vital in materials science and nanotechnology. In contrast to traditional diffraction methods, which provide data in reciprocal space (spatial frequencies or wavevectors), STEM techniques make real-space imaging available, although they are currently limited to 2D. Real-space 3D imaging is not yet mature, but it may become available as current, still incomplete 3D technologies such as tomography-based 3D STEM imaging68,69,70 and 3D XRD imaging80 continue to advance. Once it does, a 3D-electron-density-driven DL approach promises to be highly beneficial, even though experimental 3D electron density data have yet to be realized.

Ziletti et al.40 were the first to use 3D crystal data as a descriptor, projecting it onto 2D reciprocal space to create a ‘two-dimensional diffraction fingerprint (DF)’, which served as an effective pseudo-3D descriptor for deep learning. However, DF remains within the realm of reciprocal space, akin to conventional 2D diffraction, and requires expensive analytical interpretation to reproduce the actual crystal structure from 2D DF data, despite its benefits such as size invariance and defect accommodation40. Building on the groundwork laid by Ziletti et al.40, Tsuruta et al.65 and Chiba et al.66 have also employed 3D crystal data as descriptors for representing crystal structures, specifically for application in DL approaches. In contrast to these previous works, which are based on discrete atomic positions, our approach treats electron density as a pseudo continuous field. Sparse 3D electron density in real space is thus introduced here as the first real-space crystal structure descriptor for large-scale DL. Although Zhao et al.63 and Saha et al.81 previously introduced small-scale 3D electron density data for DL applications, they utilized conventional 3D CNNs without incorporating sparsity. The electron density data were sourced through two pathways: one from DFT-calculated CHGCAR files in VASP82,83,84, obtained from the Materials Project (MP) database71, and the other from ICSD structure solutions72 combined with atomic scattering factors85. The ICSD solution data-based electron density was computed using our in-house Python code and the FullProf86 and VESTA87 software API.

The optimal DL model for effectively handling the 3D electron density descriptor in real space is the 3D convolutional neural network (CNN)88. However, using unit cells or supercells from every ICSD and MP entry as input for 3D CNNs introduces two major challenges. The first is the varying cell shapes across different crystal systems, and the second is the disparity in unit cell size between small and large periodicities. One solution is to use a large, uniformly shaped input volume unit, such as extracting a 20 × 20 × 20 Å³ cube from each ICSD and MP entry. This approach ensures consistency in volume and voxel count, and thereby meets the fixed-input requirement of traditional dense 3D CNN models. However, this method significantly increases computational overhead. A further issue is that a standardized input volume might encompass multiple unit cells for entries with smaller cells, while capturing only a fraction of a cell for entries with larger cells. This discrepancy can disrupt the seamless integration of lattice periodicity information during the DL model training process. Recognizing the limitations of traditional dense 3D CNNs, we have adopted PointNet59 and Sparse 3D Convolutional Neural Networks (Sparse 3D CNNs)60,61,62 as more practical alternatives. The point cloud and sparse tensor data formats, optimized for use in PointNet and Sparse 3D CNN models, adeptly tackle two major issues: the varying shapes and sizes of unit cells and the computational challenge. Derived from downsampling the originally dense electron density data, these data formats inherently become sparser, which leads to more efficient management of computational load.
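The two drawbacks of a fixed input cube can be made concrete with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the unit cells are assumed to be cubic for simplicity, and the 0.06 Å spacing is the dense grid size quoted elsewhere in this section.

```python
def cells_per_cube(cell_edge, cube_edge=20.0):
    """Number of cubic unit cells of edge `cell_edge` (in angstrom) that fit
    into a cube of edge `cube_edge` by volume (assumption: cubic cells)."""
    return (cube_edge / cell_edge) ** 3


def dense_voxel_count(cube_edge=20.0, grid=0.06):
    """Voxels in a dense cube sampled at spacing `grid` (in angstrom)."""
    n_per_edge = int(cube_edge / grid)
    return n_per_edge ** 3


# A small cell is repeated many times inside the cube ...
print("4 A cell :", cells_per_cube(4.0), "cells per cube")
# ... whereas a large cell is only partially captured.
print("25 A cell:", round(cells_per_cube(25.0), 3), "cells per cube")
# Every entry costs the same number of dense voxels regardless of cell size.
print("dense voxels per entry:", dense_voxel_count())
```

Under these assumptions, the dense cube holds tens of millions of voxels per entry regardless of how small the underlying unit cell is, which is the overhead that the point cloud and sparse tensor formats are meant to avoid.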

Figure 1 schematically illustrates a summary of the proposed 3D-electron-density-driven DL approach, and the schematic representations of both PointNet and Sparse 3D CNN are depicted in Fig. 2, with their detailed architectures illustrated in Supplementary Fig. 2. PointNet is adept at analyzing point cloud data, which captures 3D objects as discrete points in space, extracting features and patterns crucial for tasks like object recognition, classification, and segmentation. Its utility is especially notable in computer vision and 3D scene understanding, allowing for direct interaction with 3D spatial data. A key advantage of PointNet is its size invariance; it can process point clouds irrespective of the points’ order or quantity. The size invariance is achieved through a symmetric function that assesses each point independently, facilitating efficient handling of point clouds of varying sizes. This characteristic is particularly beneficial for tasks like symmetry classification in crystal structure analysis, as it allows PointNet to robustly adapt to the varying number of points typically seen in electron density data for ICSD and MP entries. Thus, PointNet can handle variously sized unit cells from all ICSD and MP entries without needing the standardized input volume unit required by traditional dense 3D CNN models.
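The order and size invariance described above follows from applying one shared function to every point and then aggregating with a symmetric operation such as an element-wise maximum. The sketch below shows only this principle: `point_features` is a fixed, hypothetical stand-in for the learned per-point MLP of PointNet, and the points are assumed to be (x, y, z, density) tuples.

```python
import random


def point_features(point):
    """Hypothetical shared per-point map: (x, y, z, rho) -> feature vector.

    In PointNet this is a learned MLP; it is fixed here for illustration.
    """
    x, y, z, rho = point
    return [rho, rho * x, rho * y, rho * z, x * x + y * y + z * z]


def global_feature(points):
    """Symmetric aggregation: element-wise max over all per-point features."""
    feats = [point_features(p) for p in points]
    return [max(column) for column in zip(*feats)]


cloud = [(0.0, 0.0, 0.0, 1.2), (0.5, 0.5, 0.0, 0.3), (0.25, 0.75, 0.5, 0.8)]

# Reordering the points does not change the global feature.
shuffled = cloud[:]
random.Random(0).shuffle(shuffled)
assert global_feature(cloud) == global_feature(shuffled)

# A cloud with more points yields a feature vector of the same length.
bigger = cloud + [(0.1, 0.2, 0.3, 0.05)]
assert len(global_feature(bigger)) == len(global_feature(cloud))
```

Because the maximum is taken over however many points are supplied, unit cells that yield different numbers of sampled points map to feature vectors of identical length.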

Fig. 1: Schematic illustration of symmetry classification and property regression using 3D electron density.

The two leftmost cells in the image illustrate the methods of generating 3D electron density. The red section represents 3D electron density from the MP database, derived from ab-initio DFT calculations, while the blue sections depict 3D electron density from the ICSD, generated using structure factors determined from experimental structure analyses. The middle gray cells illustrate typical examples of point cloud and sparse tensor formats, which are transformed from dense voxel-based electron density data and utilized in PointNet and Sparse 3D CNN, respectively. The DL model illustration demonstrates the use of either point cloud or sparse tensor data for symmetry classification, aiming to predict 7 crystal systems, 101 extinction groups, and 230 space groups. Notably, Sparse 3D CNN is exclusively used for property regression, specifically to predict formation energy (Ef) and band gap (Eg). The red and blue arrows in the image indicate the source of the 3D electron density data, with red for MP and blue for ICSD. Dashed and solid lines represent the PointNet and Sparse 3D CNN models, respectively.

Fig. 2: Overview of PointNet and 3D Sparse CNN architectures.

a PointNet: Visualization of the PointNet architecture, focusing on the symmetric function. b Sparse 3D CNN Process: This illustration delineates the sparse convolutional process, beginning with rulebook creation and followed by execution of the sparse convolution. The PointNet and Sparse 3D CNN architectures are visualized in Supplementary Fig. 2, and more detailed descriptions of how PointNet and Sparse 3D CNN function are presented in Supplementary Fig. 3.

Sparse 3D CNNs excel in analyzing 3D data, especially when much of it is sparse or empty60,61,62. These networks are adept at processing and extracting features from occupied segments of 3D volumetric grids (voxels), which makes them ideal for tasks like 3D object detection, segmentation, and scene understanding. By disregarding empty spaces and utilizing sparse tensors, Sparse 3D CNNs minimize computational overhead, proving efficient in scenarios where not all parts of the 3D space contain relevant information. To comprehend the benefits of Sparse 3D CNNs in analyzing 3D crystal structures, it is essential to first understand the downsampling methods. We sampled data to create sparse tensors from the original, densely voxelized 3D data with a grid size of 0.06 Å (CHGCAR data from MP and electron density data from ICSD), which are considered as being pseudo continuous.
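A minimal sketch of turning a dense voxel grid into a coordinate-plus-feature (sparse) representation is given below. It is not the sampling code used in this study: the toy Gaussian density, the function names, and the use of uniform random sampling with a fixed seed are all assumptions made for illustration.

```python
import math
import random


def toy_dense_density(n=8, sigma=1.5):
    """Hypothetical dense grid: a Gaussian blob centred in an n x n x n cell."""
    c = (n - 1) / 2
    return {
        (i, j, k): math.exp(
            -((i - c) ** 2 + (j - c) ** 2 + (k - c) ** 2) / (2 * sigma ** 2)
        )
        for i in range(n)
        for j in range(n)
        for k in range(n)
    }


def random_downsample(dense, n_points, seed=0):
    """Return (coords, feats) for `n_points` voxels drawn uniformly at random.

    Only the sampled voxels are stored; all other voxels are treated as empty.
    """
    rng = random.Random(seed)
    coords = sorted(rng.sample(sorted(dense), n_points))
    feats = [dense[c] for c in coords]
    return coords, feats


dense = toy_dense_density()
coords, feats = random_downsample(dense, n_points=64)
print(len(dense), "dense voxels ->", len(coords), "stored voxels")
```

The resulting pair of lists is the kind of input a sparse tensor wraps: integer voxel coordinates and the density value at each retained voxel.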

The posterior random sampling process results in a significant number of empty voxels in the resultant sparse tensor, which is clearly illustrated as ‘SparseTensorIn’ in Fig. 2b, wherein only one non-zero voxel (p1) is present. Another aspect of this approach is the introduction of a sufficiently large virtual spatial frame (sparse tensor dimension), as shown in Supplementary Fig. 4. This conceptual frame can accommodate the largest unit cell found in the ICSD and MP databases, allowing for a range of unit cell sizes. The virtual frame is not fixed but rather imaginary. The size of the virtual frame is not a parameter required for Sparse 3D CNN execution; it was introduced to emphasize the size independence of the Sparse 3D CNN execution. In reality, the size of the input 3D image is not limited, because the indices (vin and vout in Fig. 2b) restrict convolution to locations where the electron density is non-zero. The primary focus is on the non-zero elements indexed by their coordinates, ensuring that the overall size of the input grid does not significantly impact the processing. Sparse 3D CNNs can therefore handle inputs with different dimensions without uniform resizing or padding, and thereby efficiently accommodate diverse crystal periodicities (various unit cell sizes).
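The index-driven convolution described above can be sketched as follows. This is a simplified illustration rather than the implementation used in this study: it assumes a submanifold-style convolution (outputs only at active voxels), scalar features, and hypothetical function names; real sparse-convolution libraries use hashed indices and multi-channel weights.

```python
from itertools import product


def build_rulebook(coords, kernel_size=3):
    """Map each kernel offset to (input_index, output_index) pairs.

    Only pairs of active (non-zero) voxels are recorded, so the work scales
    with the number of active voxels, not with the extent of the grid.
    """
    index = {c: i for i, c in enumerate(coords)}
    half = kernel_size // 2
    rulebook = {}
    for offset in product(range(-half, half + 1), repeat=3):
        pairs = []
        for out_idx, (x, y, z) in enumerate(coords):
            src = (x + offset[0], y + offset[1], z + offset[2])
            if src in index:
                pairs.append((index[src], out_idx))
        if pairs:
            rulebook[offset] = pairs
    return rulebook


def sparse_conv(coords, feats, weights, kernel_size=3):
    """Scalar-feature sparse convolution; `weights` maps offset -> float."""
    out = [0.0] * len(coords)
    for offset, pairs in build_rulebook(coords, kernel_size).items():
        w = weights.get(offset, 0.0)
        for in_idx, out_idx in pairs:
            out[out_idx] += w * feats[in_idx]
    return out


# Two neighbouring voxels plus one far away: the distant coordinate adds no
# cost beyond its own entry, and no dense frame is ever allocated.
coords = [(0, 0, 0), (1, 0, 0), (500, 500, 500)]
feats = [1.0, 2.0, 3.0]
weights = {offset: 1.0 for offset in product(range(-1, 2), repeat=3)}
print(sparse_conv(coords, feats, weights))
```

In this toy case the rulebook contains only three offsets, because no other kernel offset connects two active voxels.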

In contrast to the fixed input volume of traditional dense 3D CNNs, this approach significantly reduces computational costs. Supplementary Fig. 4 showcases typical point cloud and sparse tensor data for PointNet and Sparse 3D CNN, featuring four different inorganic compounds with varying unit cell shapes and sizes. While Supplementary Fig. 4a and e present a large cubic unit cell, Supplementary Figs. 4b–d and f–h depict smaller unit cells with isotropic, rod-shaped, and plate-shaped structures. Besides the random downsampling described above, we incorporated a second, grid-based method to clarify the effect of sampling on the final DL performance. Grid-based sampling outperformed random sampling, slightly raising the test accuracies to 97.57%, 91.09%, and 90.55% for crystal system, extinction group, and space group classifications, respectively. Supplementary Figs. 5 and 6 and Supplementary Note 1 describe the details of both downsampling methods (random and grid-based sampling) adopted in this study.
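The exact sampling procedures are given in Supplementary Note 1 and are not reproduced here. As a generic illustration only, grid-based sampling can be thought of as keeping voxels at regular strides, which spreads the retained points evenly over the cell; the function name and the stride scheme below are assumptions, not the procedure of this study.

```python
def grid_downsample(shape, stride):
    """Return the voxel indices kept by regular strided sampling.

    Assumed scheme: keep a voxel when all three of its indices are multiples
    of `stride`, so the retained points form an evenly spaced sub-lattice.
    """
    nx, ny, nz = shape
    return [
        (i, j, k)
        for i in range(0, nx, stride)
        for j in range(0, ny, stride)
        for k in range(0, nz, stride)
    ]


kept = grid_downsample((6, 6, 6), stride=2)
print(len(kept), "of", 6 ** 3, "voxels kept")
```

Unlike random selection, this choice is deterministic and leaves no large unsampled gaps, which is one plausible reason a grid-based scheme could perform slightly better.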

DL model performances

Figure 3 and Supplementary Table 1 present the results of symmetry classification for both the MP and the ICSD datasets. The MP dataset, referred to as 120k_MP, comprises 122,689 entries (both real and virtual inorganic compounds), validated through DFT calculations. For the purpose of model training, the 120k_MP dataset was formatted into four distinct types to accommodate various models: coGN, FCN, PointNet, and Sparse 3D CNN. Specifically, the dataset was structured as graph-structured data for coGN, a 1D XRD pattern for FCN, a point cloud for PointNet, and a sparse tensor for Sparse 3D CNN. In terms of hold-out test accuracy for symmetry classification, the models were ranked as follows: Sparse 3D CNN > PointNet > FCN > coGN. The F1 scores mirrored this ranking, with Sparse 3D CNN being the highest, followed by PointNet, FCN, and coGN. The ICSD dataset, labeled 190k_ICSD, comprises 195,300 entries, most of which are experimentally realized inorganic compounds. Its test accuracies and F1 scores surpassed those obtained with the 120k_MP dataset. The model comparison for the 190k_ICSD dataset revealed a similar trend to that observed with the 120k_MP dataset, namely, Sparse 3D CNN > PointNet > FCN for test accuracy and F1 score.
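For reference, the two metrics can be computed from predicted and true labels as sketched below. The averaging scheme is an assumption: the macro-averaged F1 score is used here for illustration, and the function name and toy labels are hypothetical.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Top-1 accuracy and macro-averaged F1 (assumed averaging scheme)."""
    classes = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Toy imbalanced case: the rare class is always missed.
y_true = ["cubic", "cubic", "cubic", "triclinic"]
y_pred = ["cubic", "cubic", "cubic", "cubic"]
acc, f1 = accuracy_and_macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.2f}, macro F1 = {f1:.2f}")
```

In a class-imbalanced setting such as space group classification, the two metrics can diverge, which is why both are reported.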

Fig. 3: Comparative performance metrics for crystallographic symmetry classification using DL Models.

This figure presents bar plots that compare the performance of coGN, FCN, PointNet, and Sparse 3D CNN models in three symmetry classification categories: crystal system (left), extinction group (middle), and space group (right). For each category and model, the broader bars represent top-1 hold-out test accuracy, while the overlaid lighter bars indicate the F1 score, the harmonic mean of precision and recall on the test set. a Hold-out test results obtained using the 120k_MP dataset. b Hold-out test results obtained using the 190k_ICSD dataset. Note: The results for the coGN model with the 190k_ICSD dataset are excluded due to compatibility issues between the model and the dataset.

Notably, coGN was not utilized for symmetry classification in the 190k_ICSD dataset for convenience. The ICSD comprises a significant number of real-world inorganic compounds, with a large portion featuring partially occupied (disordered) structures; approximately 120,000 of the around 200,000 ICSD entries exhibit this characteristic72. The node features in coGN encompass various atomic properties, including atomic number, atomic mass, atomic radius, electronegativity, ionization energy, and oxidation state. When processing a partially occupied structure, coGN necessitates the assignment of a weighted average for each attribute to every shared atomic position (node) in the graph. However, this approach of averaging node attributes may seem inadequate, despite the pioneering efforts of Chen et al.56, who addressed disorder in their multi-fidelity graph network by weighted-averaging learned elemental embeddings, albeit for a limited set (278) of disordered inorganic materials. More critically, in terms of symmetry classification, addressing the disorder issue may not be beneficial, given that coGN’s performance is inferior to other 3D electron density-driven DL models. Since coGN is specifically tailored for the MP dataset, which exclusively contains fully occupied structures67, applying the ICSD dataset to coGN would be unsuitable.
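The occupancy-weighted averaging of node attributes mentioned above can be sketched as follows. The helper and the two-attribute table are hypothetical and are not part of coGN; normalizing by the total occupancy is an additional assumption (relevant when a site also carries vacancies). The element values are standard atomic numbers and Pauling electronegativities.

```python
def site_average(occupants, table):
    """Occupancy-weighted mean of each attribute for one shared site.

    occupants: {element: occupancy}; table: {element: {attribute: value}}.
    Assumption: weights are normalized by the total occupancy of the site.
    """
    total = sum(occupants.values())
    attributes = next(iter(table.values())).keys()
    return {
        a: sum(occ * table[el][a] for el, occ in occupants.items()) / total
        for a in attributes
    }


ELEMENT_TABLE = {
    "Fe": {"atomic_number": 26, "electronegativity": 1.83},
    "Ni": {"atomic_number": 28, "electronegativity": 1.91},
}

# A site shared by 70% Fe and 30% Ni.
mixed = site_average({"Fe": 0.7, "Ni": 0.3}, ELEMENT_TABLE)
print(mixed)
```

The averaged atomic number does not correspond to any real element, which illustrates why such averaged node attributes may be an imperfect description of a disordered site.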

Sparse 3D CNN, when trained on the 190k_ICSD dataset, achieved a SOTA test accuracy of 97.28% for crystal system classification. This result is significant as it was obtained from the largest dataset, encompassing a broad range of real-world inorganic materials, without any artificial pruning. This level of test accuracy, especially considering the dataset’s size and generalizability, is more notable than the previous SOTA record of 94.99% reported by Park et al.13, which was achieved after the removal of many low symmetry entries. Moreover, the Sparse 3D CNN achieved SOTA test accuracies in both extinction and space group classifications, surpassing 90%. This marks a significant milestone, as neither extinction nor space group classification has previously attained approximately 90% accuracy using any XRD-driven DL models. Specifically, the Sparse 3D CNN reached an accuracy of 90.77% in extinction group classification and 90.10% in space group classification.

With the symmetry classification complete, the focus shifts to a deeper interpretation of the results. One key aspect is the comparison between the 190k_ICSD and 120k_MP datasets. Contrary to expectations, every model trained on the 190k_ICSD dataset outperformed its counterpart trained on the 120k_MP dataset. This disparity might be mainly due to the different symmetry distributions of the datasets. The 120k_MP dataset has a distribution skewed more towards lower symmetry than the 190k_ICSD dataset, as depicted in Supplementary Fig. 7. In addition, the theoretical (virtual) structure entries in the 120k_MP dataset could have impacted test accuracy negatively. The 190k_ICSD dataset contains a significantly smaller proportion of theoretical entries (for instance, 922 virtual entries among ICSD collection codes 1 to 100,000) compared to the MP dataset, which has 86,974 virtual entries out of 122,689 in total. Most entries in the 190k_ICSD are experimentally realized, adding to their reliability. According to the 2023 ICSD data72, 91,099 entries are validated by single crystal XRD and 114,150 by powder XRD among the experimental structures in ICSD. This suggests that nearly half of the ICSD data are confirmed by highly reliable single crystal XRD, devoid of peak overlap complications. More importantly, the 120k_MP dataset provides the electron density of primitive cells, whereas the 190k_ICSD dataset provides that of Bravais-lattice cells, which contain more structural information. These considerations help rationalize the better performance obtained with the ICSD dataset than with the MP dataset.

In the 120k_MP dataset, all theoretical structures are assumed to be fully occupied. However, this assumption might be slightly bold when considering the real inorganic structures cataloged in the ICSD. The majority of the 86,974 virtual entries from the 120k_MP dataset feature somewhat unrealistic, fully-occupied, ordered structures that are not found in the 190k_ICSD dataset. Theoretically, these structures, if realized in the real world, could form partially occupied structures, which typically exhibit higher symmetry than their fully occupied counterparts. This potential discrepancy could explain why the 120k_MP dataset appears skewed toward lower symmetry compared to the ICSD dataset, as illustrated in Supplementary Fig. 7. Supporting this observation, the XRD-driven DL approach by Suzuki et al.16 and the subsequent reconfirmation by Lee et al.20,21 indicate that misclassified entries predominantly fall into lower symmetries, such as orthorhombic, monoclinic, and triclinic categories.

The second issue concerns the superiority of PointNet and Sparse 3D CNN over FCN in symmetry classification. In general, DL models benefit more from training on 3D electron density data in real space than from training on 1D powder XRD data, because 3D electron density inherently provides more information about the crystal structure. The enhanced performance of PointNet and Sparse 3D CNN is understandable when considering that 1D powder XRD is derived from 3D electron density and that some information is inevitably lost during the contraction from 3D to 1D data. A well-known consequence of this contraction is peak overlap, which complicates the data analysis process and likely contributes to the reduction in FCN’s accuracy.

The third issue to address is the relatively poor performance of coGN compared to both PointNet and Sparse 3D CNN, where it also lags far behind FCN. coGN, an advanced iteration of CGCNN55, was designed to better represent crystal structures. While CGCNN accounts for the periodicity of the unit cell graph, emphasizing translational symmetry alone, coGN employs an asymmetric unit graph representation that considers all symmetries of a crystal structure. This approach placed coGN at the top rank in the property regression benchmark, as shown in Supplementary Fig. 1. Previous comparative analyses between CGCNN and FCN revealed that CGCNN excelled over FCN in property regression but fell short in symmetry classification74. This pattern persists in comparisons between coGN and 3D electron density-based DL models like PointNet and Sparse 3D CNN. Despite significant enhancements in coGN’s ability to analyze symmetry, thereby improving its property regression capabilities, it still cannot outperform FCN, PointNet, and Sparse 3D CNN in symmetry classification. The reasons behind coGN’s shortcomings in symmetry classification remain unclear, in spite of its substantial advancements in symmetry consideration.

The regression capabilities of coGN were impressive, largely due to the traditional descriptors used for node and edge features in graph-type networks. These descriptors include atomic number, atomic mass, atomic radius, electronegativity, ionization energy, oxidation state, bond length, and others. However, coGN was less effective in capturing symmetry. We believe that the success of graph-based DL models in property regression is primarily attributed to the incorporation of such human-selected features, specifically traditional descriptors that represent atomic and bonding characteristics for node and edge features. This is a key factor in property regression, rather than the graph representation of crystalline structure. Evidence supporting this view is that even a simple sparse composition vector-driven MLP demonstrated nearly equivalent performance to CGCNN in band gap regression20. While coGN’s performance has significantly improved, it appears that much of its enhanced capability can be credited to the selection of traditional material descriptors by domain experts playing a major role in property regression, rather than the graph structure itself.

The fourth issue addresses the marginally inferior performance of PointNet in symmetry classification compared to Sparse 3D CNN, despite its significant advantage over FCN and coGN. PointNet typically processes point cloud data derived from original dense voxel-type data through random downsampling. These point clouds consist of 5000 to 30,000 points, with the number varying with the lattice size. Each point comprises 3D coordinates and associated feature values, with electron density being the primary feature. The size of the point cloud, dictated by the number of points, might be insufficient for capturing detailed local electron density, potentially leading to suboptimal performance of PointNet. Evidence suggests that increasing the sample size could enhance PointNet’s performance, as indicated by the plot in Supplementary Fig. 8, showing PointNet performance against point cloud size. However, pursuing larger point clouds may be impractical, considering that Sparse 3D CNN achieves superior results at the same point cloud size with lower computational demands. Therefore, opting for Sparse 3D CNN, instead of a larger-sample-based PointNet, appears to be a more efficient approach for achieving improved test accuracy in symmetry classification.
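The random downsampling of a dense voxel grid into a point cloud can be illustrated as follows. This is a minimal sketch rather than the procedure actually used: it assumes uniform sampling without replacement, uses grid indices as coordinates, and runs on a synthetic grid. The function name `voxels_to_point_cloud` is hypothetical.

```python
import numpy as np

def voxels_to_point_cloud(density, n_points, rng=None):
    """Randomly downsample a dense 3D grid into an (n_points, 4) point cloud.

    Each row is (i, j, k, density) in grid-index coordinates.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_points = min(n_points, density.size)
    # Assumption: voxels are drawn uniformly at random without replacement.
    chosen = rng.choice(density.size, size=n_points, replace=False)
    i, j, k = np.unravel_index(chosen, density.shape)
    coords = np.stack([i, j, k], axis=1).astype(np.float32)
    feats = density[i, j, k].astype(np.float32)[:, None]
    return np.concatenate([coords, feats], axis=1)

# Synthetic stand-in for an electron density grid (not real data).
grid = np.random.default_rng(0).random((48, 48, 48))
cloud = voxels_to_point_cloud(grid, n_points=5000, rng=np.random.default_rng(1))
print(cloud.shape)
```

With 5000 points drawn from a 48 × 48 × 48 grid, fewer than 5% of the voxels survive, which gives a sense of how much local detail a small point cloud can miss.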

Supplementary Table 2 presents the property regression results for band gap (Eg) and formation energy (Ef) using only the 120k_MP dataset, as the 190k_ICSD dataset lacks labels for Eg and Ef. In these regressions, coGN demonstrated superior performance, while Sparse 3D CNN exhibited a higher mean absolute error (MAE). Notably, when Sparse 3D CNN used only electron density as a feature, the MAE for Eg regression was even higher. The MAE values for Eg and Ef regression were reduced considerably when using a four-feature sparse tensor, which included total electron density, positive spin density, negative spin density, and the difference between the two spin densities (magnetization density), as evidenced in Supplementary Table 3. The inclusion of these spin-related features significantly enhanced the regression results, underscoring the importance of spin considerations in relation to Eg and Ef. However, coGN still outperformed Sparse 3D CNN even with the four-feature sparse tensor. It is also noted that the MAE values for coGN-based regression in Supplementary Table 2 are higher (worse) than those in the benchmark58, which can be attributed to differences between the 120k_MP dataset and the dataset used in the benchmark.
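To make the four-feature input and the MAE metric concrete, the following sketch assembles the four channels from spin-up and spin-down density grids and evaluates a mean absolute error. It is illustrative only: the channel ordering and the function names `four_feature_stack` and `mean_absolute_error` are assumptions, and the arrays are toy values rather than real densities.

```python
import numpy as np

def four_feature_stack(spin_up, spin_down):
    """Stack total, spin-up, spin-down, and magnetization densities.

    The result has shape (..., 4); the channel order is an assumption.
    """
    total = spin_up + spin_down            # total electron density
    magnetization = spin_up - spin_down    # difference of the two spin densities
    return np.stack([total, spin_up, spin_down, magnetization], axis=-1)

def mean_absolute_error(y_true, y_pred):
    """MAE = mean of |y_true - y_pred| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy example: uniform spin densities on a 2 x 2 x 2 grid.
up = np.full((2, 2, 2), 0.75)
down = np.full((2, 2, 2), 0.25)
features = four_feature_stack(up, down)
print(features.shape)

# Toy band-gap targets and predictions (eV), not results from the study.
print(mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))
```

A non-magnetic material has identical spin-up and spin-down densities, so its magnetization channel is zero everywhere; the extra channels carry information only for spin-polarized entries.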

The 3D-electron-density-driven DL approach, despite its theoretical advantages, faces practical challenges. Obtaining 3D electron density data requires high-cost computations and is not directly measurable through experimental methods; it relies on crystal structure solution data, which includes atomic positions, occupancies, thermal vibration tensors, as well as exact symmetry (space group) and lattice parameters. Typically, the crystal structure solution for inorganic materials is derived from powder or single crystal XRDs. This practicality issue also applies to other 3D-data-driven approaches like those by Tsuruta et al.66 and Ziletti et al.40, which necessitate known crystal structure solutions to produce training data to be fed to their DL models. Unless the exact crystal structure solution is fully known, the input data for DL models cannot be obtained, as there is currently no experimental means to directly measure them.

In the current landscape, XRD data-driven DL approaches, such as FCN, hold more practical value in experimental materials science. FCN can predict the exact symmetry of an unknown material using its experimental XRD pattern, which is a task easily achievable in standard materials laboratories. Despite its slightly inferior performance compared to Sparse 3D CNN, FCN’s role in materials analysis and discovery remains crucial. The excellent performance of Sparse 3D CNN serves as a potential upper limit for what FCN might achieve in the future. Nevertheless, there is optimism that direct measurement of 3D electron density in experimental settings will be feasible sometime in the future. This anticipated advancement would underscore the practical advantages of our 3D-electron-density-driven DL approach over other methods that rely solely on structure solution data.

Failure analysis for the symmetry classification

A failure analysis using confusion tables, as depicted in Fig. 4, was conducted. Focusing on the 190k_ICSD dataset (Fig. 4c), the confusion table (or confusion matrix) for FCN reveals significant insights into the challenges DL models face in classifying low symmetry structures. Suzuki et al.16 initially highlighted that DL model predictability deteriorates in low symmetry cases, a finding later corroborated by Lee et al.20,21. However, the underlying cause of this deterioration remains unclear. Figure 4c demonstrates that FCN struggles with accuracy and exhibits high mishit rates in the off-diagonal areas, particularly in the low symmetry region (Triclinic, Monoclinic, and Orthorhombic; the upper-left zone of the confusion table), a phenomenon termed the ‘Seattle zone’21. Notably, neither PointNet nor Sparse 3D CNN exhibits this Seattle zone, as evidenced in Fig. 4c. The mishit rates for Sparse 3D CNN-driven symmetry classification are consistently low, ranging from 0 to 0.04, regardless of symmetry, in stark contrast to FCN’s high off-diagonal mishit rates in the ‘Seattle zone’.
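The mishit rates discussed here can be read off a row-normalized confusion matrix. The sketch below shows one way to compute such a matrix and the largest off-diagonal rate within the low symmetry block. It assumes rows are true labels, columns are predictions, and classes are ordered from triclinic to cubic; the normalization used for Fig. 4 is not specified in the text, and the labels here are toy values.

```python
import numpy as np

# Assumed class order, low symmetry first (so the low symmetry block is upper-left).
CRYSTAL_SYSTEMS = ["triclinic", "monoclinic", "orthorhombic", "tetragonal",
                   "trigonal", "hexagonal", "cubic"]

def confusion_rates(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry [t, p] is P(pred = p | true = t)."""
    counts = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(counts, (y_true, y_pred), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows without samples are left at zero instead of dividing by zero.
    return np.divide(counts, row_sums,
                     out=np.zeros(counts.shape, dtype=float),
                     where=row_sums > 0)

def max_offdiagonal(rates, classes):
    """Largest mishit rate among the given classes (diagonal excluded)."""
    block = rates[np.ix_(classes, classes)].copy()
    np.fill_diagonal(block, 0.0)
    return float(block.max())

# Toy labels: triclinic and monoclinic are confused with each other.
y_true = np.array([0, 0, 1, 1, 2, 2, 6, 6])
y_pred = np.array([0, 1, 1, 0, 2, 2, 6, 6])
rates = confusion_rates(y_true, y_pred, n_classes=len(CRYSTAL_SYSTEMS))
print(max_offdiagonal(rates, classes=[0, 1, 2]))
```

In this toy case half of the triclinic samples are predicted as monoclinic and vice versa, so the low symmetry block shows a mishit rate of 0.5, whereas a model without a Seattle zone would keep every off-diagonal entry near zero.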

Fig. 4: Classification accuracy of crystal systems by DL models represented through confusion matrices.
a This section displays representations of four distinct DL models used in symmetry classification: coGN, FCN, PointNet, and Sparse 3D CNN. Each model is symbolized by a central concept illustration. b Confusion matrices showing the classification of 7 crystal systems using the 120k_MP dataset, corresponding to the DL models presented. c Confusion matrices for crystal system classification utilizing the 190k_ICSD dataset. The ‘Seattle zone’ is highlighted with a yellow square in the confusion matrix for the FCN model trained on the 190k_ICSD dataset. Confusion matrices for extinction group and space group classifications are included in Supplementary Fig. 9. A dashed outline indicates the absence of data for the coGN model with the 190k_ICSD dataset, due to an incompatibility issue between the model and the dataset.

Although the Seattle zone appears less pronounced in the 120k_MP dataset compared to the 190k_ICSD dataset, this observation is somewhat misleading. The overall accuracy for the 120k_MP dataset is significantly lower than that for the 190k_ICSD dataset, indicating that results from the MP dataset are equally subpar, irrespective of symmetry. Interestingly, coGN also exhibits the Seattle zone, albeit with an overall decrease in accuracy. The Seattle zone, or the observed accuracy deterioration in low symmetry regions, seems not to be a result of DL model limitations but rather an issue stemming from the data itself. Crystals can be represented as either nuclei arrangements or electron clouds in real space, but contemporary crystallographic analyses predominantly rely on diffractions (electron, neutron, or X-ray) projected onto 2D and 1D reciprocal spaces. Powder XRD, a 1D diffracted projection, inherently suffers from significant information loss, notably the peak-overlap complication. The original 3D electron density data in real space offers a more comprehensive representation of crystals and is thus ideal for DL approaches. The absence of the Seattle zone in Sparse 3D CNN highlights the robustness of 3D electron density data, confirming it as the superior choice for symmetry classification training data. The limitations of 1D powder XRD and graph-type data in accurately representing the structure of inorganic materials become evident as both FCN and coGN exhibit the Seattle zone. The Seattle zone issue, it should be emphasized, stems from data incompleteness, not from any inherent shortcomings in the DL models themselves.

Discussion

The use of 3D electron density data in real space proved to be more effective for DL-driven symmetry classification of inorganic materials compared to traditional crystallographic structural data, such as 1D XRD and graph-type data. However, traditional 3D CNNs struggled with training on immensely dense 3D charge density data due to computational limitations. To alleviate such computational burdens, we introduced sparsity to the 3D electron density data through downsampling, resulting in sparse formats like point clouds and sparse tensors. Consequently, we utilized PointNet and Sparse 3D CNN for symmetry classification, leveraging these sparse 3D electron density data representations.

The combination of the sparse 3D electron density dataset with the Sparse 3D CNN model led to the highest hold-out test accuracy for symmetry classification, achieving 97.28% for crystal system classification, 90.77% for extinction group classification, and 90.10% for space group classification. These results set SOTA records and mark the first time accuracies have surpassed 90% for 230-space-group classification. PointNet performed comparably to the Sparse 3D CNN. The FCN, trained on 1D XRD data, and the coGN, trained on graph-type data, followed in that order.

A comparative analysis of the 190k_ICSD and 120k_MP datasets showed that DL models trained on the 190k_ICSD dataset consistently surpassed those trained on the 120k_MP dataset in performance. This difference could be attributed to the 120k_MP dataset’s traits such as skewed data distribution favoring lower symmetry and a greater prevalence of unrealistic, fully-occupied entries.

In the confusion matrix for earlier 1D XRD- and graph-based DL models, a region known as the “Seattle zone,” characterized by high misclassification rates, was particularly evident in the low symmetry area. However, this was not the case with the 3D electron density-based DL approach. This indicates that difficulties in classifying materials with low symmetry are more a result of data limitations than inherent flaws in the DL models. Traditional structural data, such as 1D powder XRD and graph-type representations, often struggle to fully capture the complex features of crystal structures. This shortcoming is primarily due to the loss of structural information that inevitably arises when original 3D crystal data are condensed into these simpler formats. Nonetheless, graph-based DL models such as CGCNN and coGN are highly effective in property regression, likely benefiting from the inclusion of human-selected descriptors traditionally employed in materials science.

Methods

Electron density data preparation

We utilized databases from MP71 and ICSD72, preparing three different types of data: 1D powder XRD data, graph-type data, and 3D electron density data. The MP database comprises around 150,000 material entries, but charge density data are available for only 122,689 of these, which are the focus of this study. MP data include structure solutions along with bandgap energy (Eg) and formation energy (Ef), facilitating both classification and regression modeling. However, the ICSD database, devoid of material property data, was used solely for symmetry classification modeling.

The 3D electron density data from the MP database were derived from charge density data obtained via DFT calculations using VASP82,83,84. These data were transformed into voxel format using the materials’ lattice parameters and Cartesian coordinates (x, y, z), along with the total charge density (spin up + spin down) as a feature, creating a point cloud with dimensions N × (x, y, z, features) as required by the 3D deep learning models in this study. For property regression modeling with the 3D electron density data from the MP database, single-feature data using only the total charge density proved ineffective. Four-feature data incorporating total charge density, magnetization density (spin up - spin down), spin up, and spin down data were therefore introduced. For the ICSD data, where, unlike in the MP, 3D charge density is absent and only structure solution data are available, we used VESTA86 software to generate electron density voxel data for each material. These data were then converted into 3D point cloud data, similar to the approach for MP data. Lattice parameters were based on the primitive cell for MP data and on the conventional cell (Bravais lattice) for ICSD data. Further details on the 3D electron density data preparation are available in Supplementary Notes 2, 3, and 4.
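The conversion from a voxel grid plus lattice parameters to an N × (x, y, z, feature) point cloud can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' pipeline: the lattice vectors are built with the common convention of placing a along x and b in the xy plane, each voxel is placed at the fractional position of its grid index, and the names `lattice_matrix` and `grid_to_point_cloud` are hypothetical.

```python
import numpy as np

def lattice_matrix(a, b, c, alpha, beta, gamma):
    """Rows are the lattice vectors in Cartesian coordinates (angles in degrees).

    Convention (an assumption): a along x, b in the xy plane.
    """
    al, be, ga = np.radians([alpha, beta, gamma])
    cx = c * np.cos(be)
    cy = c * (np.cos(al) - np.cos(be) * np.cos(ga)) / np.sin(ga)
    cz = np.sqrt(c**2 - cx**2 - cy**2)
    return np.array([[a, 0.0, 0.0],
                     [b * np.cos(ga), b * np.sin(ga), 0.0],
                     [cx, cy, cz]])

def grid_to_point_cloud(density, lattice):
    """Turn an (nx, ny, nz) grid into an N x 4 array of (x, y, z, density)."""
    nx, ny, nz = density.shape
    # Fractional coordinates of every voxel, in the same C order as the grid.
    frac = np.stack(np.meshgrid(np.arange(nx) / nx,
                                np.arange(ny) / ny,
                                np.arange(nz) / nz,
                                indexing="ij"), axis=-1).reshape(-1, 3)
    cart = frac @ lattice  # fractional -> Cartesian
    return np.concatenate([cart, density.reshape(-1, 1)], axis=1)

# Toy example: a 4 angstrom cubic cell sampled on a 2 x 2 x 2 grid.
cubic = lattice_matrix(4.0, 4.0, 4.0, 90.0, 90.0, 90.0)
toy_density = np.arange(8.0).reshape(2, 2, 2)
points = grid_to_point_cloud(toy_density, cubic)
print(points.shape)
```

The same function applies whether the lattice describes a primitive cell or a conventional cell; only the lattice matrix and the grid change, which is where the MP and ICSD preparations differ.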

For 1D XRD pattern data in the FCN approach, CIFs from each database were converted into XRD peak data using Fullprof Suite87, following the methodology of Park et al.13. Graph-type data preparation was limited to fully occupied crystal structures, thus only MP data were used. For this, both the CGCNN55, the inaugural graph-type DL approach for materials, and coGN57, the latest crystal graph neural network technique, were referenced.

DL model training

For 1D powder XRD data, we employed the FCN architecture with 1D convolution (Supplementary Fig. 10), as implemented by Lee et al.20,21, for classification and regression training across both MP and ICSD databases. For graph-type crystal data, the training for coGN and CGCNN models was limited to MP data for classification and regression purposes. The architectures of crystal graph-based DL models (CGCNN and coGN) were adapted from previous literature55,57. For 3D electron density data, we focused on two models capable of processing this data type: PointNet, which serves as a baseline for initial 3D point cloud modeling, and Sparse 3D CNN, designed to address the limitations of PointNet. We trained PointNet and Sparse 3D CNN models on MP data for both classification and regression, allowing for a performance comparison across all DL models, including CGCNN, coGN, FCN, PointNet, and Sparse 3D CNN. For ICSD data, which lack material property data, training with PointNet and Sparse 3D CNN was exclusively focused on symmetry classification, allowing for a performance comparison with the FCN model trained on 1D XRD peak data. The detailed architectures and hyperparameters of the PointNet and Sparse 3D CNN models are presented in Supplementary Fig. 2. Mesh enumeration was used for hyperparameter optimization, with details available in Supplementary Table 4. Data were split into 80% for training, 10% for validation, and 10% for testing across all DL models. The evolution of training and validation performance for PointNet and Sparse 3D CNN as a function of epoch is depicted in Supplementary Fig. 11.
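An 80/10/10 split of this kind can be produced by shuffling the sample indices once and slicing. The sketch below is a minimal illustration: it assumes a plain random (non-stratified) split with a fixed seed, since the text does not say whether the split was stratified by class, and the function name `split_indices` is hypothetical.

```python
import numpy as np

def split_indices(n_samples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle indices once and cut them into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(fractions[0] * n_samples)
    n_val = int(fractions[1] * n_samples)
    # The test part takes whatever remains, so no sample is dropped.
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])

# Example with the number of MP entries quoted in the text.
train_idx, val_idx, test_idx = split_indices(122_689)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed keeps the hold-out test set identical across models, which matters when accuracies from different architectures are compared on the same data.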