Abstract
Three Machine Learning algorithms, namely Random Forest, Neural Networks and Extreme Gradient Boosting, were applied to recognize the shape and surface structure of diamond nanoparticles from powder diffraction data. The algorithms were trained to recognize three types of shapes: 1D rods, 2D plates and 3D superspheres and, in the case of plate-like shapes, two types of (111) surfaces, with either one or three dangling bonds per surface atom. The classifiers’ training was based on structure functions S(Q) of the nanograin models obtained by Molecular Dynamics simulations. The software tools for model building, diffraction data calculations, and the procedure of grain shape and surface classification are given. It is shown that the Random Forest, Neural Networks and Extreme Gradient Boosting classifiers all recognize the shape and surface structure of nanodiamonds with a low number of misclassifications. The derived classifiers were applied to a series of experimental diffraction patterns of diamond nanoparticles with sizes from 1.2 to 3.3 nm. It is shown that ML classification algorithms reproduce very well the results obtained for those samples by real space diffraction data analysis, namely Pair Distribution Function. ML studies prove that the dominating shape of adamantane-synthesized nanodiamonds is a plate terminated by (111) surfaces with 3 dangling bonds of carbon atoms.
Introduction
The diffraction data analysis of nanocrystalline materials must be supported by dedicated crystallographic software1,2. The reason is that the number of degrees of freedom of the surface atoms increases as the grain size decreases, which has a pronounced effect on the arrangement of nanograin atoms in comparison to the bulk material. As a consequence, atomic positions in nanocrystals deviate from those of a perfect crystal lattice. The origin of that behavior is the appearance of surface-induced strains that penetrate a considerable fraction of the nanocrystals’ volume.
Complete information on the size, shape and atomic structure of nanocrystals can be derived from diffraction data only by application of sophisticated procedures involving computation of model powder patterns using the Debye scattering equation by software specifically dedicated to nanocrystalline materials, e.g.3,4. The analysis can be done using either the diffraction pattern, e.g.4,5, or the Pair Distribution Function, e.g.6,7. In6,7 it is reported that the lattice of nanocrystals 2 nm in size and smaller is considerably different from that of larger/bulk crystals. In the past we addressed this issue by using models of nanocrystals relaxed by Molecular Dynamics simulations and confirmed a good agreement between experimental and simulated data for CdSe8, diamond9,10,11, and SiC12,13. Yet, this kind of analysis is very time consuming and can be used only occasionally. In consequence, it has little chance to become a standard method recommended for structural characterization of nanomaterials.
The breakthrough in information processing methods made in the last decade came with the introduction of Machine Learning (ML) technique. In the present paper, we demonstrate results of application of the ML technique to the elaboration of diffraction data of nanocrystals in order to learn about nanoparticle shapes and surfaces. The superiority of ML over other numeric techniques for statistical analysis comes from the ability to fully independently discover relationships between objects through the training stage.
Crucial for the ML algorithm performance is the amount of the data available for processing and training. When applying ML for the solution of crystallographic problems, those multiple datasets are either directly produced during the experiment or are obtained from existing databases14,15. In the so-called serial crystallography, multiple “single shot” diffraction patterns are collected at the pulsed radiation source, and the ML technique is a method of choice for data categorization and the aggregation of meaningful information16,17,18. For rapid determination of crystal symmetry or even indexing of the patterns of unknown substances by ML classifiers, a suitable training dataset may be pulled from over a million identified crystal structures stored in open and commercial databases19,20,21. A comprehensive list of the recent works on the application of ML in classical crystallography can be found in the review articles14,22,23.
Diffraction on nanocrystals has also been addressed in terms of ML techniques. It has been demonstrated that sizes and shapes of nanocrystals could be determined using small angle scattering data, e.g.24,25 as well as wide-angle data, e.g.26. It must be stressed, however, that recently solely the Neural Network algorithms are being used, making it impossible to identify features of the diffraction data that are relevant in the analysis.
In principle, nanoparticles may be conveniently visualized by high resolution electron microscopy27,28,29. However, recovering 3D shape and size information from 2D images may not be reliable, especially when a size distribution is present. Due to the strong dependence of the width of Bragg peaks on the crystallite size in the nanometer range, powder diffraction might seem to be a convenient tool to determine their size. Provided the peaks are well separated, the shape of crystallites might be determined by measuring relative intensities and widths of the peaks corresponding to different directions in the crystal lattice. However, this is not the case for nanoparticles of only 1–5 nm in size, for which the above simple relationships between size, shape and Bragg reflections are not fulfilled. A single nanograin is not a small, perfect single crystal; its internal structure is both size and shape dependent. Its diffraction pattern reflects the presence of intrinsic strains existing in individual nanoparticles, e.g.30. Since the presence of internal strains has a pronounced effect on the widths, positions and relative intensities of Bragg peaks, a straightforward analysis of individual Bragg peaks is not feasible for determining the size and shape of nanocrystalline materials with grains a few nm in size.
Our study is dedicated to those extremely small nanocrystals, and here we present results of recognition of nanoparticle shapes from diffraction data with ML algorithms based on a supervised learning method.
The experimental diffraction data required for training, in an amount sufficient for the statistical analysis needed for nanoparticle shape recognition, are not available, since nanomaterials of precisely controlled particle sizes and shapes are still scarce and none are available in sufficiently many size-shape modifications. Training and testing of the ML classifiers, which we apply in this work to nanodiamonds, was therefore entirely based on simulated diffraction data. In a series of works, we showed that the real structure of nanograins is well reproduced by Molecular Dynamics (MD) simulations. Based on MD simulated models of CdSe, diamond and SiC and comparing theoretical to experimental diffraction data, we were able to derive information on very fine details of nanocrystals’ atomic structure as well as determine the orientation of the crystal terminating surfaces and crystal habit (CdSe8, diamond9,10,11,31 and SiC13). That work required, however, a tedious comparison between experimental diffraction data and theoretical diffraction patterns of MD-simulated atomic models of nanograins. In the current report, we delegate the task to ML algorithms. To examine the applicability of ML algorithms to grain shape recognition, we selected diamond as a model system for its structural simplicity and similarity to a wide class of substances with hexagonal close-packed (HCP) structure that are of importance for nanoscience, e.g. CdSe, ZnO, SiC, GaN, GaP. Since we are in possession of extremely small nanodiamonds which we already measured and characterized by PDF analysis11, this work serves for testing the applicability of ML algorithms to the analysis of actual experimental diffraction data and, at the same time, for verification of our previous results.
The grain models generated for building the training database were composed of between 100 and 5000 atoms, i.e. their sizes were between 1 nm and 4 nm for 3D shapes. Three shape categories were considered, namely rods (1D), plates (2D), and superspheres/superellipsoids (3D), see Fig. 1. The term “superspheres” is used here to describe the nearly isotropic crystallites following the “supersphere equation” used in this work to create appropriate models, see Supplementary Materials. The models that met the requirements of ML are available to download from32.
Nanograin models with initially perfect diamond lattices were the subject of MD simulations. The MD calculations introduce collective thermal motions, i.e., phonons, and also the surface induced lattice strains and so provide a realistic approximation of the atomic structure of actual nanoparticles at \(T=300K\). Based on them, X-ray powder diffraction patterns were calculated using the Debye formula33.
The diffraction data under elaboration were structure functions S(Q), where Q is the modulus of the scattering vector. S(Q) is essentially the diffraction pattern, and contains only the information on crystal structure. It is free of the characteristics of the scattering process and of the sample chemical composition. Another feature relevant for structure analysis is that there is no need to analyze the full range of Q-values due to the high information redundancy of diffraction data. We show that data selection can be done for almost any arbitrary range of S(Q) data to obtain low-error predictions. This is an important feature for experimental data processing, since the characteristics of experimental data differ from those of theoretically obtained ones, and selection may be necessary.
Prior to ML-based data processing, the irrelevant signals should be removed from the experimental data20. Here, this has been done with the PDFgetX2 software in the way routinely used for PDF analysis. Next, the data were subjected to the removal of high-frequency noise and further background correction. Details are given in Sect. “Shape determination” and in Supplementary Materials.
The software package used for model building and diffraction data calculation was the npcl program34. This program is a successor of the NanoPDF642 program with greatly enhanced capabilities and operates on both Windows® and Unix/Linux machines. The MD simulations were run under the LAMMPS software package35,36. The three ML classification algorithms used were: Random Forest, from the Scikit-Learn v. 1.0.2 package37; Neural Network, from the Keras package38; and Extreme Gradient Boosting from39,40. The Python scripts for ML training and validation, experimental data processing, and shape recognition are placed at41.
Both theoretical and experimental diffraction data were X-ray powder patterns obtained for \(\lambda =0.561\)Å wavelength (silver anode), in the Q-range up to \(Q^{Ag}_{max}\approx 21\)Å−1.
Specific problems of diffraction data analysis of nanograins
Conventional analysis of the size and shape of grains of a polycrystalline material refers to the width and relative intensities of Bragg reflections. Considering only powder samples with no preferred orientation and no externally induced strains, the simple rules are such that broader Bragg peaks mean smaller grain dimensions, while the differences in peak widths and heights may indicate anisotropy of the grain shape. Such simple rules work well only if the peaks in the pattern are well separated, i.e., for particles of about 5 nm and larger but not for smaller nanograins.
For high symmetry structures, such as diamond, extraction of shape information from Bragg peaks is even more challenging, as demonstrated in Fig. 2. The figure shows that even for the [111] direction, which directly corresponds to the height of a plate/rod-shaped grain, the information on the grain height is contained in only a portion of the intensity and shape of the measured 111 peak. This is because the other three out of four equivalent [111] directions that contribute to the 111 peak are inclined with respect to the given axis of the plate/rod particle. This also concerns other reflections that combine information on dimensions of a particle in multiple directions, e.g. [220] and \([\bar{2}20]\), Fig. 2.
Comparison of S(Q) plots of three types of grain shapes presented in Fig. 3 shows that there are no strict correlations between relative Bragg intensities and specific grain shapes. For instance, one may expect that for the plate-shaped grain, the relative intensity of 220/111 reflections should always be larger than that for an isotropic supersphere grain and it should be smaller than that measured for rod-shaped grains. Fig. 3 shows that this rule is fulfilled for the grains with about 5000 atoms, but it is not fulfilled for the smaller grains: (i) for rods, the 111/220 ratio is about 1.2, it is only a little larger than 1 for grains with 1500 atoms, but it is smaller than 1.0 for rods with 200 atoms; (ii) for plates with 5000 atoms, the 111/220 ratio is about 0.8; it is only a little smaller than 1 for grains with 1500 atoms, and for the smallest plate, it is about 1.0. An obvious conclusion that follows from Fig. 3 is that relative intensities of Bragg reflections alone, measured for nanograins a few nm in size, do not contain unique and sufficient information on their actual shape.
The other, even more difficult problem with identification of shape that is specific for nanograins is that their internal structure evolves with the size, shape and surface structure. All these factors have different effects on diffraction patterns. In Fig. 4a, S(Q) of the model with a perfect crystal lattice is compared to S(Q) of this model after MD relaxation. It demonstrates that both relative intensities and peak positions change due to lattice relaxation induced by MD simulation. A similar effect is presented in Fig. 4b for grains with the same number of atoms but different shapes. In Fig. 4c, S(Q) of MD simulated models of superspheres with 200 and 5000 atoms are compared, showing different widths but also different peak positions. The complexity of the problem of correlations between diffraction effects and grain shape is additionally demonstrated in Fig. 4d, which shows that for grains with the same size and shape but different atomic structures of their surfaces, the lattice relaxation proceeds differently. This is demonstrated by different widths and positions of the Bragg peaks. The above examples show that conventional crystallographic tools used for the determination of grain size and shape that apply to “ordinary” polycrystalline materials may not work for nanosize materials, for which surface induced strains determine the internal structure of the individual grains. In this work, by using nanodiamonds as a model material, we show that, by employing Artificial Intelligence techniques through the application of ML, one can still assign real nanodiamond samples to a shape category and also discern between grains terminated by different types of surfaces.
Nanodiamond models building and diffraction patterns calculations
The npcl program34 was used for the creation of a collection of models with specific types of shapes, see Fig. 1. The model shaping procedure is based on analytical formulas of superspheres (SS) given by42,43. They have been generalized so as to elongate the initial models of grains in a given direction to create super-ellipsoid shapes and control the shape of the cross section of plate/rod-like models. More details are given in Supplementary Materials.
An initial collection of 18 000 models of nanodiamond grains with all kinds of shapes was built; the largest diameter was 5.5 nm for plates and 4 nm for superspheres. The maximum height of rods was 4.5 nm. The models from the database were divided into three groups of shapes based on proportions between length (L), width (W) and height (H).
The following definition of grain shapes was used:
-
superspheres/superellipsoids like shapes: \(height \approx width \approx length\) and \(H/L,H/W=0.8{\div }1.2\)
-
plate-like shapes: \(height<width,length\) and \(H/L,H/W<0.8\)
-
rod/cylinder-like shapes: \(height>width,length\) and \(H/L,H/W>1.2\)
In the case of plates and rods, the models were selected by their L to W ratio, where only models with either \(L/W<1.5\) or \(W/L<1.5\) were accepted. Among 18 000 models only 14 000 fulfilled the above criteria and only those were further used for training and grain shape identification.
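The selection rules above can be illustrated by a short Python sketch. It is not part of the npcl program; the function name and the return labels are hypothetical, and the cross-section criterion is interpreted here, as an assumption, as requiring that neither L/W nor W/L exceeds 1.5.

```python
def classify_shape(length, width, height, aspect_limit=1.5):
    """Return 'supersphere', 'plate', 'rod', or None for a rejected model.

    Hypothetical helper illustrating the H/L and H/W criteria quoted in the text.
    """
    h_l = height / length
    h_w = height / width
    if 0.8 <= h_l <= 1.2 and 0.8 <= h_w <= 1.2:
        return "supersphere"
    # Assumption: the L-to-W test means the cross-section is not too elongated,
    # i.e. the larger of L/W and W/L stays below the limit.
    cross_section_ok = max(length / width, width / length) < aspect_limit
    if h_l < 0.8 and h_w < 0.8 and cross_section_ok:
        return "plate"
    if h_l > 1.2 and h_w > 1.2 and cross_section_ok:
        return "rod"
    return None  # mixed proportions or elongated cross-section: model rejected


print(classify_shape(4.0, 4.0, 1.0))   # flat grain
print(classify_shape(1.0, 1.0, 4.0))   # elongated grain
```

Models for which the function returns None would correspond to the grains dropped when the database was reduced from 18 000 to 14 000 entries.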
The MD simulations have been done under the LAMMPS package35,36. The interactions between atoms were given by the AIREBO potential function44,45. The simulation protocol included several steps. In the first step, a given model was virtually heated to 150 K and then subjected to preliminary force and energy minimization by the quickmin algorithm46. Next, the temperature was gradually increased up to 300 K, which is equal to the temperature at which the experimental patterns were measured. This stage took 5000 simulation steps, which corresponds to 5 ps of real time. When the sample reached a steady temperature, it was allowed to relax for 2500 steps. During the last 1000 simulation steps, instantaneous atomic positions were recorded every 10th step and stored for further calculations. The temperature was controlled by the Nosé-Hoover thermostat47, probing every 0.1 ps. For several randomly selected models, the total internal energy was monitored to observe the relaxation process. The fully relaxed models showed energy fluctuations of less than 1%, regardless of the number of atoms and shape of the simulated grain.
The instantaneous atomic positions of the last steps served for calculation of the corresponding Pair Distribution Histograms (PDH). They were averaged to obtain the average PDH, which was used for calculation of the S(Q) structure function using the Debye scattering equation.
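A minimal NumPy sketch of this step is given below. It is not the npcl implementation; it assumes a monoatomic particle, for which the Debye scattering equation reduces to \(S(Q)=1+\frac{2}{N}\sum _k n_k \frac{\sin (Qr_k)}{Qr_k}\), where \(n_k\) is the averaged number of atom pairs in the histogram bin centered at \(r_k\). The function names, the bin width and the random test cluster are illustrative assumptions.

```python
import numpy as np


def pair_distance_histogram(positions, r_max, bin_width):
    """Histogram of unique interatomic distances (i < j) for one snapshot.

    Assumes r_max exceeds the largest interatomic distance in the particle.
    """
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(positions), k=1)
    edges = np.arange(0.0, r_max + bin_width, bin_width)
    counts, _ = np.histogram(dists[upper], bins=edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, counts


def structure_function(q, r_centers, mean_counts, n_atoms):
    """Debye equation for a monoatomic particle, evaluated from a histogram."""
    qr = np.outer(q, r_centers)
    # np.sinc(x) = sin(pi x) / (pi x), hence sinc(qr / pi) = sin(qr) / qr
    return 1.0 + (2.0 / n_atoms) * (np.sinc(qr / np.pi) @ mean_counts)


rng = np.random.default_rng(0)
base = rng.uniform(0.0, 10.0, size=(50, 3))      # hypothetical 50-atom cluster, in Å
# 100 snapshots, mimicking "every 10th of the last 1000 steps"
snapshots = [base + rng.normal(scale=0.05, size=base.shape) for _ in range(100)]
hists = [pair_distance_histogram(s, r_max=20.0, bin_width=0.01) for s in snapshots]
r_centers = hists[0][0]
mean_counts = np.mean([h[1] for h in hists], axis=0)   # averaged PDH

q = np.linspace(1.0, 16.0, 500)
s_q = structure_function(q, r_centers, mean_counts, n_atoms=len(base))
```

Working from the histogram instead of the full list of pairs keeps the cost of the sum proportional to the number of bins, which is why averaging the PDH over snapshots before applying the Debye equation is convenient.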
Selection and creation of ML classifiers
Generally, application of ML classifiers does not need assumptions about internal dependencies of the input data but only labeling of the models, which in our case assigns diffraction data to shape categories. To obtain a size distribution of grains with various shapes that is as even as possible and to avoid under- or over-representation of certain model sizes, the dataset was subdivided into equal-width bins. Every bin contained models with the number of atoms differing by not more than 100. As many as 25 randomly selected models were left in each bin. Finally, the training set consisted of models with numbers of atoms ranging from 100 to 5000, and comprised \(49\times 25=1225\) S(Q) patterns for each shape. To make ML classifiers resistant to numerical and experimental errors that may affect peak positions of the experimental S(Q), every pattern was loaded three times, assuming a shift along the Q-axis by \(-0.05\%\), \(0\%\) and \(+0.05\%\) (\(\delta Q < {\pm }0.05\)Å−1). The S(Q) functions were normalized by setting the height of the strongest peak to 1 and then dividing the whole S(Q) pattern by its standard deviation.
We utilized three types of classifiers with significantly different principles of operation, namely Random Forest (RF), Neural Networks (NN) and eXtreme Gradient Boosting (XGB). In general, RF is a meta-classifier that may use a set of different types of sub-classifiers. Here we solely employ Decision Tree sub-classifiers. The advantage of RF is that it allows a straightforward examination of the points and/or areas of the data used by the classifier for decision making, commonly referred to as relevance analysis (RA), see Sect. “Shape determination”. The RF training requires selection and adjustment of hyperparameters, which control the classifier’s development stage. In this study, three hyperparameters were selected for tuning the RF classifiers, namely: n_estimators, max_features, max_samples. The meaning of those hyperparameters is as follows:
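The three data preparation steps described above (balanced sampling, Q-shift augmentation and normalization) may be sketched as follows. The function names are hypothetical, the bins are taken as half-open intervals of 100 atoms, and the Q-shift is implemented, as an assumption, by stretching the Q-axis by the given relative amount and resampling on the original grid.

```python
import numpy as np


def balanced_sample(n_atoms, per_bin=25, bin_width=100, lo=100, hi=5000, seed=0):
    """Pick at most `per_bin` model indices from each equal-width atom-count bin."""
    rng = np.random.default_rng(seed)
    n_atoms = np.asarray(n_atoms)
    chosen = []
    for start in range(lo, hi, bin_width):          # 49 bins: 100, 200, ..., 4900
        in_bin = np.flatnonzero((n_atoms >= start) & (n_atoms < start + bin_width))
        if len(in_bin) > per_bin:
            in_bin = rng.choice(in_bin, size=per_bin, replace=False)
        chosen.extend(in_bin.tolist())
    return np.array(sorted(chosen))


def shifted_copies(q, s_q, rel_shifts=(-0.0005, 0.0, 0.0005)):
    """Resample S(Q) on the original grid after stretching Q by each relative shift."""
    return np.array([np.interp(q, q * (1.0 + eps), s_q) for eps in rel_shifts])


def normalize(s_q):
    """Scale the strongest peak to 1, then divide the pattern by its standard deviation."""
    scaled = s_q / np.max(s_q)
    return scaled / np.std(scaled)


# Hypothetical database with one model per atom count
idx = balanced_sample(np.arange(100, 5000))
q = np.linspace(1.0, 16.0, 2000)
s_q = 1.0 + np.exp(-((q - 3.05) / 0.1) ** 2)        # single synthetic peak
patterns = np.array([normalize(p) for p in shifted_copies(q, s_q)])
```

Note that the standard deviation scales together with the pattern, so the final division makes the result independent of the preceding peak-height scaling; both steps are kept in the sketch only to follow the description in the text.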
-
n_estimators - number of estimators (decision trees) in a forest
-
max_features - controls the maximum number of input data features taken for the best split of a tree node,
-
max_samples - maximum number of samples to draw from the training set needed for training purposes of an estimator
The search for the best hyperparameter values was delegated to the GridSearch (GS) algorithm. In every training instance, the GS selected new optimum values of hyperparameters from the following sets:
-
\(n\_estimators=\lbrace 50, 100, 150, 200, 250, 300\rbrace \cdot N_{ld}\)
-
\(max\_features=\lbrace 0.8,\ 1.0,\ 1.2\rbrace \cdot \sqrt{N_{ss}}\)
-
\(max\_samples=\lbrace 0.8,\ 0.9,\ 0.95\rbrace \cdot N_{ss}\)
where: \(N_{ld}\) is the number of data points of a pattern; \(N_{ss}\) is the number of samples per shape in the current learning set.
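A hedged Scikit-Learn sketch of such a grid search is shown below. The data are synthetic stand-ins for S(Q) patterns, the set sizes are far smaller than the real ones, and the three-fold cross-validation and default accuracy scoring are assumptions, since the text does not specify them. To keep the example fast, the tree counts are small fixed values rather than multiples of \(N_{ld}\).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
n_ss, n_points, n_shapes = 30, 40, 3        # hypothetical sizes
y = np.repeat(np.arange(n_shapes), n_ss)
X = rng.normal(scale=0.3, size=(y.size, n_points))
for label in range(n_shapes):
    # class-specific synthetic "peak" spanning five neighbouring points
    X[y == label, 10 * label + 3: 10 * label + 8] += 2.0

param_grid = {
    # The text scales the tree count by N_ld; fixed small values are used here.
    "n_estimators": [20, 50],
    "max_features": [max(1, int(c * np.sqrt(n_ss))) for c in (0.8, 1.0, 1.2)],
    "max_samples": [int(c * n_ss) for c in (0.8, 0.9, 0.95)],
}

# cv=3 and the default accuracy scoring are assumptions of this sketch
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))

# Feature importances of the best forest give one possible form of relevance analysis
relevance = search.best_estimator_.feature_importances_
```

Integer values of max_features and max_samples are interpreted by Scikit-Learn as absolute counts, which matches the way the sets are defined in the text.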
Contrary to the RF method, NNs do not offer RA; they give only the final results without a possibility of tracing the rules governing the choice. The architecture of the NN that was used in this work remained unchanged for all shape recognition cases. The NN architecture was optimized by observation of the precision and Matthews metrics (see Sect. "The range of S(Q) function required for shape recognition") versus the number of layers and neurons. It was established that using more than 3 layers does not improve the ML performance. The number of neurons was the same for each hidden layer but depended on the reflection range. The NN classifiers were constructed as follows:
-
activation function of hidden layers - sigmoid
-
size of hidden layers against the peak ranges:
-
\(110\div 422\) - 1312 neurons
-
\(110\div 331\) - 1125 neurons
-
\(220\div 422\) - 1328 neurons
-
\(220\div 331\) - 875 neurons
-
\(311\div 422\) - 968 neurons
-
\(311\div 331\) - 1468 neurons
-
activation function of output layer - softmax
-
loss function - sparse_categorical_crossentropy
Similarly to the above, the number of epochs during training also depended on the Q-range; it was established experimentally and ranged from 300 to 600. The scheme of the NN is shown in Fig. 5. In the case of shape prediction, only two hidden layers have been used. For surface identification, as many as three hidden layers were needed, see Sect. "Surface structure identification".
Similarly to RF, the structure of XGB is based on Decision Trees. However, in this case, trees are not independent but built one after another. Each tree corrects errors of the previous ones. XGB also provides RA. Training has been done by the GS method for the following hyperparameters:
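The network described above may be sketched with the Keras API as follows. This is an illustration only: the function name, the number of input points and the choice of the Adam optimizer are assumptions, as the text specifies only the activation functions, the loss function and the layer sizes.

```python
from tensorflow import keras


def build_classifier(n_points, n_neurons, n_hidden=2, n_classes=3):
    """Fully connected classifier with sigmoid hidden layers and a softmax output."""
    layers = [keras.Input(shape=(n_points,))]
    layers += [keras.layers.Dense(n_neurons, activation="sigmoid") for _ in range(n_hidden)]
    layers.append(keras.layers.Dense(n_classes, activation="softmax"))
    model = keras.Sequential(layers)
    # The optimizer is not stated in the text; Adam is an assumption.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Two hidden layers of 1328 neurons, the size quoted for the 220-422 range;
# n_points=2000 is a hypothetical number of S(Q) points in that range.
shape_model = build_classifier(n_points=2000, n_neurons=1328)
shape_model.summary()
```

Training would then call shape_model.fit(X_train, y_train, epochs=...) with integer class labels and an epoch count in the 300 to 600 range; a three-layer variant for surface identification follows from n_hidden=3.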
-
\(max\_depth=3\div 5\) - controls the number of levels of the decision trees
-
\(learning\_rate=0.5\div 1.0\) - controls the speed of convergence to the optimal model at each iteration
-
\(n\_estimators=30\div 60\) - similarly to RF, the number of estimators (decision trees), which are here built sequentially
Contrary to RF, the XGB predictions were only weakly dependent on hyperparameter variations. The experimental data classification has been done with the same collection of hyperparameters in all cases, which differ only in the models used for training. In the case of shape classification, max_depth, learning_rate, and n_estimators were set to 4, 0.8 and 40, respectively; for the surface classification, to 5, 0.7 and 40. It was also observed that the XGB classifier training time was at least 10 times longer than that of the RF or NN classifiers.
The above algorithms were tested to determine their level of noise resistance. The tests have been conducted on training sets by incorporating quasi-random noise that corresponds to that present in real data. It was found that even for relatively high levels of noise, which exceed 2–3 times that observed experimentally, the number of correct responses is comparable to that observed for noise free data. A higher level of noise, however, produces considerably more expanded decision trees and training time, especially in the case of XGB. Similarly, for NN, the number of neurons and hidden layers must be increased to compensate for the effect of noise. Since this procedure requires the use of a larger range of hyperparameters during optimization, it leads to an increase in the training time. Taking the above into account, to increase the reliability of shape and surface identification from experimental diffraction data, selection of low-noise Q-ranges and filtering of the examined data were applied, see Sect. “Shape determination”.
ML classification of theoretically calculated X-ray diffraction patterns
The range of S(Q) function required for shape recognition
The goal of spectrum range analysis was to find the optimum number of the data points/features needed to determine the shape of nanodiamond grains. The initial S(Q) data consisted of 85001 points, covering the range between \(Q\approx 0.98\div 15.84\)Å−1 (\(2\theta =5^{\circ }\div 90^{\circ }\)). For this full Q-range, a high and satisfactory level of information redundancy was obtained. Nevertheless, to choose a consistent data representation, introduction of additional constraints was considered to reach three goals:
(i) determining the smallest number of peaks to analyze while maintaining a low level of errors, (ii) accounting for the presence of different types of distortions (like measurement errors, noise, high-level background, etc.) present in experimental data, and (iii) preventing the possibility of under- or over- fitting of classifiers.
The evaluation of classifiers was based on Matthews Metric (MM)48,49. The MM metric returns a number between −1 and +1, where −1 denotes an inverted prediction, 0 for random responses, and +1 for error-free predictions. Contrary to other metrics, the MM takes into account all kinds of predicted answers, which are assigned to one of four groups, i.e.,: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). To evaluate the outcomes of a single classification, we also utilized the precision metric (PM), which is defined as: \(PM=\frac{TP}{TP+FP}\).
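For a two-class problem, both metrics follow directly from the four counts, with \(MM=\frac{TP\cdot TN-FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\). The short sketch below evaluates them; the function names are hypothetical, and for the three shape classes a multi-class generalization, such as matthews_corrcoef from Scikit-Learn, would be needed.

```python
import math


def matthews_metric(tp, tn, fp, fn):
    """Binary Matthews metric; returns 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def precision_metric(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)


print(matthews_metric(50, 50, 0, 0))     # error-free predictions
print(matthews_metric(25, 25, 25, 25))   # random responses
print(matthews_metric(0, 0, 50, 50))     # inverted predictions
print(precision_metric(45, 5))
```

The three calls reproduce the limiting values +1, 0 and −1 described in the text.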
The optimum range of the data used for analysis by changing the beginning and width of the Q-range was examined with reference to MM. Since experimental S(Q) peaks above \(10\)Å−1 are weak and barely seen, the Q-range was limited only to the area covering the first seven strongest peaks, Fig. 6b. The calculations were performed for several Q-ranges starting from solely the 111 peak, and then the next peaks were subsequently included in the analyzed Q-range, up to the seventh 511 peak. Next, the Q-range was gradually shortened at the low Q side, and finally only the last 511 peak was included in the examined data.
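The Q-ranges are referred to by the indices of the first and last reflections they contain. As an orientation, the nominal peak positions can be estimated from \(Q_{hkl}=2\pi \sqrt{h^2+k^2+l^2}/a\). The sketch below assumes the bulk diamond lattice parameter \(a=3.567\)Å; in strained nanograins the actual positions are shifted, and the margin added around the range is a hypothetical choice.

```python
import numpy as np

A_DIAMOND = 3.567   # Å, bulk diamond lattice parameter (assumed for orientation only)
PEAKS = {"111": (1, 1, 1), "220": (2, 2, 0), "311": (3, 1, 1), "400": (4, 0, 0),
         "331": (3, 3, 1), "422": (4, 2, 2), "511": (5, 1, 1)}


def peak_q(hkl, a=A_DIAMOND):
    """Nominal position of the hkl reflection of a cubic lattice, in Å^-1."""
    return 2.0 * np.pi * np.sqrt(sum(i * i for i in hkl)) / a


def select_range(q, s_q, first, last, margin=0.4):
    """Keep the part of S(Q) spanning reflections `first` to `last` (hypothetical margin)."""
    lo = peak_q(PEAKS[first]) - margin
    hi = peak_q(PEAKS[last]) + margin
    mask = (q >= lo) & (q <= hi)
    return q[mask], s_q[mask]


for name, hkl in PEAKS.items():
    print(name, round(float(peak_q(hkl)), 2))

q = np.linspace(0.98, 15.84, 85001)
s_q = np.ones_like(q)                      # placeholder pattern
q_sel, s_sel = select_range(q, s_q, "220", "422")
```

With this lattice parameter the 422 reflection falls near 8.6 Å−1 and the 511 reflection near 9.2 Å−1, so all seven peaks lie below \(10\)Å−1.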
For each selected Q-range, the training and testing of the ML classifiers was performed, and the corresponding MM was calculated. Throughout this work, the data for training and validation of the classifiers were chosen from the initial set in accordance with the Pareto principle: randomly selected 80% of the models were assigned to the training set, while the remaining 20% were designated as a validation collection.
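The per-range procedure can be sketched with Scikit-Learn as below. The data are synthetic, the Q-ranges are represented by hypothetical index windows, and a plain RF with default settings replaces the tuned classifiers for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_class, n_points = 100, 60             # hypothetical sizes
y = np.repeat(np.arange(3), n_per_class)
X = rng.normal(scale=0.3, size=(y.size, n_points))
for label in range(3):
    # class-specific synthetic "peak" spanning five neighbouring points
    X[y == label, 20 * label + 5: 20 * label + 10] += 1.5

# Hypothetical Q-ranges expressed as index windows on the synthetic grid
ranges = {"full": slice(0, 60), "first two peaks": slice(0, 40), "last peak": slice(40, 60)}

mm_by_range = {}
for name, window in ranges.items():
    # random 80% / 20% split into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(
        X[:, window], y, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    mm_by_range[name] = matthews_corrcoef(y_val, clf.predict(X_val))

print(mm_by_range)
```

The numbers obtained for the synthetic data carry no physical meaning; the sketch only shows how one MM value per Q-range is produced.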
The summary of MM computations is shown in Fig. 6a. For the first four Q-ranges covering reflections 111, 111 \(\div\) 220, 111 \(\div\) 311 and 111 \(\div\) 400 one observes a continuous increase of MM, and then it remains constant when the next three reflections are included. The shape prediction classifiers achieve MM above 95% for all Q-ranges, except for the single reflection 511, where MM is only about 80%.
The MM characteristics presented in Fig. 6a demonstrate that for a near perfect shape recognition, the Q-range covering the initial seven peaks is optimal; however, the absence of the 111 or even the 220 peak does not result in a significant increase of misclassifications.
Indeed, the broad plateau of the MM plot in Fig. 6a indicates a high level of information redundancy in the S(Q) spectrum and, therefore, any number or type of peaks, or even a single peak, may be used for shape analysis of nanodiamond grains. However, too small a number of peaks increases the likelihood of unfavorable overfitting of ML classification algorithms, substantially limiting an effective and proper use of the available information. In consequence, it decreases the robustness of the analysis of the experimental data. This is particularly evident in the case of ML trained solely on the 511 peak, where a significant decrease in the number of correct answers is observed.
The RA presented in Fig. 6b serves for further selection of the peaks and justification of the choice of Q-ranges that are required according to the MM analysis. The S(Q) function (blue line) is presented along with the plot of RA (red line) corresponding to the full Q-range. The RA shows that the principal regions of analysis for the ML classifier are located near the peaks 111, 220, 400, 331 and 422, with a low significance of the 311 peak and background. The RA also indicates that the RF algorithm examines specific areas multiple times, but the construction of the algorithm imposes that every data point is visited only once.
ML classification of experimental data
Shape determination
Experimental diffraction data were obtained for nanodiamonds synthesized from the adamantane \(C_{10}H_{16}\) molecule precursor under HP-HT conditions50. The samples with the average grain size of 1.19, 1.2, 1.28, 2.70 and 3.30 nm were selected for the present study. All samples were previously identified as plates terminated by (111)B surfaces with three dangling bonds31. X-ray powder diffraction data were collected with the Bruker D8 Advance diffractometer equipped with an Ag-anode radiation source.
Principal data reduction was performed with PDFgetX2 software51. The experimental data originally contained 1321 points, with \(Q\approx 0.98 \div 21.6\)Å−1 \((2\theta =5^{\circ }\div 150^{\circ })\), which were mapped to the theoretical data points by cubic spline interpolation. The experimental S(Q) functions were subjected to noise reduction by employing the 5-rank Savitzky-Golay filter and to background correction, see Fig. 7. More details on background correction are given in Supplementary Materials, Fig. S1 and Fig. S2. Next, for the shape classification processing, the points above \(Q=9.0\)Å−1 were excluded (shaded area in Fig. 7) due to insufficient data quality.
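The processing chain can be sketched with SciPy as follows. Several details are assumptions of this sketch: the "5-rank" filter is read as a polynomial order of 5, the window length of 11 points is hypothetical, smoothing is applied before interpolation, the theoretical grid is taken as uniform in \(2\theta\), and the background correction described in Supplementary Materials is omitted. The conversion uses \(Q=4\pi \sin (\theta )/\lambda\).

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import savgol_filter

WAVELENGTH = 0.561   # Å, Ag anode


def two_theta_to_q(two_theta_deg, wavelength=WAVELENGTH):
    """Q = 4 pi sin(theta) / lambda, with 2-theta given in degrees."""
    return 4.0 * np.pi * np.sin(np.radians(two_theta_deg) / 2.0) / wavelength


def preprocess(q_exp, s_exp, q_theory, window=11, polyorder=5, q_cut=9.0):
    """Smooth, map onto the theoretical grid, and drop points above q_cut."""
    # window and polyorder are assumptions; the text only states a "5-rank" filter
    smoothed = savgol_filter(s_exp, window_length=window, polyorder=polyorder)
    on_grid = CubicSpline(q_exp, smoothed)(q_theory)
    keep = q_theory <= q_cut
    return q_theory[keep], on_grid[keep]


rng = np.random.default_rng(0)
q_exp = two_theta_to_q(np.linspace(5.0, 150.0, 1321))       # experimental grid
s_exp = 1.0 + np.exp(-((q_exp - 3.05) / 0.3) ** 2) + rng.normal(scale=0.02, size=q_exp.size)
q_theory = two_theta_to_q(np.linspace(5.0, 90.0, 85001))    # assumed theoretical grid

q_out, s_out = preprocess(q_exp, s_exp, q_theory)
print(round(float(two_theta_to_q(5.0)), 2), round(float(two_theta_to_q(150.0)), 2))
```

The conversion reproduces the limits quoted in the text: \(2\theta =5^{\circ }\), \(90^{\circ }\) and \(150^{\circ }\) correspond to \(Q\approx 0.98\), 15.84 and 21.6 Å−1, respectively.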
Based on the full collection of theoretical S(Q) functions described in Sect. "The range of S(Q) function required for shape recognition", and using the same protocol as above, the ML classifiers were trained with the RF, NN and XGB algorithms and applied to six different sets of Bragg peaks. For all classifiers, the MM metrics were above 90%. For each group of generated classifiers, the one with the highest MM was selected and used for shape prediction from the experimental data.
The MM metrics for classifiers within the selected groups of peaks are shown in Fig. 8a. The MM values range from 96% to 99% for the RF/NN classifiers and decrease as the Q-range is reduced. For the XGB classifiers, higher values of the MM metric are observed, and they depend only weakly on the number of peaks under examination.
The results of shape recognition from the experimental data by ML classifiers based on the RF algorithm are shown in Fig. 8b. The shape of the largest, 3.3 nm grains is recognized as a plate by all classifiers. For medium-sized samples, two types of shapes are recognized, with the plate being the dominating shape. The grains of the smallest samples are recognized as either plate- or rod-like, depending on the data range analyzed. In this case, rod-like shapes are recognized by ML classifiers trained on data with the 111 peak present, whereas plates are recognized by classifiers with the 111 reflection excluded. An exception is observed for the two smallest grain samples, where rod-like shapes are identified by classifiers when the 311 and 422 peaks are taken into account.
The results of shape classification by NN are shown in Fig. 8c. There is full agreement between the RF and NN classifiers for three sets of Bragg peaks, i.e., \(220\div 422\), \(220\div 331\) and \(311\div 331\), for which plate-like shapes are recognized. A similar result was found for the \(311\div 422\) peak group, except that a supersphere shape is detected for the sample containing the smallest nanodiamond grains.
Similarly to the above, the shapes predicted by XGB, Fig. 8d, are rods when the S(Q) data include the 111 peak; if the 111 peak is ignored, plate-like shapes dominate. Furthermore, the XGB classifiers indicated superspheres in two cases, but these findings are not matched by the predictions of the RF/NN classifiers. They are identified for the largest and medium-sized models when the S(Q) data start from the 331 peak.
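Comparing the predictions of the three algorithms across the peak sets amounts to a simple tally. The sketch below counts the labels assigned to one sample and reports the most frequent one, optionally skipping the peak ranges that start at a given reflection. The dictionary layout, the function names and the example labels are hypothetical and do not reproduce the results of Fig. 8.

```python
from collections import Counter


def tally_shapes(predictions):
    """Count how often each shape label is predicted for one sample.

    `predictions` maps (algorithm, peak_range) to a shape label.
    """
    return Counter(predictions.values())


def consensus(predictions, exclude_peak=None):
    """Most frequent label, optionally skipping peak ranges starting at `exclude_peak`.

    Returns (label, votes_for_label, total_votes); label is None if nothing is left.
    """
    kept = {
        key: label
        for key, label in predictions.items()
        if exclude_peak is None or not key[1].startswith(exclude_peak)
    }
    counts = tally_shapes(kept)
    if not counts:
        return None, 0, 0
    label, votes = counts.most_common(1)[0]
    return label, votes, sum(counts.values())


if __name__ == "__main__":
    # Made-up predictions for a single sample, for illustration only.
    sample = {
        ("RF", "111-422"): "rod",
        ("NN", "111-422"): "rod",
        ("XGB", "111-422"): "rod",
        ("RF", "220-422"): "plate",
        ("NN", "220-422"): "plate",
        ("XGB", "220-422"): "plate",
        ("RF", "311-422"): "plate",
    }
    print(consensus(sample))
    print(consensus(sample, exclude_peak="111"))
```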
Surface structure identification
The results presented in Sect. “Shape determination” indicate that all types of classifiers are capable of recognizing the shape of nanodiamonds and are fully consistent with our previous findings11, which allowed the study to be extended to the analysis of the plate surfaces.
There are two subtypes of the (111) surface, denoted in this work as A and B, which differ in the number of dangling bonds of the surface atoms, as shown in Fig. 9. Accordingly, the plate-shaped nanodiamonds may be of three types, terminated by A-A, B-B, or A-B pairs of (111) surfaces31. Subsequently, the plate-like models from the database were selected and assigned to the suitable group, in accordance with the methodology outlined in the preceding Sections. The number of models available for training was smaller in this case, since only plates were needed, and consequently the selection and training protocols were modified. The number of models in a bin was set to 15 and the total number of models in each category was 750. To obtain high values of the MM metric, cf. Fig. 10b, and to reduce the sensitivity to experimental and numerical errors and to overfitting, every S(Q) pattern was taken 5 times, by shifting the given pattern along the Q-axis, see Sect. "Selection and creation of ML classifiers".
The training of the RF and XGB classifiers proceeds similarly to the shape recognition case; see Sect. “Shape determination”. The generation of the NN classifiers required a modification here: the network was extended to three hidden layers, each with the same number of units, which depends on the size of the data, see Sect. "Selection and creation of ML classifiers". The activation function was the Gaussian Error Linear Unit (GELU), with the weight initialization proposed by He et al.52. The Batch Normalization technique was used to normalize the results of each layer (except for the outermost one). Strong overfitting was observed when training was done for a very narrow Q-range of S(Q) data, with only the \(311\div 422\) or \(311\div 331\) peaks. To correct for this, the Dropout technique was employed, with a dropout rate of 20%.
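The pattern-shifting augmentation mentioned above can be sketched as follows. With 15 models per bin and 750 models per category, the selection corresponds to 50 bins if all bins are equally filled, and five shifted copies of each pattern would give 3750 training patterns per category. The shift values below are hypothetical placeholders (the actual offsets are described in the section cited in the text), and linear interpolation with held edge values is an assumption made to keep the example dependency-free.

```python
import bisect


def shift_pattern(q, s, dq):
    """Shift S(Q) by dq along Q and resample it on the original grid.

    Linear interpolation is used; outside the tabulated range the edge value is held.
    """
    shifted = []
    for qi in q:
        x = qi - dq  # the shifted curve at qi equals the original curve at qi - dq
        if x <= q[0]:
            shifted.append(s[0])
        elif x >= q[-1]:
            shifted.append(s[-1])
        else:
            j = bisect.bisect_right(q, x) - 1
            t = (x - q[j]) / (q[j + 1] - q[j])
            shifted.append(s[j] + t * (s[j + 1] - s[j]))
    return shifted


def augment(q, s, shifts=(-0.02, -0.01, 0.0, 0.01, 0.02)):
    """Return one shifted copy per entry in `shifts` (hypothetical values, in 1/Angstrom)."""
    return [shift_pattern(q, s, dq) for dq in shifts]


if __name__ == "__main__":
    # Synthetic linear "pattern", for illustration only.
    q = [0.1 * i for i in range(1, 91)]
    s = [2.0 * qi for qi in q]
    copies = augment(q, s)
    print(len(copies), "copies per pattern ->", 750 * len(copies), "patterns per category")
```

Training on slightly displaced copies makes a classifier less sensitive to small peak-position errors, such as those caused by zero-shift or sample-displacement effects in the measurement.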
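The three ingredients named above (GELU activation, He initialization and Dropout) are short enough to write out explicitly. The sketch below uses only the Python standard library and is not the authors' Keras implementation: the exact erf-based form of GELU, the normal variant of He initialization and the "inverted" form of dropout are assumptions, and the function names are hypothetical.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal cumulative distribution."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def he_normal(fan_in, fan_out, rng):
    """Weight matrix with entries drawn from a normal distribution of variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def dropout(activations, rate, rng):
    """Inverted dropout (training mode): zero a unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = he_normal(fan_in=200, fan_out=4, rng=rng)
    print("GELU(1.0) =", round(gelu(1.0), 4))
    print("weight matrix:", len(weights), "x", len(weights[0]))
    print("after dropout:", dropout([1.0] * 10, rate=0.2, rng=rng))
```

With a rate of 20%, each surviving activation is multiplied by 1/0.8 = 1.25 so that the expected layer output is unchanged, and no rescaling is needed at prediction time.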
The MM metrics for RF/NN/XGB classifiers are shown in Fig. 10a. The MM values of all classifiers are about 95% when at least 4 peaks are taken, starting from either 111 or 220. A significant decrease of MM is observed only for the pattern range starting from the 311 peak, especially for NN. In the majority of cases the MM values for RF classifiers are slightly higher than those for NN and XGB.
The results of the application of the ML algorithms to the experimental patterns are presented in Fig. 10b-d and show that the classifiers recognize the B-B surface configuration as dominating if the 111 peak is ignored. This result also conforms to the shape recognition case, where inclusion of the 111 peak strongly affects the result of shape identification, see Sect. “Shape determination”.
Discussion
Although information on the shape of nanograins is definitely contained in the diffraction patterns, a combination of multiple factors affecting the shape of the diffraction data effectively excludes shape determination of nanocrystals with the conventional analytical tools dedicated to polycrystalline materials.
A critical problem in the shape recognition of nanograins comes from the fact that the atomic structure of an individual grain depends on its size, shape and the specific structure of its surfaces. In this work we show that the application of ML to the shape analysis of nanodiamonds allows one to overcome these difficulties and may be an effective tool for identifying the shape of nanograins. The crucial stage of grain shape recognition is the collection of reference diffraction data, which in our case are theoretical patterns calculated for MD-simulated models of grains. ML is able to effectively analyze the network of interrelations between various parts and features of the diffraction patterns of grains with various shapes and surfaces. ML classifiers are capable of differentiating the objects even if a thorough comparison of their diffraction patterns reveals only fuzzy, hard-to-quantify differences.
The results presented in the paper demonstrate that the performance of the ML classifiers is strongly dependent on the statistics hidden in the collections of models used for training. For example, the predictive abilities diminish significantly in the cases where the training sets do not cover the full range of the analyzed crystallite sizes. Thus the rules for grain shape determination must also take into account the broadest possible range of grain sizes. This conclusion illustrates a well-known issue called out-of-distribution degradation53,54,55.
Analysis of MM metrics derived for the examined cases shows that classifiers require the delivery of diverse statistics to efficiently process information contained in the diffraction patterns. The shape classifications of the modeled particles indicate that ML classifiers are capable of following the rules imposed on the internal atomic structure of individual grains by the strains developed from the grain surfaces. It is also apparent that ML algorithms indirectly incorporate in the analysis the deviation of the atomic structure of nanograins from the perfect crystal lattice which varies depending on the specific atomic structure of the grain surfaces. This feature specific to a few nm-sized grains is of key importance for identification of atomic architecture of basal planes of nanodiamond plates, which in our case are terminated by carbons with three dangling bonds.
An independent check of the ML shape classification of the experimental data reported in Sect. “Shape determination” is provided by the results presented in ref. 11. Based on a very detailed numerical analysis, it was demonstrated that nanodiamonds synthesized from chloroadamantane take a plate-like shape, terminated on both sides by B-type surfaces. Exactly such results are provided by the RF, XGB and NN classifiers trained on the diffraction patterns not containing the 111 peak, with the exception of the smallest grains and of classifiers trained on very limited Q-ranges. It is apparent that the presence of the 111 reflection in the experimental diffraction patterns under examination confuses the ML classifiers. The reason for this deficiency may be a combination of instrumental and calculation errors. The experimental diffraction data were obtained on an in-house laboratory diffractometer, where at low diffraction angles the direct beam divergence, the opening of the incoming and receiving slits, and the detector geometry are difficult to account for precisely. Also, intensity correction factors such as the Lorentz-polarization factor are large and change quickly at low diffraction angles, so computational errors may result in an incorrect intensity and shape of the 111 peak.
A growing number of research teams are addressing the identification of crystal structure parameters by ML algorithms. They predominantly employ neural networks for analyzing selected topics of significance in this domain, such as lattice parameter estimation20 and pattern classification56,57. This paper further explores these topics. It illustrates that using three distinct ML algorithms instead of a single one gives greater confidence in the outcomes.
The reported identification of the grain shape from the experimental data may be considered satisfactory, although it is not fully unequivocal. Uncertainties concerning shape identification are not unexpected, since they can result both from the samples’ characteristics, e.g., particle size distribution and shape distribution, and from experimental deficiencies, e.g., statistical noise. One must also note that, when analyzing real nanomaterials, one searches for a preferred or most frequently occurring particle shape. Depending on the particular sample, the answer may not necessarily be definite or univocal, e.g., if none of the shapes occurs more frequently than the others.
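The steepness of the Lorentz-polarization correction at low angles can be made concrete with a short calculation. The sketch below assumes one common textbook form of the factor for powder diffraction with an unpolarized beam, \((1+\cos ^{2}2\theta )/(\sin ^{2}\theta \cos \theta )\); the correction actually applied to the data in this work may differ (for example, if a monochromator is used), and the function names are hypothetical.

```python
import math


def lorentz_polarization(two_theta_deg):
    """LP factor (1 + cos^2 2theta) / (sin^2 theta * cos theta), unpolarized beam assumed."""
    theta = math.radians(two_theta_deg) / 2.0
    two_theta = 2.0 * theta
    return (1.0 + math.cos(two_theta) ** 2) / (math.sin(theta) ** 2 * math.cos(theta))


def relative_change_per_degree(two_theta_deg, step=0.5):
    """Fractional change of LP per degree, estimated over `step` degrees of 2theta."""
    lp_here = lorentz_polarization(two_theta_deg)
    lp_next = lorentz_polarization(two_theta_deg + step)
    return (lp_next - lp_here) / lp_here / step


if __name__ == "__main__":
    for two_theta in (5.0, 10.0, 20.0, 40.0):
        print(
            two_theta,
            round(lorentz_polarization(two_theta), 1),
            round(relative_change_per_degree(two_theta), 3),
        )
```

In this form the factor drops by roughly a factor of four between 5° and 10°, so a small error in the angular scale or in the assumed geometry translates into a large relative intensity error for the lowest-angle reflection.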
Summary
This work presents an analysis of the shape of nanodiamond grains based on diffraction data with the application of Machine Learning techniques. The shape of nanodiamond grains was recognized based on the database created from theoretical structure functions S(Q) calculated for models of the grains after their relaxation using MD simulations.
The classifiers were generated using three distinct methodologies, namely Random Forest, Neural Networks, and Extreme Gradient Boosting. It is demonstrated that these methods produce ML classifiers with similarly high rates of correct classification. RF/NN/XGB classification can discern between grains with the same shape but different surfaces; namely, it shows that the nanodiamond plates are covered by (111) surfaces terminated by atoms with three dangling bonds.
The paper demonstrates the high degree of redundancy of shape information in the diffraction data. This allowed us to safely reduce the range of the data used for identification and to remove the pattern areas most prone to experimental deficiencies without a negative impact on the shape and surface predictions.
Data availability
All data are available from: Python scripts - https://github.com/kskrobas/shapeAIRecog. ML training databases - https://unipress.waw.pl/nanopdf/. Software for diamond models building and diffraction spectra calculations https://github.com/kskrobas/npcl64.
Abbreviations
- AIREBO: Adaptive Intermolecular Reactive Bond Order
- BCA: Background correction algorithm
- FN: False negative
- FP: False positive
- FST: Fourier sine transform
- GELU: Gaussian Error Linear Unit
- GS: GridSearch algorithm
- H: Height
- HCP: Hexagonal close-packed
- L: Length
- LAMMPS: Large-scale Atomic/Molecular Massively Parallel Simulator
- ML: Machine Learning
- MD: Molecular Dynamics
- MM: Matthews Metric
- NN: Neural Networks
- PDF: Pair Distribution Function
- PDH: Pair Distribution Histogram
- PM: Precision metric
- RA: Relevance analysis
- RF: Random Forest
- SE: Superellipsoid
- SS: Supersphere
- TN: True negative
- TP: True positive
- W: Width
- XGB: Extreme Gradient Boosting
References
Palosz, B., Grzanka, E., Gierlotka, S. & Stelmakh, S. Nanocrystals: Breaking limitations of data analysis. Zeitschrift für Kristallographie - Crystalline Materials 225(12), 588–598. https://doi.org/10.1524/zkri.2010.1358 (2010).
Skrobas, K., Stelmakh, S., Gierlotka, S. & Palosz, B. F. Nanopdf64: software package for theoretical calculation and quantitative real-space analysis of powder diffraction data of nanocrystals. J. Appl. Cryst 50, 1821–1829 (2017).
Cervellino, A., Frison, R., Bertolotti, F. & Guagliardi, A. Debussy 2.0: the new release of a debye user system for nanocrystalline and/or disordered materials. Journal of Applied Crystallography 48, 2026–2032. https://doi.org/10.1107/S1600576715020488 (2015).
Bertolotti, F. et al. A total scattering Debye function analysis study of faulted Pt nanocrystals embedded in a porous matrix. Acta Crystallographica Section A 72, 632–644. https://doi.org/10.1107/S205327331601487X (2016).
Moscheni, D. et al. Size-dependent fault driven relaxation and faceting in zincblende CdSe colloidal quantum dots. ACS Nano 12, 12558–12570 (2018).
Masadeh, A. S. et al. Quantitative size-dependent structure and strain determination of CdSe nanoparticles using atomic pair distribution function analysis. Phys. Rev. B 76, 115413. https://doi.org/10.1103/PhysRevB.76.115413 (2007).
Yang, X. et al. Confirmation of disordered structure of ultrasmall CdSe nanoparticles from X-ray atomic pair distribution function analysis. Phys. Chem. Chem. Phys. 15, 8480–8486. https://doi.org/10.1039/C3CP00111C (2013).
Stelmakh, S., Skrobas, K., Gierlotka, S. & Palosz, B. Application of PDF analysis assisted by MD simulations for determination of the atomic structure and crystal habit of CdSe nanocrystals. Journal of Physics: Condensed Matter 30, 345901 (2018).
Stelmakh, S., Skrobas, K., Gierlotka, S. & Palosz, B. Atomic structure of nanodiamond and its evolution upon annealing up to 1200 C: Real space neutron diffraction analysis supported by MD simulations. Diamond and Related Materials 93, 139–149. https://doi.org/10.1016/j.diamond.2019.02.004 (2019).
Stelmakh, S., Skrobas, K., Gierlotka, S., Vogel, S. C. & Palosz, B. Atomic structure and grain shape evolution of nanodiamond during annealing in oxidizing atmosphere from neutron diffraction and MD simulations. Diamond and Related Materials 111, 108177. https://doi.org/10.1016/j.diamond.2020.108177 (2020).
Stelmakh, S., Skrobas, K., Gierlotka, S. & Palosz, B. Structure of plate-shape nanodiamonds synthesized from chloroadamantane-are they still diamonds?. Journal of Physics: Condensed Matter 33, 175002. https://doi.org/10.1088/1361-648X/abe26a (2021).
Stelmakh, S., Skrobas, K., Stefanska-Skrobas, K., Gierlotka, S. & Palosz, B. Distortion of SiC lattice induced by carbon-coating on (100) and (111) surfaces - ab-initio and molecular dynamics study. Surface Science 728, 122179. https://doi.org/10.1016/j.susc.2022.122179 (2023).
Stelmakh, S., Skrobas, K., Gierlotka, S. & Palosz, B. Formation of grain boundaries in nanocrystalline SiC ceramics examined by powder diffraction supported by MD simulations. Journal of Alloys and Compounds 978, 173474 (2024).
Ekeberg, T. Introduction to the virtual collection of papers on Artificial neural networks: applications in X-ray photon science and crystallography. Journal of Applied Crystallography 57(1), 1–2. https://doi.org/10.1107/S1600576723010476 (2024).
Chen, Z. et al. Machine learning on neutron and x-ray scattering and spectroscopies. Chemical Physics Reviews 2, 031301 (2021).
Wang, B.Y., Yager, K., Yu, D.T., & Hoai, M.: X-ray scattering image classification using deep learning. IEEE Winter Conference on Applications of Computer Vision, 697–704 (2017)
Ke, T. W. et al. A convolutional neural network-based screening tool for x-ray serial crystallography. Journal of Synchrotron Radiation 25, 655–670 (2018).
Nawaz, S. et al. Explainable machine learning for diffraction patterns. Journal of Applied Crystallography 56, 1494–1504 (2023).
Vecsei, P. M., Choo, K., Chang, J. & Neupert, T. Neural network based classification of crystal symmetries from x-ray diffraction patterns. Phys. Rev. B 99, 245120. https://doi.org/10.1103/PhysRevB.99.245120 (2019).
Chitturi, S. R. et al. Automated prediction of lattice parameters from x-ray powder diffraction patterns. Journal of Applied Crystallography 54, 1799–1810 (2021).
Corriero, N., Rizzi, R., Settembre, G., Buono, N. D. & Diacono, D. CrystalMELA: a new crystallographic machine learning platform for crystal system determination. Journal of Applied Crystallography 56(2), 409–419. https://doi.org/10.1107/S1600576723000596 (2023).
Prasianakis, N. I. AI-enhanced X-ray diffraction analysis: towards real-time mineral phase identification and quantification. IUCrJ 11, 647–648. https://doi.org/10.1107/S2052252524008157 (2024).
Billinge, S. J. L. & Proffen, T. Machine learning in crystallography and structural science. Acta Crystallographica Section A 80, 139–145. https://doi.org/10.1107/S2053273324000172 (2024).
Roberts, G., Nieh, M.-P., Ma, A. W. & Yang, Q. Automated structural analysis of small angle scattering data from common nanoparticles via machine learning. Digital Discovery 4, 1467–1477. https://doi.org/10.1039/D5DD00059A (2025).
Monge, N., Deschamps, A. & Amini, M.R.: Automated selection of nanoparticle models for small-angle X-ray scattering data analysis using machine learning. Acta Crystallographica Section A 80, 202–212 (2024) https://doi.org/10.1107/S2053273324000950
Allara, L., Bertolotti, F. & Guagliardi, A. A deep learning approach for quantum dots sizing from wide-angle x-ray scattering data. npj Computational Materials 10, 54 (2024).
Boruah, A. & Saikia, B. K. Synthesis, characterization, properties, and novel applications of fluorescent nanodiamonds. Journal of Fluorescence 32, 863–885 (2022).
Terranova, M. L., Orlanducci, S., Rossi, M. & Tamburri, E. Nanodiamonds for field emission: state of the art. Nanoscale 7, 5094–5114. https://doi.org/10.1039/C4NR07171A (2015).
Xiao, J., Li, J. L., Liu, P. & Yang, G. W. A new phase transformation path from nanodiamond to new-diamond via an intermediate carbon onion. Nanoscale 6, 15098–15106. https://doi.org/10.1039/C4NR05246C (2014).
Skrobas, K., Stelmakh, S., Gierlotka, S. & Palosz, B. A model of density waves in atomic structure of nanodiamond by molecular dynamics simulations. Diamond and Related Materials 91, 1–14. https://doi.org/10.1016/j.diamond.2018.10.020 (2019).
Stelmakh, S., Skrobas, K., Gierlotka, S. & Palosz, B. The shape and surface structure of detonation nanodiamond purified in oxidizing chemical environment. Diamond and Related Materials 113, 108286. https://doi.org/10.1016/j.diamond.2021.108286 (2021).
Skrobas, K. MD models of diamond grains and its X-ray diffraction patterns (2024). http://www.unipress.waw.pl/nanopdf/
Egami, T., & Billinge, S. Underneath the Bragg Peaks: Structural Analysis of Complex Materials. Elsevier, 2nd ed, Amsterdam (2012)
Skrobas, K. Program for building nanocrystals models and diffraction spectra calculations (2024).
Plimpton, S. Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics 117, 1–19 (1995).
Thompson, A. P. et al. Lammps - a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales. Computer Physics Communications 271, 1–34 (2022).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).
Chollet, F., et al. Keras. GitHub (2015). https://github.com/fchollet/keras
Chen, T., et al. xgboost: Extreme Gradient Boosting (2025). https://github.com/dmlc/xgboost
Chen, T., & Guestrin, C. Xgboost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794 (2016) https://doi.org/10.1145/2939672.2939785
Skrobas, K. Python scripts for shape recognition by AI algorithms (2024). https://github.com/kskrobas/shapeAIRecog
Onaka, S. Superspheres: Intermediate shapes between spheres and polyhedra. Symmetry 4, 336–343 (2012).
Miyazawa, T., Arateke, M. & Onaka, S. Superspherical-shape approximation to describe the morphology of small crystalline particles having near-polyhedral shapes with round edges. Journal of Mathematical Chemistry 50, 249–260. https://doi.org/10.1007/s10910-011-9909-1 (2012).
Stuart, S. J., Tutein, A. B. & Harrison, J. A. A reactive potential for hydrocarbons with intermolecular interactions. J. Chem. Phys. 112, 6472 (2000).
Stuart, S. J., Knippenberg, M. T., Kum, O. & Krstic, P. S. Simulation of amorphous carbon with a bond-order potential. Phys. Scr. 124, 58–64 (2006).
Jonsson, H., Mills, G., & Jacobsen, K.W. Classical and Quantum Dynamics in Condensed Phase Simulations Edited by B. J. Berne, G. Ciccotti, and D. F. Coker, pp. 385–404. World Scientific, Singapore (1998)
Hoover, W. G. Canonical dynamics: Equilibrium phasespace distributions. Phys. Rev. A 31, 1695–1697 (1985).
Matthews, B. W. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405(2), 442. https://doi.org/10.1016/0005-2795(75)90109-9 (1975).
Chicco, D. & Jurman, G. A statistical comparison between matthews correlation coefficient (mcc), prevalence threshold, and fowlkes-mallows index. Journal of Biomedical Informatics 144, 104426. https://doi.org/10.1016/j.jbi.2023.104426 (2023).
Ekimov, E. A. et al. High-pressure synthesis of nanodiamonds from adamantane: Myth or reality?. Diamond and Related Materials 103, 107718 (2020).
Qiu, X., Thompson, J. W. & Billinge, S. J. L. PDFgetX2: a GUI-driven program to obtain the pair distribution function from X-ray powder diffraction data. Journal of Applied Crystallography 37(4), 678. https://doi.org/10.1107/S0021889804011744 (2004).
He, K., Zhang, X., Ren, S., & Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034 (2015). https://doi.org/10.1109/ICCV.2015.123
Bayram, F., Ahmed, B. S. & Kassler, A. From concept drift to model degradation: An overview on performance-aware drift detectors. Knowledge-Based Systems 245, 108632 (2022).
Nagarajan, V., Andreassen, A., & Neyshabur, B. Understanding the failure modes of out-of-distribution generalization. ArXiv (2020)
Kang, K., Setlur, A., Tomlin, C., & Levine, S. Deep Neural Networks Tend To Extrapolate Predictably (2024). https://arxiv.org/abs/2310.00873
Assalauova, D., Ignatenko, A., Isensee, F., Trofimova, D. & Vartanyants, I. A. Classification of diffraction patterns using a convolutional neural network in single-particle-imaging experiments performed at X-ray free-electron lasers. J. Appl. Crystallogr. 55, 444–454 (2022).
Timmermann, S., et al. Automated matching of two-time x-ray photon correlation maps from phase-separating proteins with cahn-hilliard-type simulations using auto-encoder networks. J. Appl. Cryst. 55, 751–757 (2022)
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Contributions
KS - conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing-draft & editing, visualization, supervision. KSS - methodology, validation, formal analysis, writing-editing. SS - validation, formal analysis, resources, writing-editing, funding acquisition. SG - validation, formal analysis, resources, writing-review & editing. BP - conceptualization, validation, formal analysis, writing-review & editing.
Ethics declarations
Ethical approval
Not applicable.
Consent for participation and publication
All authors consented to contribute in publication.
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Skrobas, K., Stefańska-Skrobas, K., Stelmakh, S. et al. Application of machine learning for nanodiamonds shape and surface classification based on X-ray pattern analysis. Sci Rep 15, 40304 (2025). https://doi.org/10.1038/s41598-025-24143-z