Introduction

Diffraction data analysis of nanocrystalline materials must be supported by dedicated crystallographic software1,2. The reason is that the number of degrees of freedom of the surface atoms increases as the grain size decreases, which has a pronounced effect on the arrangement of the atoms in a nanograin compared to the bulk material. In consequence, atomic positions in nanocrystals deviate from those of a perfect crystal lattice. The origin of this behavior is the appearance of surface-induced strains that penetrate a considerable fraction of the nanocrystals' volume.

Complete information on the size, shape and atomic structure of nanocrystals can be derived from diffraction data only by applying sophisticated procedures that involve computing model powder patterns with the Debye scattering equation, using software specifically dedicated to nanocrystalline materials, e.g.3,4. The analysis can be based either on the diffraction pattern, e.g.4,5, or on the Pair Distribution Function, e.g.6,7. In6,7 it is reported that the lattice of nanocrystals 2 nm in size and smaller is considerably different from that of larger/bulk crystals. In the past we addressed this issue by using models of nanocrystals relaxed by Molecular Dynamics simulations and confirmed a good agreement between experimental and simulated data for CdSe8, diamond9,10,11, and SiC12,13. Yet, this kind of analysis is very time consuming and can be used only occasionally. In consequence, it has little chance of becoming a standard method recommended for structural characterization of nanomaterials.

The breakthrough in information processing methods made in the last decade came with the introduction of Machine Learning (ML) techniques. In the present paper, we demonstrate results of applying ML to the analysis of diffraction data of nanocrystals in order to learn about nanoparticle shapes and surfaces. The superiority of ML over other numerical techniques for statistical analysis comes from its ability to discover relationships between objects fully independently during the training stage.

Crucial for ML algorithm performance is the amount of data available for processing and training. When applying ML to the solution of crystallographic problems, those multiple datasets are either directly produced during the experiment or are obtained from existing databases14,15. In so-called serial crystallography, multiple "single shot" diffraction patterns are collected at a pulsed radiation source, and the ML technique is a method of choice for data categorization and the aggregation of meaningful information16,17,18. For rapid determination of crystal symmetry, or even indexing of the patterns of unknown substances by ML classifiers, a suitable training dataset may be pulled from the over one million identified crystal structures stored in open and commercial databases19,20,21. A comprehensive list of the recent works on the application of ML in classical crystallography can be found in the review articles14,22,23.

Diffraction by nanocrystals has also been addressed with ML techniques. It has been demonstrated that the sizes and shapes of nanocrystals can be determined using small-angle scattering data, e.g.24,25, as well as wide-angle data, e.g.26. It must be stressed, however, that recently solely Neural Network algorithms have been used, which makes it impossible to identify the features of the diffraction data that are relevant to the analysis.

In principle, nanoparticles may be conveniently visualized by high-resolution electron microscopy27,28,29. However, recovering 3D shape and size information from 2D images may not be reliable, especially when a size distribution is present. Due to the strong dependence of the width of Bragg peaks on the crystallite size in the nano-region, powder diffraction might seem to be a convenient tool for determining their size. Provided the peaks are well separated, the shape of the crystallites might be determined by measuring the relative intensities and widths of the peaks corresponding to different directions in the crystal lattice. However, this is not the case for nanoparticles of only 1–5 nm in size, for which the above simple relationships between size, shape and Bragg reflections are not fulfilled. A single nanograin is not a small, perfect single crystal; its internal structure is both size and shape dependent. Its diffraction pattern reflects the presence of intrinsic strains existing in individual nanoparticles, e.g.30. Since the presence of internal strains has a pronounced effect on the widths, positions and relative intensities of Bragg peaks, a straightforward analysis of individual Bragg peaks is not feasible for determining the size and shape of a few-nm-sized nanocrystalline materials.

Our study is dedicated to these extremely small nanocrystals, and here we present results of recognizing nanoparticle shapes from diffraction data with ML algorithms based on a supervised learning method.

The experimental diffraction data required for training, sufficient for the statistical analysis needed for nanoparticle shape recognition, are not available, since nanomaterials of precisely controlled particle sizes and shapes are still scarce and none are available in sufficiently many size-shape modifications. Training and testing of the ML classifiers, which we apply in this work to nanodiamonds, was therefore entirely based on simulated diffraction data. In a series of works, we showed that the real structure of nanograins is well reproduced by Molecular Dynamics (MD) simulations. Based on MD-simulated models of CdSe, diamond and SiC, and comparing theoretical to experimental diffraction data, we were able to derive information on very fine details of the nanocrystals' atomic structure as well as determine the orientation of the crystal-terminating surfaces and the crystal habit (CdSe8, diamond9,10,11,31 and SiC13). That work required, however, a tedious comparison between experimental diffraction data and theoretical diffraction patterns of MD-simulated atomic models of nanograins. In the current report, we delegate this task to ML algorithms. To examine the applicability of ML algorithms to grain shape recognition, we selected diamond as a model system for its structural simplicity and its similarity to a wide class of substances with hexagonal close-packed (HCP) structure that are of importance for nanoscience, e.g. CdSe, ZnO, SiC, GaN, GaP. Since we are in possession of extremely small nanodiamonds which we have already measured and characterized by PDF analysis11, this work serves to test the applicability of ML algorithms to actual experimental diffraction data and, at the same time, to verify our previous results.

The grain models generated for building the training database were composed of between 100 and 5000 atoms, i.e. their sizes were between 1 nm and 4 nm for 3D shapes. Three shape categories were considered, namely rods (1D), plates (2D), and superspheres/superellipsoids (3D), see Fig. 1. The term "superspheres" is used here to describe nearly isotropic crystallites following the "supersphere equation" used in this work to create appropriate models; see Supplementary Materials. The models that meet the requirements of ML are available for download from32.

Fig. 1
figure 1

Examples of nanodiamond models generated by supershape/superellipsoid procedures of npcl software package: (a) plate, (b) supersphere, and (c) rod-like shapes.

Nanograin models with initially perfect diamond lattices were the subject of MD simulations. The MD calculations introduce collective thermal motions, i.e., phonons, as well as the surface-induced lattice strains, and so provide a realistic approximation of the atomic structure of actual nanoparticles at \(T=300\) K. Based on these models, X-ray powder diffraction patterns were calculated using the Debye formula33.

The diffraction data under analysis were structure functions S(Q), where Q is the modulus of the scattering vector. S(Q) is essentially the diffraction pattern and contains only the information on the crystal structure: it is cleaned of the characteristics of the scattering processes and of the sample's chemical composition. Another feature relevant for structure analysis is that there is no need to analyze the full range of Q-values, owing to the high information redundancy of diffraction data. We show that an almost arbitrary range of S(Q) data can be selected and still yield low-error predictions. This is important for the processing of experimental data, whose characteristics differ from theoretically obtained ones, so that such a selection may be necessary.

Prior to ML-based data processing, the irrelevant signals should be removed from the experimental data20. Here, this was done with the PDFgetX2 software in the way routinely used for PDF analysis. Next, the data were subjected to removal of the high-frequency noise and to a further background correction. Details are given in Sect. "Shape determination" and in Supplementary Materials.

The software package used for model building and diffraction data calculation was the npcl program34. This program is a successor of the NanoPDF642 program, with greatly enhanced capabilities, and operates on both Windows® and Unix/Linux machines. The MD simulations were run under the LAMMPS software package35,36. The three ML classification algorithms used were: Random Forest, from the Scikit-Learn v. 1.0.2 package37; Neural Network, from the Keras package38; and Extreme Gradient Boosting from39,40. The Python scripts for ML training and validation, experimental data processing, and shape recognition are available at41.

Both theoretical and experimental diffraction data were X-ray powder patterns obtained for \(\lambda =0.561\)Å wavelength (silver anode), in the Q-range up to \(Q^{Ag}_{max}\approx 21\)Å−1.

Specific problems of diffraction data analysis of nanograins

Conventional analysis of the size and shape of grains of a polycrystalline material refers to the width and relative intensities of Bragg reflections. Considering only powder samples with no preferred orientation and no externally induced strains, the simple rules are such that broader Bragg peaks mean smaller grain dimensions, while the differences in peak widths and heights may indicate anisotropy of the grain shape. Such simple rules work well only if the peaks in the pattern are well separated, i.e., for particles of about 5 nm and larger but not for smaller nanograins.

For high-symmetry structures, such as diamond, extraction of shape information from Bragg peaks is even more challenging, as demonstrated in Fig. 2. The figure shows that even for the [111] direction, which directly corresponds to the height of a plate/rod-shaped grain, the information on the grain height is contained in only a portion of the intensity and shape of the measured 111 peak. This is because the other three of the four equivalent [111] directions that contribute to the 111 peak are inclined with respect to the given axis of the plate/rod particle. This also concerns other reflections that combine information on the dimensions of a particle in multiple directions, e.g. [220] and \([\bar{2}20]\), Fig. 2.

Fig. 2
figure 2

Relationships between selected directions for plate-like model.

Comparison of the S(Q) plots of the three types of grain shapes presented in Fig. 3 shows that there are no strict correlations between relative Bragg intensities and specific grain shapes. For instance, one may expect that for a plate-shaped grain the relative intensity of the 220/111 reflections should always be larger than for an isotropic supersphere grain and smaller than for rod-shaped grains. Figure 3 shows that this rule is fulfilled for the grains with about 5000 atoms, but not for the smaller grains: (i) for rods, the 111/220 ratio is about 1.2; it is only a little larger than 1 for a grain with 1500 atoms, but smaller than 1.0 for rods with 200 atoms; (ii) for plates with 5000 atoms, the 111/220 ratio is about 0.8; it is only a little smaller than 1 for grains with 1500 atoms, and for the smallest plate it is about 1.0. An obvious conclusion that follows from Fig. 3 is that the relative intensities of Bragg reflections alone, measured for a few-nm nanograins, do not contain unique and sufficient information on their actual shape.

Fig. 3
figure 3

Structure function of MD simulated models and its dependency on size and shape.

The other, even more difficult problem with shape identification that is specific to nanograins is that their internal structure evolves with size, shape and surface structure. All these factors have different effects on the diffraction patterns. In Fig. 4a, the S(Q) of a model with a perfect crystal lattice is compared to the S(Q) of the same model after MD relaxation. It demonstrates that both the relative intensities and the peak positions change due to the lattice relaxation induced by the MD simulation. A similar effect is presented in Fig. 4b for grains with the same number of atoms but different shapes. In Fig. 4c, the S(Q) of MD-simulated models of superspheres with 200 and 5000 atoms are compared, showing different widths but also different peak positions. The complexity of the correlations between diffraction effects and grain shape is additionally demonstrated in Fig. 4d, which shows that for grains of the same size and shape but different atomic structures of their surfaces, the lattice relaxation proceeds differently; this is manifested by different widths and positions of the Bragg peaks. The above examples show that the conventional crystallographic tools for determining grain size and shape that apply to "ordinary" polycrystalline materials may not work for nanosized materials, for which surface-induced strains determine the internal structure of the individual grains. In this work, using nanodiamonds as a model material, we show that by employing Artificial Intelligence techniques through the application of ML, one can still assign real nanodiamond samples to a shape category and also discern between grains terminated by different types of surfaces.

Fig. 4
figure 4

Comparison of S(Q) functions and their dependency on (a) type of theoretical method of calculations, (b) model shape, (c) model size and (d) subtype of (111) plane. Insets show the zoom of 111 reflection, triangles mark the position of a peak.

Nanodiamond models building and diffraction patterns calculations

The npcl program34 was used for the creation of a collection of models with specific types of shapes, see Fig. 1. The model shaping procedure is based on analytical formulas of superspheres (SS) given by42,43. They have been generalized so as to elongate the initial models of grains in a given direction to create super-ellipsoid shapes and control the shape of the cross section of plate/rod-like models. More details are given in Supplementary Materials.

An initial collection of 18 000 models of nanodiamond grains with all kinds of shapes was built; the largest diameter was 5.5 nm for plates and 4 nm for superspheres, and the maximum height of rods was 4.5 nm. The models from the database were divided into three groups of shapes based on the proportions between length (L), width (W) and height (H).

The following definition of grain shapes was used:

  • superspheres/superellipsoids like shapes: \(height \approx width \approx length\) and \(H/L,H/W=0.8{\div }1.2\)

  • plate-like shapes: \(height<width,length\) and \(H/L,H/W<0.8\)

  • rod/cylinder-like shapes: \(height>width,length\) and \(H/L,H/W>1.2\)

In the case of plates and rods, the models were additionally selected by the aspect ratio of their cross section: only models with \(\max (L/W,\, W/L)<1.5\) were accepted. Of the 18 000 models, only 14 000 fulfilled the above criteria, and only those were further used for training and grain shape identification.
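As an illustration, the shape-assignment rules above can be sketched as a small Python function. The function name, the cross-section acceptance test, and the exclusion of models with ambiguous aspect ratios are our own assumptions for this sketch, not part of the npcl package.

```python
# Hypothetical sketch of the shape-assignment rule described above.
def classify_shape(length, width, height):
    """Assign a grain model to a shape category from its L, W, H."""
    # Cross-section acceptance: reject overly elongated cross sections.
    if max(length / width, width / length) >= 1.5:
        return None  # model excluded from the training set
    h_l, h_w = height / length, height / width
    if 0.8 <= h_l <= 1.2 and 0.8 <= h_w <= 1.2:
        return "supersphere"
    if h_l < 0.8 and h_w < 0.8:
        return "plate"
    if h_l > 1.2 and h_w > 1.2:
        return "rod"
    return None  # ambiguous aspect ratios are discarded

print(classify_shape(4.0, 4.0, 1.5))  # plate
```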

The MD simulations were done under the LAMMPS package35,36. The interactions between atoms were given by the AIREBO potential function44,45. The simulation protocol included several steps. In the first step, a given model was virtually heated to 150 K and then subjected to preliminary force and energy minimization by the quickmin algorithm46. Next, the temperature was gradually increased up to 300 K, equal to the temperature at which the experimental spectra were measured. This stage took 5000 simulation steps, which corresponds to 5 ps of real time. When the sample reached a steady temperature, it was allowed to relax for 2500 steps. During the last 1000 simulation steps, instantaneous atomic positions were recorded every 10th step and stored for further calculations. The temperature was controlled by the Nosé-Hoover thermostat47, probing every 0.1 ps. For several randomly selected models, the total internal energy was monitored to observe the relaxation process. The fully relaxed models showed energy fluctuations of less than 1%, regardless of the number of atoms and the shape of the simulated grain.

The instantaneous atomic positions from the last steps served for the calculation of the corresponding Pair Distribution Histograms (PDH). These were averaged to obtain the average PDH, which was used to calculate the S(Q) structure function via the Debye scattering equation.
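The Debye-equation step can be sketched for a monatomic model as follows. This is a minimal illustration assuming unit atomic form factors and the simple normalization \(S(Q) = 1 + (2/N)\sum_r n(r)\,\sin(Qr)/(Qr)\) over histogram bins; it is not the npcl implementation, which handles form factors and instrumental details.

```python
import numpy as np

# Minimal sketch of the Debye scattering equation evaluated from a
# pair-distance histogram (PDH) for a monatomic model (unit form factors).
def debye_sq(q, bin_centers, bin_counts, n_atoms):
    """S(Q) ~ 1 + (2/N) * sum_r n(r) * sin(Q r)/(Q r)."""
    q = np.asarray(q, dtype=float)[:, None]            # shape (nq, 1)
    r = np.asarray(bin_centers, dtype=float)[None, :]  # shape (1, nr)
    sinc_term = np.sinc(q * r / np.pi)                 # np.sinc(x) = sin(pi x)/(pi x)
    return 1.0 + 2.0 / n_atoms * (np.asarray(bin_counts) * sinc_term).sum(axis=1)

# Usage: three atoms on a line at 1.54 A spacing (two pairs at 1.54 A, one at 3.08 A)
q = np.linspace(1.0, 16.0, 200)
sq = debye_sq(q, bin_centers=[1.54, 3.08], bin_counts=[2, 1], n_atoms=3)
```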

Selection and creation of ML classifiers

Generally, the application of ML classifiers does not require assumptions about internal dependencies of the input data, but only labeling of the models, which in our case assigns diffraction data to shape categories. To obtain a possibly even size distribution of grains with various shapes and to avoid under- or over-representation of certain model sizes, the dataset was subdivided into equal-width bins. Every bin contained models with numbers of atoms differing by not more than 100, and 25 randomly selected models were kept in each bin. Finally, the training set consisted of models with numbers of atoms ranging from 100 to 5000, i.e. \(49\times 25=1225\) S(Q) patterns for each shape. To make the ML classifiers resistant to numerical and experimental errors that may affect the peak positions of the experimental S(Q), every pattern was loaded three times, shifted along the Q-axis by \(-0.05\%\), \(0\%\) and \(+0.05\%\) (\(\delta Q < {\pm }0.05\) Å−1). The S(Q) functions were normalized by setting the height of the strongest peak to 1 and then dividing the whole S(Q) pattern by its standard deviation.
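A minimal sketch of the Q-shift augmentation and the two-step normalization described above. The array names and the interpolation-based implementation of the shift are our assumptions; the shift fractions and the normalization order follow the text.

```python
import numpy as np

# Sketch of the augmentation (three Q-shifted copies) and normalization
# (strongest peak -> 1, then divide by the standard deviation).
def augment_and_normalize(q, sq, shifts=(-0.0005, 0.0, 0.0005)):
    """Return Q-shifted, normalized copies of one S(Q) pattern."""
    out = []
    for s in shifts:
        # Shift the pattern along Q by resampling at q*(1 + s).
        shifted = np.interp(q, q * (1.0 + s), sq)
        shifted = shifted / shifted.max()   # strongest peak set to 1
        shifted = shifted / shifted.std()   # whole pattern divided by its std
        out.append(shifted)
    return out

q = np.linspace(1.0, 9.0, 500)
patterns = augment_and_normalize(q, np.exp(-(q - 3.0) ** 2))
```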

We utilized three types of classifiers with significantly different principles of operation, namely Random Forest (RF), Neural Networks (NN) and eXtreme Gradient Boosting (XGB). In general, RF is a meta-classifier that may use a set of different types of sub-classifiers; here we solely employ Decision Tree sub-classifiers. The advantage of RF is that it allows a straightforward examination of the points and/or areas of the data used by the classifier for decision making, commonly referred to as relevance analysis (RA) - see Sect. "Shape determination". RF training requires the selection and adjustment of hyperparameters, which control the classifier's development stage. In this study, three hyperparameters were selected for tuning the RF classifiers, namely n_estimators, max_features and max_samples. Their meaning is as follows:

  • n_estimators - number of estimators (decision trees) in a forest

  • max_features - controls the maximum number of input data features taken for the best split of a tree node,

  • max_samples - maximum number of samples to draw from the training set needed for training purposes of an estimator

The search for the best hyperparameter values was delegated to the GridSearch (GS) algorithm. In every training instance, the GS selected new optimum values of the hyperparameters from the following sets:

  • \(n\_estimators=\lbrace 50, 100, 150, 200, 250, 300\rbrace \cdot N_{ld}\)

  • \(max\_features=\lbrace 0.8,\ 1.0,\ 1.2\rbrace \cdot \sqrt{N_{ss}}\)

  • \(max\_samples=\lbrace 0.8,\ 0.9,\ 0.95\rbrace \cdot N_{ss}\)

where: \(N_{ld}\) is the number of data points of a pattern; \(N_{ss}\) is the number of samples per shape in the current learning set.
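This grid search can be sketched with scikit-learn's GridSearchCV as below. The data are random placeholders, the \(N_{ld}\) scaling of n_estimators is dropped to keep the example small, and max_samples is passed as a fraction of the training set, which is how scikit-learn expresses it; none of these choices are claimed to reproduce the authors' scripts.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

n_ld = 60   # points per pattern (illustrative)
n_ss = 36   # samples per shape in the learning set (illustrative)
X = np.random.rand(3 * n_ss, n_ld)   # placeholder S(Q) patterns
y = np.repeat([0, 1, 2], n_ss)       # three shape labels

param_grid = {
    "n_estimators": [50, 100, 150],                                # paper scans 50..300
    "max_features": [int(f * np.sqrt(n_ss)) for f in (0.8, 1.0, 1.2)],
    "max_samples": [0.8, 0.9, 0.95],                               # fraction of training set
}
gs = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
gs.fit(X, y)
print(gs.best_params_)
```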

Contrary to the RF method, NNs do not offer RA analysis: they give only the final results, without the possibility of tracing the rules governing the choice. The architecture of the NN used in this work remained unchanged for all shape recognition cases. The NN architecture was optimized by observing the precision and Matthews metric (see Sect. "The range of S(Q) function required for shape recognition") versus the number of layers and neurons. It was established that using more than 3 layers does not improve the ML performance. The number of neurons was the same for each hidden layer but depended on the reflection range. The NN classifiers were constructed as follows:

  • activation function of hidden layers - sigmoid

  • size of hidden layers against the peak ranges:

  • \(111\div 422\) - 1312 neurons

  • \(111\div 331\) - 1125 neurons

    • \(220\div 422\) - 1328 neurons

    • \(220\div 331\) - 875 neurons

    • \(311\div 422\) - 968 neurons

    • \(311\div 331\) - 1468 neurons

  • activation function of output layer - softmax

  • loss function - sparse_categorical_crossentropy

Similarly, the number of epochs during training also depended on the Q-range; it was established through experiments and ranged from 300 to 600. The scheme of the NN is shown in Fig. 5. For shape prediction, only two hidden layers were used; for surface identification, three hidden layers were needed - see Sect. "Surface structure identification".
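A sketch of such an NN classifier in Keras, with two sigmoid hidden layers, a softmax output, and the sparse_categorical_crossentropy loss as specified above. The input size is an illustrative placeholder, and 1312 neurons corresponds to the widest peak range listed; the optimizer choice is our assumption.

```python
from tensorflow import keras

# Illustrative sizes: n_features is a placeholder; 1312 neurons follows
# the widest peak range quoted above; three shape classes.
n_features, n_hidden, n_classes = 1000, 1312, 3

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(n_hidden, activation="sigmoid"),   # hidden layer 1
    keras.layers.Dense(n_hidden, activation="sigmoid"),   # hidden layer 2
    keras.layers.Dense(n_classes, activation="softmax"),  # output layer
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```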

Similarly to RF, the structure of XGB is based on Decision Trees. However, in this case the trees are not independent but are built one after another, each tree correcting the errors of the previous ones. XGB also provides RA analysis. Training was done by the GS method for the following hyperparameters:

  • \(max\_depth=3\div 5\) - controls the number of levels of the decision trees

  • \(learning\_rate=0.5\div 1.0\) - controls the speed of convergence to the optimal model at each iteration

  • \(n\_estimators=30\div 60\) - similarly to RF, the number of estimators (boosted trees)

Contrary to RF, the XGB predictions were only weakly dependent on hyperparameter variations. The classification of the experimental data was done for the same collection of hyperparameters but differed by the models used for training. For shape classification, max_depth, learning_rate and n_estimators were set to 4, 0.8 and 40, respectively; for surface classification, to 5, 0.7 and 40. It was also observed that the XGB classifier training time was at least 10 times longer than for the RF or NN classifiers.

The above algorithms were tested to determine their level of noise resistance. The tests were conducted on the training sets by incorporating quasi-random noise corresponding to that present in real data. It was found that even for relatively high noise levels, exceeding 2–3 times that observed experimentally, the number of correct responses is comparable to that for noise-free data. A higher level of noise, however, produces considerably more expanded decision trees and longer training times, especially in the case of XGB. Similarly, for NN, the number of neurons and hidden layers must be increased to compensate for the effect of noise; since this requires a larger range of hyperparameters during optimization, it increases the training time. Taking the above into account, to increase the reliability of shape and surface identification from experimental diffraction data, selection of low-noise Q-ranges and filtering of the examined data were applied, see Sect. "Shape determination".
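A minimal sketch of such a noise-resistance test: additive Gaussian noise scaled to a chosen multiple of the pattern amplitude. The Gaussian model and the relative-level parameterization are our assumptions; the text specifies only "quasi-random noise corresponding to that present in real data".

```python
import numpy as np

# Sketch of noise injection into a training S(Q) pattern.
def add_noise(sq, level=0.02, seed=0):
    """Return a copy of S(Q) with Gaussian noise at the given relative level."""
    rng = np.random.default_rng(seed)
    scale = level * np.abs(sq).max()
    return sq + rng.normal(0.0, scale, size=sq.shape)

q = np.linspace(1.0, 9.0, 500)
clean = np.exp(-(q - 3.0) ** 2)          # placeholder pattern
noisy = add_noise(clean, level=0.04)     # roughly "2x experimental" level
```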

Fig. 5
figure 5

Scheme of the NN architecture for nanodiamond shape recognition. The number of hidden layers depends on the type of classification; for surface type predictions, an additional third hidden layer was added.

ML classification of theoretically calculated X-ray diffraction patterns

The range of S(Q) function required for shape recognition

The goal of the spectrum range analysis was to find the optimum number of data points/features needed to determine the shape of nanodiamond grains. The initial S(Q) data consisted of 85001 points, covering the range \(Q\approx 0.98\div 15.84\) Å−1 (\(2\theta =5^{\circ }\div 90^{\circ }\)). For this full Q-range, a high and satisfactory level of information redundancy was obtained. Nevertheless, to choose a consistent data representation, additional constraints were considered in order to reach three goals:

(i) determining the smallest number of peaks to analyze while maintaining a low level of errors, (ii) accounting for the presence of different types of distortions (like measurement errors, noise, high-level background, etc.) present in experimental data, and (iii) preventing the possibility of under- or over- fitting of classifiers.

The evaluation of the classifiers was based on the Matthews Metric (MM)48,49. The MM returns a number between −1 and +1, where −1 denotes an inverted prediction, 0 random responses, and +1 error-free predictions. Contrary to other metrics, the MM takes into account all kinds of predicted answers, which are assigned to one of four groups: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). To evaluate the outcome of a single classification, we also utilized the precision metric (PM), defined as \(PM=\frac{TP}{TP+FP}\).
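Both metrics are available in Scikit-Learn; a toy three-class example, where labels 0/1/2 could stand for supersphere/plate/rod (macro-averaging of the precision over classes is our choice for the multiclass case):

```python
from sklearn.metrics import matthews_corrcoef, precision_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]   # one plate misclassified as a rod

mm = matthews_corrcoef(y_true, y_pred)                  # in [-1, +1]
pm = precision_score(y_true, y_pred, average="macro")   # TP / (TP + FP), per class
print(round(mm, 3), round(pm, 3))
```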

The optimum range of the data used for the analysis was examined with reference to the MM by changing the beginning and the width of the Q-range. Since the experimental S(Q) peaks above \(10\) Å−1 are weak and barely visible, the Q-range was limited to the area covering the first seven strongest peaks, Fig. 6b. The calculations were performed for several Q-ranges, starting from solely the 111 peak and then subsequently including the next peaks in the analyzed Q-range, up to the seventh, 511 peak. Next, the Q-range was gradually shortened on the low-Q side, until finally only the last, 511 peak was included in the examined data.

Fig. 6
figure 6

(a) Matthews metric of ML classifiers trained on consecutively added peak collections (on x-axis are shown only Miller indices of boundary peaks), (b) normalized to highest peak S(Q) function (blue line) and corresponding RA (red line) for the collection of peaks ranging 111\(\div\) 422.

For each selected Q-range, the training and testing of the ML classifiers was performed, and the corresponding MM was calculated. Throughout this work, the data for training and validation of the classifiers were chosen from the initial set in accordance with the Pareto principle: randomly selected 80% of the models were assigned to the training set, while the remaining 20% were designated as a validation collection.
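The 80/20 (Pareto) split described above can be sketched with scikit-learn; the stratification by shape label is our addition to keep the three classes equally represented, and the arrays are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(300, 40)    # placeholder S(Q) patterns
y = np.repeat([0, 1, 2], 100)  # three shape labels

# 80% training / 20% validation, stratified by shape label.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```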

The summary of the MM computations is shown in Fig. 6a. For the first four Q-ranges, covering reflections 111, 111 \(\div\) 220, 111 \(\div\) 311 and 111 \(\div\) 400, one observes a continuous increase of the MM, which then remains constant as the next three reflections are included. The shape prediction classifiers achieve an MM above 95% for all Q-ranges except the single reflection 511, for which the MM is only about 80%.

The MM characteristics presented in Fig. 6a demonstrate that for near-perfect shape recognition the Q-range covering the initial seven peaks is optimal; however, the absence of the 111 or even the 220 peak does not result in a significant increase of misclassifications.

Indeed, the broad plateau of the MM plot in Fig. 6a indicates a high level of information redundancy in the S(Q) spectrum, and therefore any number or type of peaks, or even a single peak, may be used for shape analysis of nanodiamond grains. However, too small a number of peaks increases the likelihood of unfavorable overfitting of the ML classification algorithms, substantially limiting the effective and proper use of the available information. In consequence, it decreases the robustness of the analysis of the experimental data. This is particularly evident in the case of ML trained solely on the 511 peak, where a significant decrease in the number of correct answers is observed.

The RA analysis presented in Fig. 6b serves for further selection of the peaks and for justifying the choice of Q-ranges required according to the MM analysis. The S(Q) function (blue line) is presented along with the plot of the RA (red line) corresponding to the full Q-range. The RA shows that the principal regions of analysis for the ML classifier are located near the peaks 111, 220, 400, 331 and 422, with a low significance of the 311 peak and the background. The RA also indicates that the RF algorithm examines specific areas multiple times, although the construction of the algorithm imposes that every data point is visited only once.

Fig. 7
figure 7

Experimental S(Q) data; non shaded area is taken for classification by ML. Q-ranges taken for analysis are shown on top.

ML classification of experimental data

Shape determination

Experimental diffraction data were obtained for nanodiamonds synthesized from the adamantane \(C_{10}H_{16}\) molecular precursor under HP-HT conditions50. Samples with average grain sizes of 1.19, 1.2, 1.28, 2.70 and 3.30 nm were selected for the present study. All samples were identified as plates terminated by (111)B surfaces with three dangling bonds31. X-ray powder diffraction data were collected with a Bruker D8 Advance diffractometer equipped with an Ag-anode radiation source.

Fig. 8
figure 8

(a) average MM metrics of shape determination for RF/NN/XG classifiers trained on MD data, (b-d) results of experimental S(Q) data shape classification by RF/NN/XG classifiers respectively.

Principal data reduction was performed with the PDFgetX2 software51. The experimental data originally contained 1321 points, with \(Q\approx 0.98 \div 21.6\) Å−1 (\(2\theta =5^{\circ }\div 150^{\circ }\)), which were mapped onto the theoretical data points by cubic spline interpolation. The experimental S(Q) functions were subjected to noise reduction with a 5th-order Savitzky-Golay filter and to background correction, see Fig. 7. More details on the background correction are given in Supplementary Materials, Figs. S1 and S2. Next, for the shape classification, the points above \(Q=9.0\) Å−1 were excluded (shaded area in Fig. 7) due to insufficient data quality.
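These two preparation steps can be sketched with SciPy: cubic-spline remapping onto the theoretical Q-grid followed by Savitzky-Golay smoothing. The window length, grids, and the synthetic noisy pattern are illustrative assumptions; polyorder=5 follows the filter order quoted in the text.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import savgol_filter

# Placeholder "measured" pattern: a peak plus high-frequency noise.
q_exp = np.linspace(0.98, 21.6, 1321)            # measured Q-grid, 1321 points
sq_exp = np.exp(-(q_exp - 3.0) ** 2) + 0.01 * np.sin(40 * q_exp)

q_theo = np.linspace(0.98, 9.0, 4000)            # theoretical grid, Q <= 9 A^-1
sq_mapped = CubicSpline(q_exp, sq_exp)(q_theo)   # remap by cubic spline
sq_smooth = savgol_filter(sq_mapped, window_length=101, polyorder=5)
```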

Based on the full collection of theoretical S(Q) functions described in Sect. "The range of S(Q) function required for shape recognition", and using the same protocol as above, ML classifiers were trained utilizing the RF, NN and XGB algorithms and applied to six different sets of Bragg peaks. For all classifiers, the MM metrics were above 90%. From each group of generated classifiers, the one with the highest MM was selected and used for shape prediction on the experimental data.

The MM metrics for the classifiers within the selected groups of peaks are shown in Fig. 8a. The MM values range from 96% to 99% for the RF/NN classifiers and decrease with a reduction of the Q-range. For the XGB classifiers, higher MM values are observed, showing a weak dependence on the number of peaks under examination.

The results of shape recognition from the experimental data by the ML classifiers based on the RF algorithm are shown in Fig. 8b. The shape of the largest, 3.3 nm grains is recognized as plates by all classifiers. For the medium-sized samples, two types of shapes are recognized, with plates dominating. The grains of the smallest-size samples are recognized as either plate- or rod-like shapes, depending on the data range analyzed: the rod-like shapes are recognized by ML classifiers trained on data with the 111 peak present, while plates are recognized by classifiers with the 111 reflection excluded. An exception is observed for the two smallest-grain samples, where rod-like shapes are identified by classifiers when the 311 and 422 peaks are taken into account.

The results of shape classification by NN are shown in Fig. 8c. There is full agreement between the RF and NN classifiers for three sets of Bragg peaks, i.e. \(220\div 422\), \(220\div 331\), and \(311\div 331\), for which plate-like shapes are recognized. A similar result was found for the \(311\div 422\) peak group, except that supersphere shapes are detected for the sample containing the smallest nanodiamond grains.

As above, the shapes predicted by XGB, Fig. 8d, are rods for S(Q) data with the 111 peak included; if the 111 peak is ignored, plate-like shapes dominate. Furthermore, the XGB classifiers indicated superspheres in two cases, but these findings do not coincide with the predictions of the RF/NN classifiers. They are identified for the largest and medium-sized models when the S(Q) data start from the 331 peak.

Surface structure identification

The results presented in Sect. “Shape determination” indicate that all types of classifiers are capable of recognizing the shape of nanodiamonds and are fully consistent with our previous findings11, which allowed the study to be extended to an analysis of the plate surfaces.

Fig. 9

Topography of the (111) surface types of plate-like shapes; the lighter colors indicate atoms with 3 dangling bonds, while the darker ones are assigned to atoms with 1 dangling bond.

Fig. 10

(a) Average MM metrics for RF/NN/XGB classifiers trained on MD data; (b-d) results of experimental surface classification based on S(Q) data by RF/NN/XGB classifiers, respectively.

There are two subtypes of the (111) surface, denoted in this work as A and B, which differ by the number of dangling bonds of the surface atoms, as shown in Fig. 9. Accordingly, the plate-shaped nanodiamonds may be of 3 types, terminated by A-A, B-B, or A-B pairs of 111 surfaces31. Subsequently, the plate-like models from the database were selected and assigned to the appropriate group, in accordance with the methodology outlined in the preceding Sections. The number of models available for training was smaller in this case, since only plates were needed, and consequently the selection and training protocols were modified. The number of models in a bin was set to 15 and the total number of models in each category was 750. To obtain high values of the MM metric, cf. Fig. 10b, and to mitigate experimental/numerical/overfitting errors, every S(Q) pattern was taken 5 times, by shifting the given pattern along the Q-axis, see Sect. "Selection and creation of ML classifiers".
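The five-fold replication by Q-axis shifts can be sketched as below. The shift magnitude and the use of linear re-sampling onto the original grid are assumptions for illustration; the paper's exact shift protocol is described in Sect. "Selection and creation of ML classifiers".

```python
import numpy as np

def augment_by_q_shift(q, s, n_copies=5, max_shift=0.05):
    """Replicate one S(Q) pattern by small random shifts along the Q-axis.

    Each shifted pattern is re-sampled back onto the original Q-grid by
    linear interpolation; the shift range (in 1/A) is an assumed value.
    """
    rng = np.random.default_rng(0)
    copies = []
    for _ in range(n_copies):
        dq = rng.uniform(-max_shift, max_shift)
        # Evaluate the shifted pattern S(Q - dq) on the original grid
        copies.append(np.interp(q, q + dq, s))
    return np.stack(copies)

q = np.linspace(1.0, 9.0, 400)
s = np.sin(q)                          # toy stand-in for one S(Q) pattern
augmented = augment_by_q_shift(q, s)   # 5 shifted copies, shape (5, 400)
```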

The training of the RF and XGB classifiers proceeds similarly to the shape recognition case; see Sect. “Shape determination”. The generation of NN classifiers here required a modification: the network was extended to three deep layers, each with the same number of units, dependent on the size of the data, see Sect. "Selection and creation of ML classifiers". The activation function was the Gaussian Error Linear Unit (GELU), with the weight initialization proposed in52. The Batch Normalization technique was used to normalize the output of each layer (except for the outermost one). A strong overfitting phenomenon was observed when training was done on a very narrow Q-range of S(Q) data, with the \(311\div 422\) peaks and the \(311\div 331\) peaks only. To correct for this, the Dropout technique was employed, with a dropout rate of 20%.
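A forward pass through such a three-layer architecture can be sketched in plain NumPy. The layer width, batch size, and input length are placeholders, and learned batch-norm scale/shift parameters are omitted; only the layer pattern (GELU, then batch normalization, then 20% dropout) follows the text.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x**3)))

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the batch (learned scale/shift omitted)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def dropout(x, rate, rng):
    # Inverted dropout: zero a fraction of units, rescale the rest
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
n_units = 128                          # assumed; the paper ties this to data size
x = rng.normal(size=(32, 200))         # a batch of 32 truncated S(Q) patterns

h = x
for layer in range(3):                 # three deep layers of equal width
    W = rng.normal(scale=np.sqrt(2.0 / h.shape[1]),
                   size=(h.shape[1], n_units))
    h = batch_norm(gelu(h @ W))
    h = dropout(h, rate=0.20, rng=rng)  # 20% dropout against overfitting
```

A final classification layer (without batch normalization, per the text) would map `h` to the surface-type classes.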

The MM metrics for RF/NN/XGB classifiers are shown in Fig. 10a. The MM values of all classifiers are about 95% when at least 4 peaks are taken, starting from either 111 or 220. A significant decrease of MM is observed only for the pattern range starting from the 311 peak, especially for NN. In the majority of cases the MM values for RF classifiers are slightly higher than those for NN and XGB.

The results of the application of the ML algorithms to the experimental patterns are presented in Fig. 10b-d and show that the classifiers recognize the B-B surface configuration as dominant when the 111 peak is ignored. This result also conforms to the shape recognition case, where inclusion of the 111 peak strongly affects the result of shape identification, see Sect. “Shape determination”.

Discussion

Although information on the shape of nanograins is definitely contained in the diffraction patterns, a combination of multiple factors affecting the shape of the diffraction data effectively precludes shape determination of nanocrystals with conventional analytical tools dedicated to polycrystalline materials.

A critical problem in shape recognition of nanograins comes from the fact that the atomic structure of an individual grain depends on its size, shape and the specific structure of its surfaces. In this work we show that application of ML to shape analysis of nanodiamonds allows one to overcome these difficulties and may be an effective tool for identification of the shape of nanograins. The crucial stage of grain shape recognition is the collection of reference diffraction data, which in our case are theoretical patterns calculated for MD-simulated models of grains. ML is able to effectively analyze the network of interrelations between various parts and features of the diffraction patterns of grains with various shapes and surfaces. ML classifiers are capable of differentiating the objects even if a thorough comparison of their diffraction patterns reveals only fuzzy, hard-to-quantify differences.

The results presented in the paper demonstrate that the performance of the ML classifiers depends strongly on the statistics hidden in the collections of models used for training. For example, the predictive abilities diminish significantly when the training sets do not cover the full range of the analyzed crystallite sizes. Thus the rules for grain shape determination must also account for a possibly broad range of grain sizes. This conclusion illustrates a well-known issue called out-of-distribution degradation53,54,55.

Analysis of the MM metrics derived for the examined cases shows that classifiers require diverse training statistics to efficiently process the information contained in the diffraction patterns. The shape classifications of the modeled particles indicate that ML classifiers are capable of following the rules imposed on the internal atomic structure of individual grains by the strains developed from the grain surfaces. It is also apparent that ML algorithms indirectly incorporate into the analysis the deviation of the atomic structure of nanograins from the perfect crystal lattice, which varies depending on the specific atomic structure of the grain surfaces. This feature, specific to a few-nm-sized grains, is of key importance for identification of the atomic architecture of the basal planes of nanodiamond plates, which in our case are terminated by carbons with three dangling bonds.

An independent check of the ML shape classification of the experimental data reported in Sect. “Shape determination” is provided by the results presented in11. Based on very detailed numerical analysis it was demonstrated that nanodiamonds synthesized from chloroadamantane take the plate-like shape, terminated on both sides by B-type surfaces. Exactly such results are provided by the RF, XGB and NN classifiers trained on the diffraction patterns not containing the 111 peak, with the exception of the smallest-size grains and classifiers trained on very limited Q-ranges. It is apparent that the presence of the 111 reflection in the experimental diffraction patterns under examination confuses the ML classifiers. The reason for this deficiency may be a combination of instrumental and calculation errors. The experimental diffraction data were obtained on an in-house diffractometer, where at low diffraction angles the direct beam divergence, the opening of the incoming and receiving slits, and the detector geometry are difficult to account for precisely. Also, the intensity correction factors such as Lorentz-Polarization are large and rapidly changing at low diffraction angles, so computational errors may result in an incorrect intensity and shape of the 111 peak.

A growing number of research teams are addressing the identification of crystal structure parameters with ML algorithms. They predominantly employ neural networks for analyzing selected topics of significance in this domain, such as lattice parameter estimation20 and pattern classification56,57. This paper further explores these topics and illustrates that utilizing three distinct ML algorithms instead of a single one provides greater confidence in the outcomes.

The reported identification of the grain shape from the experimental data may be considered satisfactory, although it is not fully unequivocal. Uncertainties concerning shape identification are not unexpected, since they can result both from the samples’ characteristics, e.g., particle size and shape distributions, and from experimental deficiencies, e.g., statistical noise. One must also note that in analyzing real nanomaterials one searches for a preferred or most frequently occurring particle shape. Depending on the particular sample, the answer is not necessarily definite, e.g., if none of the shapes occurs more frequently than the others.

Summary

This work presents an analysis of the shape of nanodiamond grains based on diffraction data with the application of Machine Learning techniques. The shape of nanodiamond grains was recognized based on the database created from theoretical structure functions S(Q) calculated for models of the grains after their relaxation using MD simulations.

The classifiers were generated using three distinct methodologies, namely Random Forest, Neural Networks, and Extreme Gradient Boosting. It is demonstrated that these methods produce ML classifiers with similarly high rates of correct classification. The RF/NN/XGB classification can discern between grains with the same shape but different surfaces, namely it shows that nanodiamond plates are covered by 111 surfaces terminated by atoms with three dangling bonds.

The paper demonstrates a high degree of redundancy of shape information in the diffraction data. This allowed us to safely reduce the data range used for identification and to remove the pattern regions most prone to experimental deficiencies without a negative impact on the shape and surface predictions.