Introduction

The microstructure of polycrystalline metallic materials is a key factor in determining their physical properties1,2,3,4. The grain structure is the most readily observable microstructural feature in most polycrystalline metals and has a significant influence on the structural behavior of the metal1,5,6. Specifically, grain shape (morphological texture) and lattice orientation (crystallographic texture) can lead to directional anisotropy in mechanical properties5,6,7,8. In addition, morphological and crystallographic texture can influence the contribution of different deformation mechanisms, such as dislocation slip and deformation twinning, to the overall deformation behavior of a metallic material7,8. As such, the grain structure can be engineered to enhance specific properties such as strength and ductility7,9,10,11. Accurately defining the correlation between the grain structure of a metal and its properties is critical for developing useful computational methods for predicting and designing the mechanical response of materials.

Crystal plasticity simulations offer a powerful tool for studying the influence of grain morphology and lattice orientation on the mechanical response of crystalline materials12,13,14,15,16,17,18. Together with a theory representing the elastic-plastic response of the single crystals, crystal mechanics simulations are done by synthesizing a grain structure that mimics the microstructural characteristics of the material. Typically, the information for these grain structures is initially obtained through electron backscatter diffraction (EBSD) datasets, which provide detailed mappings of grain orientations. A great deal of effort is required to ensure that the characteristics of the synthesized grain structure accurately reflect the actual microstructure. Grain morphology is usually modeled using 3D software such as Dream3D, where the number of grains in the synthetic structure can vary from a few tens to thousands, influenced by the available computational resources and the choice of the crystal plasticity simulation approach13,14,15,16.

However, while typical EBSD datasets for grain structures may encompass hundreds to thousands of grains, the grains included in a representative volume element (RVE) might be reduced to manage computational expenses effectively. The extent of this reduction can vary, depending on the specific requirements and objectives of the simulation. Such reductions necessitate meticulous attention to ensure that the comprehensive grain information is accurately represented in the RVE13,15,16,17,19. Given the complexities involved in capturing such detailed microstructural information within a reduced-order dataset, maintaining the integrity of the crystallographic texture without overwhelming computational resources is a challenge in crystal plasticity simulations13,15,16,17,19.

Generating accurate reduced-order Euler angle datasets is relevant not only for crystal plasticity simulations but also for enhancing the effectiveness of other modeling techniques such as reduced-order modeling and surrogate modeling20,21,22,23,24,25,26. These methods aim to streamline complex physical models, facilitating faster simulations while maintaining high accuracy. For instance, in reduced-order modeling, key characteristics of a material’s behavior can be efficiently encapsulated with fewer degrees of freedom, proving essential for real-time simulations and iterative design processes21,22,23,24,26. Similarly, surrogate models, which act as proxies for more intricate simulations, can be optimized through the use of reduced-order datasets that capture microstructural characteristics critical to material properties20,25. As these modeling approaches are still evolving, particularly in their capacity to predict material deformation behavior, they stand to gain significantly from advanced methods that efficiently compress dataset sizes while accurately representing the microstructure of the material.

Machine learning (ML) based approaches have already been used to solve complex problems associated with EBSD datasets27,28,29,30. For instance, K. Kaufman et al. demonstrated the use of deep learning, specifically a few-shot transfer learning approach, for classifying electron backscatter diffraction patterns (EBSPs)27. By leveraging transfer learning, the authors significantly accelerated model training with limited data, demonstrating the effectiveness and efficiency of the method in classifying complex EBSPs. Britton et al. introduced an unsupervised ML approach to segment EBSPs of the γ matrix from those of the γ precipitate in a Cu/Ni-based super alloy28. Backscatter patterns produced by an alloy solid solution matrix and its ordered superlattice exhibit only extremely subtle differences, due to the inelastic scattering that precedes coherent diffraction. Hence, distinguishing the precipitate from the matrix is challenging, and the developed method accomplished this successfully. K. Krishna et al. utilized conditional generative adversarial networks (c-GANs) for de-noising EBSPs, leading to improvements in indexing success rates and pattern accuracy29. Z. Ding et al. used generative models to simulate EBSPs, showcasing the potential for more flexible analysis frameworks30. These contributions reflect the evolving application of ML in addressing the challenges associated with traditional EBSD analysis, enabling more precise and efficient materials characterization.

In this research, we introduce an unsupervised ML approach, termed texture adaptive clustering and sampling (TACS), which incorporates K-means clustering and density-based sampling to generate reduced-order datasets from large EBSD datasets. K-means clustering and density-based sampling have been used to address complex problems in various research domains. For instance, these techniques have been used in cancer research to classify patient subgroups, aiding in personalized treatment strategies31,32,33,34. Similarly, density-based sampling has been utilized in agriculture to enhance crop yield predictions by analyzing field spatial variability35. Furthermore, these methods have been utilized in composite engineering for analyzing material properties and predicting composite behavior under stress conditions36,37. A. Leitherer et al. used K-means clustering for segmenting atomic signals from background noise in scanning transmission electron microscopy images, further showcasing the versatility of the method in materials science research38. These examples illustrate the adaptability of K-means clustering and density-based sampling in extracting useful information from complex datasets across different fields.

The algorithm developed in this study consists of two main steps: first, identifying the optimal number of clusters using K-means clustering; second, conducting density-based sampling from these clusters with iterative refinement to generate a representative dataset. We first demonstrate the technique using EBSD datasets from a quenched and tempered steel. The TACS approach is then tested on a broad array of materials with distinct crystallographic textures and lattice structures, demonstrating the robustness of the developed method.
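The two steps described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the number of clusters `k` is fixed by hand rather than optimized, the iterative refinement is omitted, "density" is approximated by each cluster's share of the data, and plain Euclidean distance on Euler angles ignores angular periodicity and crystal symmetry. The function names `kmeans`, `allocate`, and `reduce_orientations` are hypothetical.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means on an (M, d) float array; returns one label per row."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center: (M, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        updated = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):  # an emptied cluster keeps its previous center
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels


def allocate(counts, n_samples):
    """Split n_samples across clusters in proportion to cluster population."""
    exact = counts / counts.sum() * n_samples
    quota = np.floor(exact).astype(int)
    # Give the samples lost to rounding to the largest fractional remainders.
    leftover = int(n_samples - quota.sum())
    order = np.argsort(exact - quota)[::-1]
    quota[order[:leftover]] += 1
    return quota


def reduce_orientations(euler, n_samples, k=8, seed=0):
    """Draw n_samples rows of an (M, 3) Euler-angle array, cluster by cluster."""
    rng = np.random.default_rng(seed)
    labels = kmeans(euler, k, seed=seed)
    quota = allocate(np.bincount(labels, minlength=k), n_samples)
    picked = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if quota[j] == 0:
            continue
        # Sample with replacement only if more rows are requested than exist.
        with_replacement = bool(quota[j] > len(idx))
        picked.append(rng.choice(idx, size=int(quota[j]), replace=with_replacement))
    return euler[np.concatenate(picked)]
```

Because whole rows are sampled, the three Euler angles of each orientation stay together in the reduced dataset.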

Results and discussion

Proof-of-principle evaluation

The pole figure (PF) maps of the EBSD dataset of a rolled and recrystallized low carbon (0.23 wt.% C) steel that was used to develop the algorithm are depicted in Fig. 1a. The rolling direction (RD) and normal direction (ND) of the 12.7 mm thick plate are indicated for reference. The EBSD analysis identified only the iron-BCC phase, resulting from the quenching step in the manufacturing process. It is possible that EBSD did not detect retained austenite or carbides in the material, or mis-indexed their patterns39; however, this is outside the scope of this study. The data were collected with a step size of 0.250 µm over a region of roughly 120 × 90 µm², yielding 165088 indexed points. The dataset consists of 2068 grains and is therefore large enough to represent the grain structure of the material. The PF maps reveal a weak (100) texture, twice the random distribution, in the rolling direction of the plate. The (110) and (111) directions also show weak textures at an angle of 45 degrees to the RD, possibly due to the (100) texture in the RD. The weak texture of the EBSD dataset, characterized by more randomly oriented grains with a broad variation in orientations, poses a significant challenge in generating a representative reduced-order synthetic Euler angle dataset. Accurately capturing the diverse orientation characteristics of such weakly textured materials typically requires a large number of grains, in contrast to strongly textured materials where fewer grains suffice.

Fig. 1: Comparison of the PF of the Euler angle datasets generated from different methods.

a Raw dataset, b PDF mapping, c KDE approach, d TACS approach. Visual comparison of the PF maps indicates that the TACS approach generates the most accurate representative dataset.

Considering that crystal plasticity simulations typically use 100 to 500 grains to model polycrystalline materials, a dataset size of 350 grains was chosen for this case study. To evaluate the effectiveness of the TACS approach, datasets were also generated using probability distribution function (PDF) mapping and the kernel density estimation (KDE) approach, both implemented with the MTEX toolbox in MATLAB. The PF maps of the generated representative datasets are presented in Fig. 1. A visual examination indicates that the PF maps derived from the PDF mapping method (Fig. 1b) fail to adequately capture the texture of the grain structure even with 512 data points. While the PDF method can provide a general statistical overview of orientation data, it lacks the granularity and spatial consideration required to capture the complex information present in EBSD datasets. Moreover, the PDF approach discretizes continuous data into bins or intervals. When applied to the orientations in an EBSD dataset, this process can introduce artificial periodicity into the generated PF map if the bin sizes or the distribution assumptions do not align with the actual data distribution, which explains the periodic pattern observed in the (100) PF map. The PF maps from the KDE approach (Fig. 1c) show a significant improvement over those from the PDF mapping. However, these maps still do not align well with the PF maps of the raw dataset; in fact, the weak texture present in the grain structure is not captured by the generated dataset. This can be attributed to the bandwidth sensitivity of the KDE method, since the bandwidth defines the smoothness of the estimated density function. If the bandwidth is too large, the KDE oversmooths the data, losing local texture details and failing to capture sharp features of the pole figure. Conversely, a narrow bandwidth can lead to overfitting, where noise in the data is mistaken for actual textural features. Therefore, neither the PDF nor the KDE method, as applied here, is effective at generating a representative reduced-order synthetic Euler angle dataset for the EBSD dataset of the rolled and recrystallized low carbon (0.23 wt.% C) steel. While it may be possible to adjust the bandwidth parameter in KDE to obtain a more accurate representative dataset, these results indicate that neither method can reliably generate representative reduced-order Euler angle datasets for an arbitrary EBSD dataset without significant manual, and potentially biased, adjustment.
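The bandwidth sensitivity discussed above can be illustrated with a one-dimensional Gaussian KDE. This is a simplified sketch, not the orientation-space estimator provided by MTEX: the sample values are hypothetical, and the function names `gaussian_kde` and `peak_to_background` are introduced here for illustration only.

```python
import math


def gaussian_kde(samples, x, bandwidth):
    """Gaussian kernel density estimate of `samples`, evaluated at x."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return total / norm


# Hypothetical 1D angle values (degrees): a tight cluster around 45 degrees
# on top of an evenly spaced background between 0 and 90 degrees.
SAMPLES = [45.0 + 0.5 * i for i in range(-4, 5)] + [float(a) for a in range(0, 91, 10)]


def peak_to_background(bandwidth):
    """Ratio of the estimated density at the cluster to that in the background."""
    peak = gaussian_kde(SAMPLES, 45.0, bandwidth)
    background = gaussian_kde(SAMPLES, 15.0, bandwidth)
    return peak / background


if __name__ == "__main__":
    for h in (2.0, 30.0):
        print(f"bandwidth = {h:4.1f} deg, peak/background = {peak_to_background(h):.2f}")
```

With the narrow bandwidth the cluster stands out sharply against the background, whereas the wide bandwidth flattens it until it is barely distinguishable, which mirrors the loss of texture detail described above.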

The PF maps produced using the TACS approach, as shown in Fig. 1d, replicate the PF maps of the raw dataset substantially better than the two earlier methods. They are almost identical to the PF maps of the raw dataset, and the weak (100) texture in the RD is accurately captured by the generated dataset. The weak textures in the (110) and (111) directions are also captured, although the texture in (111) is slightly stronger than in the raw dataset. Overall, a visual comparison of the PF maps generated by the different techniques suggests that the TACS approach significantly outperforms the PDF and KDE mapping methods.

The statistical distribution of the generated datasets was quantitatively assessed using the Kolmogorov-Smirnov (K-S) test. The K-S test is well suited to comparing the generated datasets because it is a non-parametric test that assesses whether the datasets are drawn from the original distribution without assuming any particular distribution shape40,41,42. The K-S test was performed for each Euler angle, and the average was taken. The p-values for the PDF mapping, KDE approach, and TACS approach are 0.001, 0.23, and 0.94, respectively. The lower p-values for the PDF mapping and KDE approach suggest that these methods do not generate datasets that accurately reflect the raw EBSD dataset. In contrast, the p-value for the TACS approach, being close to 1, suggests its suitability for generating representative Euler angle datasets that accurately mimic the weak texture of this dataset. Because the PDF method yielded by far the lowest p-value, it was excluded from further analysis, and only the KDE approach was retained for comparison with the TACS approach.
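The per-angle comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this study: it uses the two-sample K-S statistic with the asymptotic Kolmogorov series for the p-value, and the function names are hypothetical. Library routines such as `scipy.stats.ks_2samp` provide the same test.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:  # the maximum gap is attained at one of the data points
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value, Q(lam) = 2 * sum_j (-1)^(j-1) * exp(-2 j^2 lam^2)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.05:  # the series converges slowly here; the limit is 1
        return 1.0
    series = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                 for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))


def mean_ks_pvalue(raw, reduced):
    """Average the K-S p-value over the three Euler-angle columns."""
    pvalues = []
    for col in range(3):
        a = [row[col] for row in raw]
        b = [row[col] for row in reduced]
        pvalues.append(ks_pvalue(ks_statistic(a, b), len(a), len(b)))
    return sum(pvalues) / len(pvalues)
```

A value close to 1 means the test finds no evidence that the reduced dataset differs from the raw one in any of the three angle distributions. Note that testing each angle separately does not capture correlations between the angles.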

Therefore, the TACS approach demonstrates its potential capability in accurately capturing weak textures within materials, a critical aspect often overlooked in traditional methods. In the case of the rolled and recrystallized low-carbon steel with its weak (100) texture, the method effectively replicates these subtle orientation characteristics. Unlike conventional approaches which tend to overlook or inadequately represent such weak textures, this method ensures that even the slightest variations in grain orientations are accurately reflected in the synthetic datasets. This precision is particularly crucial for materials where weak textures play a significant role in their overall properties and behavior.

Robustness assessment across diverse EBSD datasets

The robustness of the TACS approach was assessed using twenty varied EBSD datasets. Nine datasets were collected by the authors, while eleven were sourced from the literature, covering a range of crystallographic textures and lattice structures (cubic, hexagonal, tetragonal). These datasets included intricate PF map patterns and ranged from \(10^{2}\) to \(10^{4}\) grains, offering a comprehensive statistical foundation for validation. The smallest datasets, numbered 18 and 20, contained 400 and 277 grains, respectively. Larger EBSD datasets for these specific minerals were not available, so these were the largest that could be sourced. Further details about the datasets are provided in the Methods section. Overall, this diversity in the datasets allowed testing of the adaptability of the method to different intrinsic material characteristics.

As in the proof-of-principle evaluation on the rolled and recrystallized low-carbon steel, 350 data points were initially used to generate representative Euler angles for each dataset. For comparison, the PF maps of raw datasets 4, 9, 13, and 19, along with the PF maps of the corresponding representative datasets generated by the TACS and KDE approaches, are presented in Fig. 2. These four datasets were selected as a representative subset of the twenty datasets used in this study. For example, datasets 4 and 9 exhibit strong crystallographic textures, whereas the other two do not. Moreover, the lattice structure of dataset 4 is face-centered cubic (FCC), dataset 9 is hexagonal close-packed (HCP), dataset 13 is body-centered cubic (BCC), and dataset 19 (quartz) has a hexagonal lattice. The PF maps of the representative Euler angle datasets generated via the TACS approach are almost identical to the PF maps of the raw datasets 4 (Fig. 2a), 9 (Fig. 2b), and 13 (Fig. 2c). Specifically, the strong textures in datasets 4 and 9 are accurately reproduced by the generated datasets. However, there is a deviation in the \((11\bar{2}0)\) PF map of dataset 19 (Fig. 2d), where the maximum texture intensity is overestimated by approximately 10% in the TACS approach.

Fig. 2: The comparison of the PF maps of the selected datasets with 350 datapoints.

a Dataset 4, LPBF SS316L, b Dataset 9, DED Ti-6Al-4V, c Dataset 13, low carbon steel (0.18%C)23, d Dataset 19, quartz24. The PF maps of the datasets generated by the TACS approach are almost identical to the PF maps of the raw datasets whereas the maps generated by the KDE approach failed to capture the texture of the raw dataset accurately.

In contrast, the PF maps of the representative datasets generated via the KDE approach show significant deviations in crystallographic texture. The strong textures in datasets 4 (Fig. 2a) and 9 (Fig. 2b) are significantly weaker in the generated datasets. The textures in datasets 13 (Fig. 2c) and 19 (Fig. 2d) are also underestimated in the generated datasets. For instance, the \((10\bar{1}0)\) texture in dataset 19 (Fig. 2d) is not reflected in the generated dataset. The K-S test scores for the four datasets generated by the TACS approach are 0.89, 0.95, 0.90, and 0.95, respectively, whereas those for the four datasets generated by the KDE approach are 0.01, 0.16, 0.01, and 0.02, respectively. This further indicates that the datasets generated by the TACS approach are a better statistical representation of the raw datasets. The PF maps of the other datasets are presented in Supplementary Information 1.

It is important to note that the datasets were generated by providing only the raw datasets and the required number of data points for the representative dataset. No parameters were modified in the algorithm to obtain these representative datasets. Therefore, the TACS approach is capable of autonomously generating accurate representative datasets without the need for human intervention or bias. This contrasts markedly with techniques such as the KDE approach, which often require manual adjustments of parameters to accurately capture the crystallographic textures of materials. The autonomous nature of TACS not only streamlines the dataset generation process but also eliminates potential biases and inconsistencies inherent in manual parameter tuning. This capability ensures that TACS can consistently produce high-quality datasets, reflecting the actual grain structures with minimal user input, thus significantly improving the practicality and applicability of the method in diverse material science research contexts. Moreover, this study demonstrated that TACS could generate representative datasets for EBSD datasets with different lattice structures, such as body-centered cubic, face-centered cubic, hexagonal, orthorhombic, and monoclinic. This demonstrates the versatility of the approach and its capability to handle complex lattice structures, suggesting promising potential for application to low-symmetry lattices such as triclinic systems.

The versatility of the TACS approach is further demonstrated by its ability to effectively handle datasets of varying dimensions and complexities, similar to how it processes \(M\times 3\) matrices for Euler angles, where M is the number of Euler angles in the EBSD dataset. The TACS algorithm is designed to recognize and maintain the relationships between coupled data columns, ensuring that each set of parameters, whether they represent Euler angles or other microstructural features such as grain size, aspect ratio, and the number of neighbors, is treated as a linked entity. This capability is relevant when feeding the algorithm with a \(P\times Q\) matrix, where P represents the number of grains and Q represents different parameters. By employing K-means clustering alongside density-based sampling, the TACS approach can effectively preserve the intrinsic coupling of these parameters, maintaining the integrity and coherence of the microstructural characteristics in the reduced-order datasets. This adaptability underscores the potential of TACS to serve as a foundational tool in the development of advanced material models that require the integration of complex and varied data inputs.

Furthermore, the developed TACS approach is versatile enough to meet diverse research needs. For instance, it can be employed to assign Euler angles to grains within the three-dimensional space of an RVE. This advanced application is particularly important for ensuring that the microstructure synthesized within the RVE accurately represents the textural characteristics across its entire volume, a necessary feature for crystal plasticity simulations. As a proof of concept, we have successfully applied this extended TACS method to Dataset 4, LPBF SS316L. A detailed demonstration of this methodology is presented in Supplementary Information 2. This example not only highlights the capability of the method to transition from 2D to 3D textural representations but also underscores the potential of TACS to facilitate complex modeling tasks. However, it is important to note that the current implementation best handles scenarios with uniform textural distributions. True 3D textural data, which accounts for variations across the thickness of a sample, would require additional experimental methods such as serial sectioning with FIB-EBSD or 3D X-ray imaging. The challenge of inferring 3D information from 2D maps cannot be completely resolved through computational approaches alone due to the inherent lack of depth information in 2D analyses. This limitation underscores the need for further experimental studies to fully capture the three-dimensional architecture of microstructures when such detail is necessary.

Performance analysis

The performance of the TACS approach was evaluated by varying the number of data points in the Euler angle datasets from 10 to 500. This enables a quantitative evaluation of the effectiveness of the TACS approach in reducing dataset size while capturing the nuanced characteristics of the crystallographic texture. Eleven test cases were created by incrementing data points by 50, plus two additional conditions at 10 and 25 points. Given that datasets 18 and 20 originally consisted of only 400 and 277 grains, respectively, the TACS approach was used to generate additional data points for cases requiring more points than the original dataset contains. The increase in grain count is achieved by synthesizing grains that are consistent with the observed crystallographic textures and densities within each cluster. This step infers additional grains from the statistical properties and spatial distributions of the data points within each cluster, effectively increasing the granularity of the dataset. It could also be helpful when an EBSD dataset is not large enough to develop computational models. The K-S test was used as the metric to assess the statistical representation of the raw data, with results illustrated in Fig. 3. Datasets from the KDE approach had K-S values under 0.35, indicating a poor statistical representation, even with 500 data points. Conversely, the TACS approach yielded K-S values above 0.5, with 80% exceeding 0.7 even for the datasets with only 50 data points, demonstrating a significant improvement.

Fig. 3: Variation of K-S test scores for sample sizes from 10 to 500 data points.

The K-S test scores achieved using the TACS approach consistently exceed 0.5, demonstrating its ability to statistically represent raw datasets. In contrast, the KDE approach fails to reach even 0.3, underlining its limitations in accurately characterizing the datasets even with increased data points.

For datasets with strong textures, such as 4, 5, and 9, the generated datasets with 10 and 25 data points resulted in K-S values below 0.5. For the raw datasets with weak textures, by contrast, even generated datasets with 10 and 25 data points maintained K-S values above 0.5 due to their inherent randomness. With 50 data points, the generated datasets for all twenty raw datasets reported K-S values higher than 0.7. This indicates that the TACS approach can reduce the dataset size by one to two orders of magnitude without losing the statistical integrity of the raw dataset. The K-S scores did not consistently correlate with the number of data points, varying between 0.6 and 1, which reflects the random nature of sampling, the sensitivity of the clustering algorithm, and the complexity of EBSD data.

The variation in the PF maps with the number of data points was also studied. For comparison, PF maps of the generated datasets with 25 and 50 data points for the same datasets presented in Fig. 2 are depicted in Fig. 4. The PF maps of the generated datasets with 25 data points for datasets 4 and 9 resemble the crystallographic texture of the raw dataset. However, the strong texture of dataset 4 in the (110) direction is underestimated in the generated dataset, where the peak is six times the random distribution compared to ten times in the raw dataset. Similarly, the strength of the crystallographic texture in dataset 9 is underestimated by the generated dataset. The PF maps of generated datasets with 50 data points for datasets 4 and 9 not only mimic the crystallographic texture of the raw dataset but also match its strength. Based on this comparison, 50 data points is the smallest dataset size that can accurately mimic the crystallographic texture of these strongly textured raw datasets. The raw datasets 5 and 9 consist of 2588 and 8572 grains, respectively. Hence, the TACS approach reduced the size of the raw datasets by one to two orders of magnitude without losing their textural information. Similar results were achieved for the other datasets with strong crystallographic textures (texture strength higher than 2.5). In contrast, the PF maps of the generated datasets for datasets 13 and 19, which have weak crystallographic textures, do not resemble the weak textures present in the raw datasets. In fact, the generated datasets overfit the raw data, which could be attributed to the higher randomness in these datasets. Therefore, datasets with a higher number of data points, such as 350, would be required to mimic the crystallographic texture of a material with a weak texture. This observation was consistent across the other datasets with weak crystallographic textures (texture strength below 2.5). Overall, the TACS approach successfully generates representative datasets that are smaller than the raw dataset by one to two orders of magnitude, while accurately mimicking both the crystallographic texture and statistical characteristics of the raw datasets.

Fig. 4: The comparison of the PF maps of the selected datasets with 25 and 50 datapoints.

a Dataset 4, Laser powder bed fusion (LPBF) SS316L, b Dataset 9, Directed energy deposition (DED) Ti-6Al-4V, c Dataset 13, low carbon steel (0.18%C)23, d Dataset 19, quartz24. The PF maps of the generated datasets for datasets 4 and 9 resemble the strong texture of the raw datasets. Those for datasets 13 and 19 do not represent the weak texture of the raw datasets.

Another application of the performance analysis is determining the optimal number of data points to represent a grain structure. While TACS can significantly reduce the size of datasets without losing textural information, the K-S score allows for a systematic approach to identify the minimum dataset size that maintains the integrity of the crystallographic texture and statistical characteristics of the raw datasets. In addition to the K-S test score, the orientation distribution function (ODF) can also be employed to determine the optimal number of data points.
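One way to turn the K-S scores into a dataset-size choice is sketched below. The acceptance threshold of 0.7, the rule that every larger tested size must also pass, the function name `minimum_size`, and the example scores are all assumptions made for illustration; they are not results or procedures from this study.

```python
def minimum_size(scores, threshold=0.7):
    """Smallest tested size such that it and every larger tested size pass.

    `scores` maps a dataset size to its K-S score. Requiring all larger sizes
    to pass as well guards against a single lucky draw at a small size, since
    the scores do not increase monotonically with the number of data points.
    """
    best = None
    for size in sorted(scores, reverse=True):
        if scores[size] < threshold:
            break
        best = size
    return best


if __name__ == "__main__":
    # Made-up scores for illustration only.
    example = {10: 0.40, 25: 0.45, 50: 0.80, 100: 0.75, 150: 0.90}
    print(minimum_size(example))
```

For a weakly textured material, the score alone may be too permissive, so a check of the PF maps or the ODF at the selected size remains advisable.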

To summarize, the TACS method developed in this study offers an improved approach for replicating weak texture in low-carbon steel. This method demonstrates enhanced performance over traditional techniques, particularly in accurately representing subtle crystallographic textures. Validation on additional materials and across dataset sizes confirmed the statistical accuracy of TACS, and its K-S scores indicate a clear advantage over existing methods. Furthermore, the TACS method distinguishes itself by its ability to autonomously generate accurate datasets without human intervention and bias, streamlining the data preparation process and reducing the potential for manual error. By accurately mimicking the actual grain structures, the proposed approach enables the development of more representative RVEs. These enhanced RVEs lead to simulations that are not only more precise but should also require fewer computational resources. As a result, this approach promises substantial improvements in the predictive modeling of metal behavior under various stress conditions, which is crucial for advancing materials science and engineering.

Methods

The details of the dataset preparation and the algorithms of the TACS approach, PDF mapping, and KDE approach using MTEX built-in functions are presented below. The EBSD raw datasets were pre-processed using the MTEX toolbox available for MATLAB, and the clustering was performed in the Python environment.

Preparation of the raw datasets

Datasets gathered by authors were collected by using two microscopes. Datasets 1-6 were collected using a Zeiss Gemini 300 Field Emission Scanning Electron Microscope (FESEM) equipped with an Oxford Instruments CNano EBSD detector. Datasets 8-9 were collected using a JEOL 7400 FESEM equipped with an Oxford Instruments Symmetry EBSD detector. Data collection was performed at 30 kV with various step sizes. For each dataset, the step size was determined by performing a low-resolution, fast EBSD scan on the selected location. Samples for the EBSD were prepared by mechanical polishing. The polishing consisted of multiple steps, beginning with a coarse polish using standard silicon carbide grit papers of grit sizes 240, 320, 400, 600, 800, and 1200 grades, followed by fine polishing using 3 µm and 1 µm polycrystalline diamond suspensions. Finally, vibratory polishing was performed using 50 nm colloidal silica to remove all the work hardening accumulated on the surface from the previous polishing steps. The details of the datasets that were gathered from the literature are provided in the relevant sources.

Preparing the EBSD datasets for analysis involves a series of steps, each integral to ensuring the accuracy and usability of the data in subsequent analyses. Initially, the datasets, stored in the .ctf format, are processed using the MTEX toolbox in MATLAB, which is well suited to handling and interpreting the complex data contained in EBSD scans. During this phase, the primary focus is on extracting Euler angles from the EBSD data. These angles are extracted from the .ctf files and arranged into a matrix with dimensions \(M\times 3\), where M represents the number of measurements or data points in the dataset. Each row of this matrix contains three Euler angles, corresponding to the three-dimensional orientation of a crystal at a specific point in the material. For materials with more than one phase, the primary phase with the highest percentage of indexing was selected for the analysis. Grains smaller than five pixels were excluded from the analysis. Once the Euler angles are extracted and organized into this matrix, the data are saved into a MATLAB file with a .mat extension.

The subsequent step converts the data into a format suited to Python-based analysis. The .mat files are converted into .npy files, the native array format of NumPy, a fundamental package for scientific computing in Python. This conversion is performed using Python scripts that read the .mat files, extract the Euler angle array, and save it as a .npy file. The .npy format stores large arrays efficiently and allows the data to be loaded easily into Python for further processing, which makes the data accessible to the Python libraries and tools for data analysis and machine learning.
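As a minimal sketch of the conversion described above, the following assumes the MATLAB script saved the matrix under a variable named euler_angles; that name, and the function names, are hypothetical and not taken from the original scripts. It relies on scipy.io.loadmat, which reads .mat files saved in MATLAB v7 or earlier formats (v7.3 files are HDF5 and need a different reader).

```python
import numpy as np


def extract_euler_array(mat_contents, key="euler_angles"):
    """Pull the M x 3 Euler-angle array out of a loaded .mat dictionary.

    `key` is a hypothetical variable name; use whatever name the MATLAB
    script assigned when saving.
    """
    angles = np.asarray(mat_contents[key], dtype=float)
    if angles.ndim != 2 or angles.shape[1] != 3:
        raise ValueError(f"expected an M x 3 array, got shape {angles.shape}")
    return angles


def convert_mat_to_npy(mat_path, npy_path, key="euler_angles"):
    """Read a .mat file and write its Euler angles to a .npy file."""
    # Imported here so the helper above can be used without SciPy installed.
    from scipy.io import loadmat

    angles = extract_euler_array(loadmat(mat_path), key)
    np.save(npy_path, angles)
    return angles
```

The saved array can then be reloaded with np.load(npy_path) for the downstream analysis.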

The datasets are summarized in Table 1, which provides metadata for each dataset, such as material name, fabrication method, number of grains, strength of the texture, primary phase(s), and lattice structure. The datasets feature a variety of fabrication methods, including quenching, tempering, and two additive manufacturing techniques: DED and LPBF. They cover a range of metals, including SS316L, Ti-6Al-4V, and low-carbon steel. Datasets 10-15 consist of a duplex microstructure with two primary phases: ferrite and bainite. In this study, both were treated as a single phase: they share a BCC lattice structure with similar lattice parameters, so the EBSD detector cannot distinguish between them. In the original study43, the authors adopted deep learning and kernel average misorientation (KAM) maps to deconvolute these two phases. Additionally, datasets 17-20 consist of minerals, allowing us to validate our method on non-metallic materials as well. These datasets also help to verify that our approach is independent of the underlying characteristics of the EBSD dataset. The strength of the crystallographic texture varied substantially among the datasets, with the highest reported for dataset 9 at 5.89. Datasets 1, 2, and 10-16 showed the weakest texture, with a texture strength close to 1, suggesting a nearly random orientation distribution. Moreover, each dataset showed unique patterns in its PF maps. The largest dataset, dataset 8, consisted of 14,973 grains. All datasets, except for 18 and 20, consisted of more than 1000 grains.

Table 1 Datasets used to validate the TACS approach

TACS approach

The algorithm, illustrated in Fig. 5, begins with an EBSD dataset and aims to produce a representative subset of data points. It first determines the optimal number of clusters using K-means clustering to capture the inherent variability of the data. The optimal cluster count is identified from the within-cluster sum of squares (WCSS), which decreases as clusters are added: the count is chosen at the point where an additional cluster yields less than a 1% relative reduction in WCSS, indicating diminishing returns. This threshold was chosen for the stability it provides to the clustering process, balancing detail against computational efficiency. K-means clustering is run multiple times to reduce the effect of the initial conditions, which can produce variable outcomes.
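The 1% stopping rule can be sketched as follows. This is an illustrative NumPy-only version of K-means (Lloyd's algorithm) with random restarts, not the authors' implementation; it assumes plain Euclidean distance on the raw Euler angles, ignoring angular periodicity and crystal symmetry, and the names kmeans_wcss, choose_cluster_count, and k_max are hypothetical.

```python
import numpy as np


def kmeans_wcss(points, k, n_restarts=5, n_iter=50, seed=0):
    """Return the lowest WCSS found over several random K-means restarts."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_restarts):
        # Initialize centers from k distinct data points.
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # Squared Euclidean distance from every point to every center.
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            new_centers = centers.copy()
            for j in range(k):
                members = points[labels == j]
                if len(members):
                    new_centers[j] = members.mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best


def choose_cluster_count(points, k_max=20, rel_tol=0.01):
    """Smallest k for which going to k + 1 reduces WCSS by less than rel_tol."""
    prev = kmeans_wcss(points, 1)
    for k in range(1, k_max):
        nxt = kmeans_wcss(points, k + 1)
        if prev <= 0 or (prev - nxt) / prev < rel_tol:
            return k
        prev = nxt
    return k_max
```

Keeping the best of several restarts mirrors the repeated K-means runs described above; the upper bound k_max simply caps the search in this sketch.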

Fig. 5 Algorithm developed using K-means clustering and density-based sampling to generate the reduced-order representative Euler angle datasets.

Once the optimal number of clusters is determined, the algorithm performs density-based sampling within each cluster: the density of data points in each cluster is calculated and used to select a representative subset. This step ensures that the orientation distribution, as characterized by the ODF, is preserved in the reduced dataset.
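The text does not spell out the sampling rule, so the following is only one plausible reading, stated as an assumption: each cluster receives a share of the target size proportional to its population, and points are then drawn uniformly at random within the cluster. The function names are hypothetical.

```python
import numpy as np


def allocate_quota(cluster_sizes, n_target):
    """Split n_target among clusters in proportion to their populations."""
    sizes = np.asarray(cluster_sizes, dtype=float)
    exact = n_target * sizes / sizes.sum()
    quota = np.floor(exact).astype(int)
    # Hand the leftover points to the clusters with the largest remainders.
    leftover = n_target - quota.sum()
    order = np.argsort(-(exact - quota), kind="stable")
    quota[order[:leftover]] += 1
    return quota


def sample_by_cluster(labels, n_target, seed=0):
    """Return sorted indices of n_target points, drawn cluster by cluster."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    ids, counts = np.unique(labels, return_counts=True)
    quota = allocate_quota(counts, n_target)
    chosen = []
    for cluster_id, q in zip(ids, quota):
        members = np.flatnonzero(labels == cluster_id)
        # Uniform draw without replacement inside the cluster.
        chosen.append(rng.choice(members, size=q, replace=False))
    return np.sort(np.concatenate(chosen))
```

Proportional allocation keeps densely populated regions of orientation space densely represented in the subset, which is the property the sampling step is meant to preserve.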

The subsequent phase involves iterative refinement. In each iteration, the ODF of the newly created smaller dataset is computed and compared with the ODF from the previous iteration. The ODF is evaluated by dividing the range of each Euler angle into bins (typically 10) and generating a histogram for each angle, reflecting the density distribution of that angle. The selection of representative data points from the clusters is adjusted iteratively based on the ODF comparison. This process is repeated until the change in ODF between consecutive iterations falls below a relative threshold of 10%, which was set to ensure the stability of the ODF while avoiding unnecessary computations.
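A minimal sketch of the binned ODF comparison is given below. The Euler angle ranges (Bunge convention, in degrees) and the use of a summed absolute difference as the measure of change are assumptions made for illustration; the text specifies only the bin count and the 10% relative threshold.

```python
import numpy as np

# Assumed Bunge Euler-angle ranges in degrees: phi1, Phi, phi2.
ANGLE_RANGES = ((0.0, 360.0), (0.0, 180.0), (0.0, 360.0))


def binned_odf(euler_deg, n_bins=10):
    """Stack one normalized histogram per Euler angle into a 3 x n_bins array."""
    euler_deg = np.asarray(euler_deg, dtype=float)
    hists = []
    for column, angle_range in zip(euler_deg.T, ANGLE_RANGES):
        # Values outside the range are ignored by np.histogram.
        counts, _ = np.histogram(column, bins=n_bins, range=angle_range)
        hists.append(counts / max(counts.sum(), 1))
    return np.array(hists)


def relative_change(odf_new, odf_old):
    """Summed absolute difference between two binned ODFs, relative to the older one."""
    return np.abs(odf_new - odf_old).sum() / np.abs(odf_old).sum()


def has_converged(odf_new, odf_old, threshold=0.10):
    """True when the relative change falls below the threshold."""
    return relative_change(odf_new, odf_old) < threshold
```

In a refinement loop, binned_odf would be evaluated on each candidate subset and has_converged checked against the previous iteration before stopping.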

Upon convergence, the algorithm returns a dataset that, despite its reduced size, retains an accurate representation of the original dataset’s crystallographic texture and is therefore suitable for subsequent analyses.

To address the complexity posed by multi-phase materials, the TACS approach offers users two options. Recognizing the significance of phase proportions in representative volume elements, the method allows either the selection of a single phase for focused analysis or the generation of a dataset that preserves the phase fractions of the original dataset. The latter option follows the hypothesis that the representative volume should reflect the same phase fractions as the actual material. As an illustrative example, for a material such as Ti-6Al-4V, which comprises both α and β phases, the TACS method can be used to generate a dataset of 350 data points that mirrors the phase distribution of the original sample. The resulting dataset would contain 350 multiplied by the α phase fraction data points from the α phase and 350 multiplied by the β phase fraction data points from the β phase.

Probability distribution function mapping

The flowchart in Fig. 6 outlines the process for generating representative data points from the Euler angle probability distributions of an EBSD dataset. The process begins by defining the dimension of the representative Euler angle space (N), which sets the resolution of the representative dataset. The Euler angles are then extracted from the EBSD dataset, normalized to their valid ranges, and categorized into bins. A three-dimensional histogram of the binned Euler angles is constructed, in which the weight of each bin represents the frequency of the corresponding orientations. Next, the Euler angle probability distributions (PDs) are calculated from the EBSD dataset; these are histograms showing the frequency of specific orientations (ϕ1, Φ, ϕ2) within the material. Finally, weights are assigned within the N³ Euler space to mimic these PDs, creating a weighted model that reflects the original orientation data.
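The binning and weighting steps above can be sketched with NumPy's histogramdd. The sketch assumes Bunge angles in degrees, wraps ϕ1 and ϕ2 modulo 360, and clips Φ to [0, 180]; these normalization choices and the function name are assumptions, not details given in the text.

```python
import numpy as np


def euler_weight_grid(euler_deg, n_bins):
    """Return (grid, weights): n_bins^3 bin centers and their normalized frequencies."""
    euler_deg = np.asarray(euler_deg, dtype=float)
    # Normalize angles into the assumed valid ranges (degrees).
    wrapped = np.column_stack([
        euler_deg[:, 0] % 360.0,
        np.clip(euler_deg[:, 1], 0.0, 180.0),
        euler_deg[:, 2] % 360.0,
    ])
    counts, edges = np.histogramdd(
        wrapped, bins=n_bins, range=[(0, 360), (0, 180), (0, 360)]
    )
    # Bin centers along each axis, combined into an (n_bins^3, 3) grid whose
    # row order matches the C-order flattening of the counts array.
    centers = [0.5 * (e[:-1] + e[1:]) for e in edges]
    grid = np.stack(np.meshgrid(*centers, indexing="ij"), axis=-1).reshape(-1, 3)
    weights = (counts / counts.sum()).ravel()
    return grid, weights
```

Each row of grid is a representative orientation (ϕ1, Φ, ϕ2), and the matching entry of weights is the fraction of measurements that fell into that bin.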

Fig. 6 The flow chart of the probability distribution mapping method.

Kernel density estimation using MTEX built-in functions

The flowchart in Fig. 7 illustrates the approach for synthesizing a reduced-order representative dataset using MTEX built-in functions. As in the previous method, the process begins by specifying the number of representative data points (N) in the resulting Euler angle dataset. The ODF is then computed using the calcODF function of MTEX, a MATLAB toolbox for texture analysis. This function supports several algorithms, including direct kernel density estimation (KDE), kernel density estimation via Fourier series, and Bingham estimation. It also allows grain area to be used as weights for the orientations and accepts a specific kernel function such as SO3AbelPoissonKernel. The ODF can be computed as a Fourier series up to a specified order, with options to set the weights, halfwidth, resolution, and kernel function. For this study, kernel density estimation with a kernel size of five degrees was used to model the probability density of crystal orientations within each dataset.

Fig. 7 The flow chart of the kernel density estimation method used to generate the representative Euler angles.

Once the ODF is established, the calcOrientations function is used to draw N random orientations from it. The function performs probabilistic sampling from the ODF, so that the reduced dataset is expected to mirror the orientation characteristics of the material. The final output is a collection of representative data points, each comprising three Euler angles (ϕ1, Φ, ϕ2).