Introduction

Structural alloys are indispensable in modern engineering, serving critical roles across industries such as aerospace1,2,3, automotive4,5,6, and energy7,8,9. There is a continuous pursuit of novel alloys with enhanced properties10,11,12,13,14,15, including strength10,11, toughness12,13, and resistance to failure under various working environments14,15. Recent developments in complex alloys, e.g., high entropy alloys16,17 or multi-principal element alloys18,19, are actively enlarging the composition palette for alloy design, pushing the boundaries of traditional compositions. Concurrently, rapid advances in additive manufacturing (AM) have unlocked unprecedented control over alloy compositions, microstructures, and geometries at voxel-scale resolution while shortening manufacturing supply chains, thus enabling further tailored solutions for demanding applications20,21. Together, these advancements are opening an expansive design space of numerous composition choices and manufacturing procedures and hold promise for achieving combinations of superior properties. However, experimental fabrication and testing across the whole design space are extremely costly in both resources and time22,23. Therefore, efficient yet affordable methods to navigate this broad design space for desired performance are of great interest, not only to advance scientific understanding but also to enable novel engineering applications with next-generation superalloys.

Recent progress in artificial intelligence (AI) has rapidly expanded the horizon of solving scientific problems and engineering applications using data-driven methodologies. Large language models (LLMs), e.g., the GPT series from OpenAI24, the LLaMA models from Meta25 and the DeepSeek models26,27, trained on general language corpora have demonstrated intelligent capabilities in perceiving, understanding, and applying natural languages28,29 as well as coding languages30,31, thus assisting human researchers in accelerating and automating workflows in various tasks32,33,34. For solving scientific challenges, AI-based models, such as AlphaFold235,36,37 and RoseTTAFold38,39, can predict the atomic structures of proteins based on their sequences, reaching experimental accuracy at a much-reduced cost36. In material design and engineering, end-to-end generative AI (GAI) models have been developed to design novel protein sequences that meet desired structures or properties40,41,42. Similar explorations have been actively pursued for various material systems43,44, including functional molecules45,46, polymers47,48, inorganic crystals49,50 as well as metallic alloys51,52. For structural alloys, various data analysis techniques and machine learning methods (e.g., correlation analysis, symbolic regression and transfer learning) have been applied to help predict important mechanical properties, including bulk modulus53, tensile strength54, and fatigue strength55. However, data on experimentally measured mechanical properties remain relatively sparse due to their time-consuming and costly acquisition. Since language models have the unique strength of integrating data from various sources as general text and of further being multi-modal, they are of particular interest for alloy design, yet in-depth studies remain rare.

In this work, we explore the possibility of solving the alloy design challenge from a language modeling perspective by leveraging GAI and alloy data. Specifically, we encode physics-rich alloy data as one-dimensional (1D) sequences of text and develop an attention-based56 LLM to capture the underlying patterns in this alloy-specific language. At deployment, our model, AlloyGPT, can accurately predict structures and properties for given compositions, achieving R2 values ranging from 0.86 to 0.99 on the test set. When tested with compositions beyond the learned domain, we observe that the accuracy tends to degrade, but in a gradual and stable manner with respect to the distance in the composition space. The same model can also tackle alloy design challenges: for the same given property target, it generates various composition choices while maintaining high design accuracy. Through a sampling parameter, we can further boost the creativity of the model and tune the balance between the diversity of the suggested compositions and the accuracy of the delivered properties. Opening the model, we further unveil that AlloyGPT “thinks” with complex attention patterns between input and output information, some of which may hint at the underlying physics of the alloy systems. These numerical results demonstrate that language modeling with an LLM specialized for alloys can provide an accurate and robust representation of alloy physics. As a probabilistic model by nature, our model is particularly suitable for alloy design tasks with high degeneracy. For example, it can be exploited for designing gradient compositions or microstructures in alloys: at each voxel of the design, the possible degenerate solutions provide candidates for further downselection based on local constraints, leading to improved overall performance and design metrics. These capabilities can be valuable for accelerating alloy discovery in AM and reducing time and resource costs.
Our methodology is expected to be generalizable for new alloys and other materials with broad design spaces, potentially leading to a foundation language model with comprehensive material knowledge and suitable for integrated multi-material designs or alloys with gradient compositions/microstructures in the future.

Results

An alloy-specific language dataset and AlloyGPT model

Natural languages are powerful tools that enable human communication. In scientific communities, research can be conducted in various forms, including experiments, simulations, and/or theories (Fig. 1a). Communication and documentation of research have so far been carried out predominantly in the form of language, either as written text or oral speech. Inspired by this universal flexibility, here we explore the possibility of using language-like 1D sequences to directly represent physics-rich data for alloys. Note that the intended speakers of such languages are LLMs, instead of humans.

Fig. 1: Convert physics-rich data as sentences of an alloy-specific language.
figure 1

a Curate scientific data on alloys from various sources, including experiments, simulations, and/or theories. b Example entries in the Al-based alloy dataset include quantitative information on the compositions (red rectangle), structures (green rectangle), and properties (blue rectangle) of the alloy samples. c, d Reformat an example data entry as a 1D sequence of information blocks, which is referred to as a “sentence” of this alloy-specific language. Inside each block, the physical quantity is presented as a pair of key and value. c For forward prediction, the sentence is stated in the order of task type, composition (shaded in red), structure (green), and property (blue). d For inverse design, the sentence follows the order of task type, property (blue), structure (green), and composition (red).

Without loss of generality, in this work we adopt a database of Al-based alloys57 as an example of scientific data. The figures in our study related to this database are adapted from ref.57. The database has been developed and successfully applied to design additively manufacturable, high-strength aluminum alloys57,58. Using CALPHAD (CALculation of PHAse Diagrams) simulations with the experimentally validated database59, Al alloys with five alloying elements (i.e., Ni, Er, Zr, Y, and Yb) were studied using non-equilibrium Scheil solidification and single equilibrium calculations, modeling the as-built condition after rapid cooling in the laser-based AM process and the fully aged condition, respectively. The dataset is structured to group quantitative data of the compositions, phase structures, and properties of those Al alloys and embody the hidden physical relationship between them. The optimized composition design based on the database has been validated experimentally on their phase structures and enhanced properties57,60,61.

On the statistics of the database, 523,599 compositions were randomly sampled over the targeted composition space in terms of the alloying elements (i.e., Ni (0–4%), Er (0–2%), Zr (0–2%), Y (0–1%), and Yb (0–1%) in mole percentage (mol%)) using the Latin hypercube sampling method. As shown in Supplementary Fig. 1a, for the alloying elements, the sampled compositions show uniform distributions over the full targeted intervals, while the main element, Al, compensates the remaining mol%. The sampled compositions are confirmed to be uncorrelated, as the Pearson linear correlation coefficients are zero for any pair of the alloying elements (Supplementary Fig. 1b). On the phase structures, we focus on the four key precipitate phases that affect the mechanical properties of the alloys: the L12-structured strengthening phase (Al3M, M = Er, Zr, Y, Yb), the metastable ternary phase (Al23Ni6M4, M = Er, Zr, Y, Yb), the brittle Al3Zr phase with D023 structure, and the relatively soft Al3Ni phase. The strengthening phase is called the L12 phase for simplicity in this study. As shown in Supplementary Fig. 2, their mol% can be affected by the manufacturing process, thus differing between the as-built condition after the rapid solidification in laser-based AM (Supplementary Fig. 2a) and the fully aged condition (Supplementary Fig. 2b). On the properties, the dataset covers diffusion resistivity, misfit strain, coarsening rate metric, freezing range, crack susceptibility coefficient (CSC), and hot cracking susceptibility (HCS). Among these, the first three characterize the thermal stability of the L12 phase, which is important for high-temperature performance; the latter three are comprehensive indicators of the AM printability (crack-free solidification) of the composition. For simplicity of comparison, we normalize the values of properties with non-trivial units with respect to a benchmark printable Al alloy59 (i.e., Al-2.14Ni-1.15Zr-1.35Er-0.25Y-0.73Yb, wt.%). As shown in Supplementary Fig. 3, the sampled compositions cover broad intervals in the property space and show non-trivial distributions. With such a set of phase structures and properties, novel Al alloys with enhanced high-temperature strength and free of hot cracking during printing have been discovered and experimentally validated57,60. Further details of this dataset, including the definitions of all the properties, can be found in the “Methods” section.

To apply this dataset in our study, we reformat the numerical entries from the dataset into “sentences” of the proposed alloy-specific language. Specifically, we group information according to the underlying physics and order the information blocks based on the nature of the tasks. As shown in Fig. 1b, locally we group the physical quantities into text blocks on compositions (in the red dashed rectangle), structures (green), and properties (blue) separately, given that compositions lead to specific phases and phase distributions affect the properties. Globally, in the sentence structure, we order these blocks by putting the given information ahead of the queried part. For example, for forward prediction, compositions are listed before structures and properties (Fig. 1c, sentence structure), while in inverse design, the sequence starts with properties, followed by structures and compositions (Fig. 1d, sentence structure). To distinguish different tasks, we add a task block at the beginning of the sentence to label the task name. We also include simple signs to improve the readability of the sentences for human researchers (Fig. 1c, d, examples). Curly brackets enclose the individual blocks; square brackets enclose the list of physical quantities inside each block. The physical quantities are documented as individual pairs of “Key: Value”, where the key is a readable name tag of the quantity and the numerical value is expressed in scientific format. To connect the blocks, we use the directional sign “=>” to indicate the reading or inferring direction and the equal sign “=” for equivalence. With these simple rules, we can efficiently convert the numerical dataset of Al-based alloys into a corpus of sentences for different tasks. The statistics of the dataset for this alloy-specific language on Al-based alloys can be found in the SI (Supplementary Figs. 1–3).
We expect these rules can be straightforwardly generalized to include more information blocks or applied to other alloys or material systems.
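The encoding rules above can be sketched as a simple serializer. A minimal sketch follows; the task name, key strings, and example values here are illustrative placeholders, not the exact vocabulary of the published dataset.

```python
def encode_sentence(task, blocks):
    """Serialize an alloy record as a 1D 'sentence'.

    task   -- task-name string, e.g. "C-to-SP" (illustrative)
    blocks -- list of (block_name, {key: value}) pairs, ordered with
              the given information first and the queried part last.
    """
    parts = [f"{{Task: [{task}]}}"]
    for name, quantities in blocks:
        # "Key: Value" pairs, values in scientific format
        pairs = ", ".join(f"{k}: {v:.3e}" for k, v in quantities.items())
        parts.append(f"{{{name}: [{pairs}]}}")
    # "=>" marks the reading/inferring direction between blocks
    return " => ".join(parts)

# Forward-prediction ordering: composition before structure and property
sentence = encode_sentence(
    "C-to-SP",
    [("Composition", {"(Ni)": 2.1, "(Zr)": 1.0}),
     ("Structure",   {"L12_mol%": 4.2}),
     ("Property",    {"HCS_norm": 0.8})],
)
print(sentence)
```

For inverse design, the same serializer would simply be called with the property (and optionally structure) blocks first and the composition block last.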

It should be noted that the new alloy-specific language dataset keeps almost all the information of the original numerical dataset. The key difference is that it enables us to view alloy prediction and design challenges as the same sentence-completion task. Specifically, given only the beginning blocks, accurately completing these sentences is equivalent to predicting alloy structures and properties based on the given composition, or to designing alloy compositions that fulfill the given properties.

To handle such language tasks, we develop an attention-based56 autoregressive LLM, AlloyGPT, and test its performance on both forward prediction and inverse design tasks for alloys. The AlloyGPT model adopts an architecture similar to GPT-262 with a tokenizer63 customized for the alloy language. To balance the efficiency and expressiveness of the tokenizer, we developed a character-word mixing tokenizer. We reserved 118 word-based tokens for all elements in the periodic table (i.e., “(A)” for element A) so that their names are not divided during tokenization. Current LLMs are known to struggle with numbers due to the lack of effective tokenization schemes64, and it has been observed that single-digit tokenization outperforms other popular methods in learning arithmetic operations on numerical data65. Therefore, character-based tokens are adopted here for all numbers (0–9), supporting symbols (e.g., “+”, “-”, “.”, “=” and so on) and the English alphabet (in lower and upper cases) to express arbitrary numerical values and keys. With additional token positions reserved for future usage, our tokenizer has a compact vocabulary size of 256. For the GPT model configuration, we take a medium-size GPT-2 model as the prototype and modify its configuration to have an embedding dimension of 1024 and 36 multi-head attention (MHA) layers, each with 16 attention heads. No special embedding strategy is applied for the numerical text. With about 453.35 million trainable parameters, we will demonstrate in this work that such an AlloyGPT v1.0 model with the customized tokenizer can learn and handle the proposed alloy-specific language effectively. We expect that this model and tokenizer can further learn other alloy systems, given the flexibility of the tokenizer and the depth of the model.
It should be noted that there also exist other advanced tokenization schemes for numerical data, including continuous encoding64 and enhanced positional embeddings66, whose integration with our model we reserve for future study. We train AlloyGPT from scratch using the curated alloy-specific language database. Sentences for both forward prediction and inverse design are included and randomly mixed in the training batches. More details of the model architecture and training can be found in the “Methods” section and Supplementary Fig. 4.
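The character-word mixing scheme described above can be sketched in a few lines: element names wrapped as “(A)” stay whole as word-based tokens, while everything else is split into single characters. This is an illustrative sketch, not the production tokenizer; the element set is truncated here for brevity (the real vocabulary reserves all 118 elements).

```python
import re

# Illustrative subset; the real tokenizer reserves 118 element tokens
ELEMENTS = {"(Al)", "(Ni)", "(Er)", "(Zr)", "(Y)", "(Yb)"}

def tokenize(text):
    """Character-word mixing tokenization sketch: keep "(A)" element
    tags whole, split all other text (digits, symbols, letters)
    into single-character tokens."""
    tokens = []
    # Capturing group keeps the "(A)" separators in the split output
    for piece in re.split(r"(\([A-Z][a-z]?\))", text):
        if piece in ELEMENTS:
            tokens.append(piece)      # one word-based token per element
        else:
            tokens.extend(piece)      # character-based tokens
    return tokens

toks = tokenize("(Ni): 2.1e+00")
print(toks)
```

Single-digit tokenization of the numerals (“2”, “.”, “1”, …) follows the observation cited above that per-digit tokens help models learn arithmetic on numerical data.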

Learn the composition-structure-property (C-S-P) relationships with AlloyGPT

To investigate whether our AlloyGPT model can learn the underlying composition-structure-property relationship through the alloy-specific language and autoregressive training, here we apply the AlloyGPT model to solve forward prediction problems for Al-based alloys. In contrast to conventional methods, we pose the problem as a language task. As shown in Fig. 2a, we provide a text prompt with the information blocks of the task type and the composition as the input. Our model is able to continue the sentence in the correct format of the alloy-specific language and finish it at the right sequence length. This behavior indicates that AlloyGPT can learn the grammar rules via training.

Fig. 2: Solve forward prediction tasks for the composition-structure-property relationship in alloys using AlloyGPT.
figure 2

a Pose the alloy prediction problem as a language task of sentence completion and solve it using AlloyGPT. Compare the predicted values and the ground truth for example physical quantities from the structure block (b) and the property block (c) using the test set. Better accuracy is indicated by a larger R2, smaller mean squared error (MSE) and smaller mean absolute error (MAE). d Compare the prediction accuracy of the property block using different input prompt lengths in the C-to-SP and CS-to-P tasks.

Quantitative results can then be extracted from the text. For the case depicted in Fig. 2a, the predicted values of the alloy phases and properties are close to the ground truth, thus solving this composition-to-structure-and-property (C-to-SP) task. Following this procedure, we performed the C-to-SP task for the whole test set and observed consistently high accuracy. Representative results for examples from the structure and property blocks are shown in Fig. 2b (as-built L12 phase) and c (hot cracking susceptibility), respectively. Good agreement is observed between prediction and ground truth for both the overall distributions and the individual values. R2 values larger than 0.95 are observed for the as-built L12 phase mol% and the normalized HCS. These positive results indicate that language modeling can be competitive with conventional numerical methods (see the “Discussion” section for more details) in capturing the hidden pattern between compositions, structures and properties in this Al-based alloy family, thus potentially opening new avenues for alloy research.
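Extracting quantitative results from a completed sentence and scoring them against the ground truth can be sketched as follows. The regular expression and key names are illustrative (they assume the “Key: Value” scientific-notation format described for the alloy-specific language), and the R2 formula is the standard coefficient of determination used throughout the paper.

```python
import re

def extract_values(sentence):
    """Pull 'Key: value' pairs (value in scientific notation) out of
    a generated sentence. Key names here are illustrative."""
    pat = r"([A-Za-z0-9_%]+):\s*([-+]?\d+\.\d+e[-+]\d+)"
    return {k: float(v) for k, v in re.findall(pat, sentence)}

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

vals = extract_values(
    "{Structure: [L12_mol%: 4.200e+00]} => {Property: [HCS_norm: 8.000e-01]}"
)
print(vals)
print(r_squared([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```

Applying the same extraction over every completed test-set sentence and collecting per-quantity value lists yields the per-quantity R2, MSE and MAE statistics reported in Fig. 2.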

As an autoregressive language model67, AlloyGPT may possess length dependence in its prediction accuracy, which also suggests convenient strategies to enhance its performance. Specifically, AlloyGPT completes the given sentences by regressively predicting the next token, so the prediction error may accumulate and influence the accuracy of content predicted at a later stage. In Fig. 2d, we plot the R2 values (blue bars for C-to-SP tasks) of all the physical quantities in their prediction order from left to right. Our model achieves high accuracy across various physical quantities, with R2 values ranging from 0.86 to 0.99 (see Supplementary Table 1 for details). At the same time, the property block (shaded in blue), predicted later in the output, shows overall smaller R2 values compared with the earlier-predicted structure block (shaded in green). Therefore, one straightforward way to improve the accuracy of property predictions is to increase the input prompt length. For example, providing not only the composition but also the structure blocks for property prediction (i.e., a CS-to-P task) reduces the prediction length for the language model. As shown in Fig. 2d, such CS-to-P tasks performed by the same model indeed show improved accuracy with larger R2 values (red bars). This strategy also aligns well with a physics-based consideration, because composition and structure together better define the alloy and determine its properties. It should also be noted that other factors may affect the prediction accuracy. For example, by definition, CSC values are determined by the whole solidification process68, which makes them relatively challenging to predict; indeed, we observe that CSC has the lowest R2 value even though it is not predicted last.

Design Al-based alloys for target properties using AlloyGPT

As a language model of a probabilistic nature, AlloyGPT provides unique advantages in addressing the inverse design challenges for alloys. On one hand, the AlloyGPT model unifies the forward prediction and inverse design tasks under the same language task. From the format point of view, the sentence-completion task developed for the forward prediction tasks in the previous section can be straightforwardly applied to the inverse design tasks. As shown in Fig. 3a (left panel), the updated input prompt can include only the property information for a property-to-structure-and-composition (P-to-SC) design task, or both the property and structure blocks for a property-and-structure-to-composition (PS-to-C) design. Our model is tasked to complete the remainder of the sentence, with the composition information included. On the other hand, the probability-based prediction of the AlloyGPT model can be leveraged to capture degenerate solutions in alloy design challenges, which will be discussed later in this subsection.
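Building the inverse-design prompts is just the forward-prediction serialization with the block order reversed. A minimal sketch follows; the task labels, block names and keys are illustrative stand-ins for the dataset's actual vocabulary, and the trailing "=>" marks where the model is asked to continue.

```python
def build_design_prompt(properties, structures=None):
    """Build an inverse-design prompt: task block first, then the
    given blocks; the model completes the composition. Task names
    and keys are illustrative."""
    def block(name, quantities):
        pairs = ", ".join(f"{k}: {v:.3e}" for k, v in quantities.items())
        return f"{{{name}: [{pairs}]}}"

    if structures is None:
        # P-to-SC: only the property target is prescribed
        parts = ["{Task: [P-to-SC]}", block("Property", properties)]
    else:
        # PS-to-C: both property and structure blocks are given
        parts = ["{Task: [PS-to-C]}", block("Property", properties),
                 block("Structure", structures)]
    # Open-ended "=>" invites the model to complete the sentence
    return " => ".join(parts) + " =>"

prompt = build_design_prompt({"HCS_norm": 0.5})
print(prompt)
```

Feeding such a prompt to the trained model and parsing the completion recovers the suggested composition, exactly mirroring the forward-prediction workflow.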

Fig. 3: Generate alloy compositions based on property and structure information using AlloyGPT.
figure 3

a R2 values between the known physical quantities and the suggested ones for different inverse design tasks. When both property and structure information are provided as the input (i.e., PS-to-C tasks), higher R2 values are observed (red bars in (a)), and the compositions suggested by AlloyGPT are similar to the known cases in the test set (c). However, when only the property target is provided in a P-to-SC task, the suggested compositions can be very different from the known ones (b). b, c Compare the Y mol% suggested by the AlloyGPT model and the known values from the test set for the P-to-SC (b) and the PS-to-C (c) design tasks, respectively.

To evaluate the performance of this unified pathway for solving design challenges, we perform P-to-SC and PS-to-C design tasks on the test set using AlloyGPT. By setting the properties of data entries from the test set as the goal, we make sure there exists at least one composition as a feasible solution. The R2 values calculated between the known compositions from the test set and those suggested by our model are shown in Fig. 3a (right panel). When both properties and structures are provided as the input prompt or design goal, in the PS-to-C tasks the suggested compositions achieve high R2 values (red bars in Fig. 3a), similar to those observed in the previous section for forward prediction tasks, and agree well with the known compositions from the test set (taking element Y in Fig. 3c as the weakest example). This result demonstrates that AlloyGPT can “rediscover” compositions known to the test set based on conditioning on properties and structures.

However, when only properties are prescribed as the design goal in the P-to-SC tasks, the suggested physical quantities, for both structures and compositions, achieve low R2 values (blue bars in Fig. 3a). As shown in Fig. 3b, the suggested mol% of element Y tends to diverge from the known records and spread broadly. Thus, AlloyGPT can also suggest compositions different from the known ones.

It should be clarified that the R2 values of the compositions in Fig. 3a (right panel, shaded in red) should not be taken as indicators of design accuracy. By definition, they represent the composition recoverability with respect to the test set, i.e., how similar the proposed compositions are to the known ones from the test set (Fig. 4a). Design accuracy, instead, should be evaluated by comparing the targeted goal and the delivered performance, i.e., the prescribed properties and the achieved values for the P-to-SC design task (Fig. 4a).

Fig. 4: Evaluate the design accuracy of AlloyGPT model in the P-to-SC design tasks.
figure 4

a Distinguish composition recoverability and design accuracy for alloy design tasks. We define the similarity between the suggested composition and the known one from the test set as composition recoverability, not design accuracy. Instead, design accuracy is evaluated by comparing the design goals and the delivered performances, i.e., the prescribed properties and the achieved properties for P-to-SC tasks. b Design accuracy and composition recoverability of AlloyGPT for the P-to-SC task. R2 values between the targeted properties and achieved properties (blue bars) are high and indicate good design accuracy, while the R2 values between the known compositions and the designed ones (red bars) are low, suggesting low composition recoverability with respect to the test set. Comparison of the targeted values and the achieved values for some example properties, including the coarsening metric (c) and hot cracking susceptibility (d). Candidates with low HCS values are highlighted by a blue rectangle in (d) and can be promising for overcoming hot cracking under rapid solidification in additive manufacturing.

To truly estimate the design accuracy, we first obtain the property and structure results of the designed compositions by applying the same protocol used for the initial dataset creation, and then calculate the R2 values between these and the input design targets. As shown in Fig. 4b, high R2 values (blue bars) are obtained between the input property goals and the achieved properties, indicating that AlloyGPT indeed achieves high design accuracy for the P-to-SC design tasks. Comparisons of some example properties, including the normalized coarsening metric and HCS, are shown in Fig. 4c, d. Good agreement in both distributions and individual values is observed. At the same time, the low composition recoverability (red bars in Fig. 4b) suggests that, for the given properties, there can be different compositions as design solutions. By achieving high design accuracy and low composition recoverability, AlloyGPT demonstrates promising potential in addressing the challenge of degenerate solutions in inverse design problems, maintaining design accuracy and diversity at the same time.

It should also be noted that AlloyGPT addresses the inverse design tasks in an efficient end-to-end manner. Thus, it bypasses the iterative search steps that are usually required, and sometimes time-consuming, in conventional design methods based on optimization57,69,70. For example, consider the task of designing for a low diffusion resistivity value of 0.1. As a traditional baseline, we adopted a Bayesian optimization method with an XGBoost model as the surrogate to find possible solutions. We observed that dozens of iterations (e.g., about 90 iterations, as shown in Supplementary Fig. 5) may be needed before the optimization reaches a solution. In contrast, by providing the targeted diffusion resistivity as the input prompt, the AlloyGPT model can directly complete the sentence and provide a composition with high confidence, thus offering a concise and straightforward design strategy. At the same time, AlloyGPT remains computationally heavy and often relies on capable GPU hardware to achieve smooth generation speed. Future studies may investigate how to further reduce the computational costs of such LLMs. By solving forward prediction and inverse design tasks with the same model, AlloyGPT presents a unique strength compared to traditional machine learning techniques71.

Discussion

Taken together, the results above demonstrate that language modeling can provide a unified format to address the forward prediction and inverse design tasks in alloy research. Trained on physics-rich alloy data, our AlloyGPT model can capture the underlying composition-structure-property relationships and accurately predict the microstructural phases and properties in the as-built and aged conditions. For inverse design, the AlloyGPT model can not only produce accurate designs but also generate diverse compositions. As an initial step towards language-modeling-powered material research, we expect our model and methodology to be integrated with many engineering applications and to continuously improve with growing datasets, enhanced language model architectures and deepened scientific understanding. To that end, some discussion of its robustness, tunability, interpretability, comparison, and expandability is in order.

We start the discussion by investigating the robustness of AlloyGPT when going beyond the learned domain. As with any data-driven method, the capability learned by our AlloyGPT model is strongly affected by the available data. For example, the dataset adopted in the current AlloyGPT model covers a finite hypercube domain in the composition space57. For many practical applications, it is important to evaluate how the model performs when pushed beyond its learned domain. To do so, we intentionally sample compositions beyond the learned domain and perform C-to-SP prediction tasks to evaluate the accuracy. As shown in Fig. 5a, we enlarge the mol% of elements Yb, Zr and Er by one-fold and randomly sample compositions from this enlarged region. Since our model has not been trained on any data from this domain, we borrow from biology and refer to these compositions as de novo compositions. We define the L2 distance of a sampled composition to the nearest boundary of the learned domain as the mutation distance, dM. As the mutation distance increases, we observe that the variance of the relative L1 errors of the predicted properties gradually grows (Fig. 5b), indicating that the model loses predictive capability when moving away from the learned domain. However, it should be noted that the variance increases in a gradual and stable manner. Therefore, in the thin neighborhood of the learned domain (e.g., dM < 0.5), the model is expected to still behave relatively well without dramatic increases in error. Caution is recommended when making predictions beyond this region. In the long term, retraining the model with a growing dataset over an enlarged composition space is expected to better address this issue.
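The mutation distance dM, the L2 distance from a sampled composition to the nearest point of the learned hypercube, can be computed directly from the dataset's sampling bounds. A minimal sketch, using the mol% ranges stated in the dataset description (the exact distance convention used in the paper may differ, e.g. in normalization):

```python
import math

# Learned composition domain in mol%, from the dataset description
BOUNDS = {"Ni": (0, 4), "Er": (0, 2), "Zr": (0, 2), "Y": (0, 1), "Yb": (0, 1)}

def mutation_distance(composition):
    """L2 distance from a composition to the nearest point of the
    learned hypercube; zero for compositions inside the domain."""
    sq = 0.0
    for el, (lo, hi) in BOUNDS.items():
        x = composition.get(el, 0.0)
        if x < lo:
            sq += (lo - x) ** 2
        elif x > hi:
            sq += (x - hi) ** 2
        # inside [lo, hi]: this element contributes nothing
    return math.sqrt(sq)

# A de novo composition with Er and Zr pushed beyond the learned ranges
print(mutation_distance({"Ni": 2.0, "Er": 2.3, "Zr": 2.4, "Y": 0.5, "Yb": 0.2}))
```

Grouping de novo test points into bins of dM and plotting the mean and variance of the relative L1 prediction errors per bin reproduces the kind of analysis shown in Fig. 5b.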

Fig. 5: Test AlloyGPT beyond the learned domain with the prediction tasks.
figure 5

a Sample composition points beyond the learned domain as de novo composition test points and define the mutation distance. b Means and variances of the relative L1 error of property predictions for the de novo compositions of an increasing mutation distance.

We continue our discussion by studying the design accuracy and composition diversity of AlloyGPT. In contrast to conventional deterministic models, AlloyGPT predicts the probability distribution of the next token based on the previous text and then samples from this distribution62. Therefore, depending on the shape of the probability distribution, different outcomes at the token level can be observed when repeating the generation process with the same input or prompt. Moreover, this probability distribution can be rescaled by dividing the logits by a sampling temperature parameter, which is termed the prediction temperature, Tp, in our AlloyGPT model. A Tp larger than 1 tends to flatten the probability distribution and boost diverse predictions. As shown in Supplementary Fig. 6a, repeating the same generation task can lead to slightly different responses (rows 2 and 3); by increasing Tp to a higher value (e.g., 3.0), the generated sentence fails to follow the grammar rules of the alloy-specific language and becomes unusable for quantity extraction.
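Temperature-scaled sampling is the standard decoding mechanism described above: divide the logits by Tp, apply a softmax, and sample the next token from the resulting distribution. A self-contained sketch (the logits are made up for illustration):

```python
import math
import random

def sample_with_temperature(logits, t_p, rng=random.random):
    """Rescale logits by the prediction temperature, softmax, and
    sample a token index. t_p > 1 flattens the distribution (more
    diverse tokens); t_p -> 0 approaches greedy decoding."""
    scaled = [l / t_p for l in logits]
    m = max(scaled)                        # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, cum = rng(), 0.0
    for i, p in enumerate(probs):          # inverse-CDF sampling
        cum += p
        if r < cum:
            return i, probs
    return len(probs) - 1, probs

logits = [2.0, 1.0, 0.1]                   # illustrative next-token logits
_, sharp = sample_with_temperature(logits, 0.5)
_, flat = sample_with_temperature(logits, 2.0)
print(sharp[0], flat[0])
```

At low Tp the top token dominates and generations are nearly deterministic; at high Tp low-probability tokens are sampled often enough that, as observed in Supplementary Fig. 6a, the output can break the grammar of the alloy-specific language.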

Given those observations, it is important to understand the stability of AlloyGPT’s predictions and how to potentially leverage it for design purposes. We randomly sample 200 data entries from the test set to perform P-to-SC design tasks. For each input prompt, we repeatedly generate 20 designs at the same prediction temperature. This procedure is then performed for a series of prediction temperatures ranging from 0.001 to 3. It is observed that a prediction temperature \({T}_{p}\le 2\) can produce readable outputs with a high probability (\(\ge 87 \%\)) and is recommended for real applications. For designs under such conditions, we analyze the diversity of the suggested compositions and the accuracy of the achieved properties using their coefficient of variation, CV, and relative L1 error, \({L}_{1}^{{{rela}}}\), respectively (the definitions can be found in the “Methods” section). To investigate the effect of prediction temperature, we group the results for different design targets at the same prediction temperature and show the mean values in Fig. 6. As the prediction temperature increases, the mean CV values for all alloying elements generally increase (Fig. 6a), indicating boosted diversity of the designed compositions. At the same time, the mean values of \({L}_{1}^{{{rela}}}\) for the achieved properties also increase clearly for Tp > 1 (Fig. 6b), suggesting that the design accuracy may decrease with the prediction temperature. It should be noted that, for many properties (except the crack susceptibility coefficient and hot cracking susceptibility), the normalized error increases with the prediction temperature with a small and stable slope. These trends together indicate that AlloyGPT can generate diverse designs through its probabilistic nature. For practical design problems with intrinsic degeneracy, 0 < Tp ≤ 1 is often enough to capture multiple solutions while maintaining design accuracy.
As shown in Fig. 6, in this region the design error (Fig. 6b) for all the properties remains flat with weak oscillations while nontrivial variance exists in the generated compositions (Fig. 6a). On the other hand, for physical problems that may lack real degeneracy, a Tp between 1 and 2 can be tried to intentionally boost diversity and explore possible solutions. However, the suggested compositions may then need further screening for design accuracy. Both the AlloyGPT model as a forward predictor and CALPHAD-based simulations can serve as initial screening tools, while final validation is reserved for experiments.

Fig. 6: The effects of prediction temperature on the diversity of compositions and the accuracy of the achieved properties in solving P-to-SC design tasks using AlloyGPT.
figure 6

a The diversity of the suggested compositions at different prediction temperatures. b The design error in the delivered properties of the compositions suggested by AlloyGPT at different prediction temperatures.

For the inverse design task, we clarify that the seeming trade-off between diversity and accuracy does not always exist. For degenerate design problems (i.e., for a given set of properties, there exist multiple compositions that can achieve it), solving them successfully naturally achieves diversity in compositions and accuracy in properties at the same time. Our AlloyGPT model is trained to learn the joint probability distribution of all possible compositions under the given property prompt. Thus, for example, when two composition choices can lead to the same properties, the well-trained AlloyGPT model tends to produce a probability distribution in which both candidates have comparably high probability and can be chosen during multiple attempts. That is the underlying reason for what we observed in Fig. 4 (with Tp = 1), where the AlloyGPT model achieves both accuracy and diversity due to the degenerate nature of the alloy system itself, not the choice of Tp. As shown in Fig. 6, for 0 < Tp ≤ 1, the design error (Fig. 6b) remains flat with some oscillations while there exists nontrivial diversity in the proposed compositions (Fig. 6a). Therefore, the trade-off between accuracy and diversity arises mainly for Tp > 1.

For forward prediction tasks, it should be noted that there are no degenerate solutions by nature. Therefore, the high accuracy in forward prediction tasks over the test set (Fig. 2 with a trivial Tp = 1) indicates that, although probabilistic by nature, AlloyGPT can behave like a deterministic model when there is little degeneracy in the physical problem itself. To test the stability and accuracy of the forward predictions explicitly, we randomly selected 50 compositions, applied the AlloyGPT model to perform C-to-SP predictions 10 times for each composition, and calculated the relative L1 error of the predictions. The means and variances of the relative L1 error for phase structures and properties are shown in Supplementary Fig. 7. For reliable predictions (e.g., mean relative L1 error less than 0.06), the observed variances remain small (i.e., less than 0.01), while for less accurate predictions (e.g., predictions on AsBuilt_TernaryMol% and AsBuilt_Al3NiMol% in the randomly selected cases), relatively larger variances (0.02–0.04) are observed. Combining these results, we observe that for reliable predictions the AlloyGPT model tends to generate stable and consistent results, even though it is a probabilistic model by nature.
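The stability check described above can be sketched as follows (a hypothetical helper, not the actual analysis code; it aggregates the relative L1 error over repeated predictions of the same quantity):

```python
import statistics

def prediction_stability(repeats, truth):
    """Mean and population variance of the relative L1 error over
    repeated forward predictions of the same ground-truth quantity."""
    errors = [abs(p - truth) / abs(truth) for p in repeats]
    return statistics.mean(errors), statistics.pvariance(errors)
```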

Next, we focus our discussion on understanding how the AlloyGPT model reasons. Built upon its high prediction accuracy, our AlloyGPT model brings unique opportunities to look inside the trained model and try to understand how it reasons, i.e., how it combines the provided information with its trained parameters to form predictions.

To understand how the AlloyGPT model reasons when making predictions, we open the model and analyze the attention weights between information blocks. As the backbone of the AlloyGPT model, the MHA layers are expected to play key roles in learning the underlying relationships between tokens. The attention weights calculated within them often provide insights into the “thinking” process of LLMs for natural language applications72. Here, we focus on a simple representative test, in which, given the composition, the AlloyGPT model predicts the mol% of the L12 phase under the as-built condition. The input prompt and the output are shaded in pink and blue in Fig. 7a, respectively. After the prediction is done, we extract and analyze the attention weights from the individual heads and MHA layers. It should be noted that the original attention weights are formed between individual tokens. Here, for clarity, we group the raw individual tokens into information blocks (see rectangles in Fig. 7a) and coarse-grain the attention weight matrix accordingly. For individual heads, some clues found in the attention weights suggest reasoning paths that agree with the underlying physics. For example, in the last MHA layer (i.e., Layer 35), some attention weights between the L12 phase information block (shaded in gray in the left columns) and all information blocks (shaded in color in the right columns) are shown in Fig. 7b, c as the color depth of the connection lines. In Head 1 (Fig. 7b), relatively high attention is paid to the information blocks of elements Zr and Yb, while in Head 10 (Fig. 7c), among elements, relatively high attention is received by elements Er and Y. These attention priorities agree with the underlying physics that Zr, Yb, Er and Y are all potential L12-forming elements, and indicate that the AlloyGPT model has learned, via training, to relate them strongly with the L12 phase information block.

Fig. 7: Visualize attention weights as AlloyGPT makes predictions.
figure 7

a A representative test where AlloyGPT predicts the L12 phase mol% under the as-built condition for the given composition. The input prompt is shaded in pink and the output prediction in blue. For clarity, the tokens in the text are grouped into information blocks as indicated by the rectangles. b, c Attention weights between the output information block and all information blocks are highlighted by the depth of the colors. For the last multi-head attention layer (i.e., Layer 35), in Head 1 the output L12 information block pays high attention to elements Zr and Yb (b), while in Head 10 it pays high attention to elements Er and Y (c). Note that Zr, Yb, Er and Y are all potential elements for forming L12 phases with the main element Al. d Attention weights between all information blocks for the 16 heads in all 36 multi-head attention layers are visualized in a bird’s-eye view, and a collection of complex patterns is observed. The AlloyGPT model is expected to combine multiple attention patterns to collectively reason and make the prediction for the next token.

It should also be pointed out that not all attention weights can be directly correlated with known physics, and there exists a collection of different attention patterns across the parallel attention heads and serial MHA layers inside AlloyGPT. For instance, in Fig. 7c, the L12 phase information block pays the highest attention to the beginning information block of the text. In Fig. 7d, the attention weights between all information blocks are shown for the 16 heads through the 36 MHA layers. It can be observed that the attention pattern in each head (see each column) evolves across the layers; in the final MHA layer, the patterns from the 16 heads present strong diversity. It is the collective contribution of these patterns that AlloyGPT adopts to predict the next token, indicating an intertwined reasoning pathway. This complexity makes it challenging to provide a clear or simple explanation of how AlloyGPT reasons toward a prediction, but it echoes the known trade-off between the expressiveness and interpretability of language-based models73.

To quantitatively investigate the “thinking” pattern of AlloyGPT in relating the inputs and outputs, we further examine the influence of the input features on the predictions of the AlloyGPT model by calculating the input saliency74 and input attributions75. Input saliency has been used to distinguish the importance of the input features in the prompt for the output texts of LLMs by evaluating the derivatives of the model output with respect to the inputs76. Here, we leverage the back-propagation mechanism and an integrated gradient method to calculate the input saliency for AlloyGPT when predicting the mol% of the as-built L12 phase (Fig. 8a–c) for the given input composition. As shown in Fig. 8a, when starting to predict the first token “+” in the numerical result, the information block “AsBuilt_L12Mol%” demonstrates the highest input saliency, indicating that AlloyGPT first identifies what phase is being considered at this moment; when predicting the leading tokens that strongly affect the numerical value given their positions (e.g., “6” in b and “2” in c), relatively high input saliencies are observed at the information blocks of elements Er and Yb (Fig. 8b, c), suggesting that these two elements strongly affect the value of the L12 phase mol%. This result agrees well with the high Pearson coefficients observed between the printed L12 phase mol% and these two elements among all alloying elements reported recently58. These and similar observations strongly indicate that, learning solely from data, AlloyGPT can form “thinking” patterns that pay uneven consideration to information blocks in the input prompts to form specific predictions; some of these patterns resonate with known alloy physics while others may prompt new understanding of the underlying physics. To understand how the final choices of the output tokens evolve through the depth of the AlloyGPT model, we trace the rankings of those tokens after each MHA layer. As shown in Fig. 8d, the important numerical tokens (e.g., “6”, “2” and the last two “0”s highlighted by the blue rectangles) often only become the clear first choice within the last three layers, while other supporting tokens, which usually remain unchanged (e.g., “e” and “+” highlighted by the orange rectangles), emerge as the top choice earlier (at Layer 21 for “e” and at Layer 2 for “+”) and keep the position thereafter. These differences in the evolution of the decision making inside AlloyGPT indicate a diverse yet comprehensive “thinking” process across the multiple MHA layers to reach the conclusion for the choice of tokens at different positions.
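Tracing token rankings across layers boils down to ranking a candidate token within a score vector after each layer. A minimal helper might look like the following (a sketch; the layer-wise readout of intermediate logits is model-specific and omitted here):

```python
def token_rank(logits, token_id):
    """Rank (1 = top choice) of token_id in a vector of per-token scores,
    used to trace how a token's ranking evolves after each MHA layer."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order.index(token_id) + 1
```

Applying this helper to the intermediate scores after every layer yields the ranking trajectories plotted in Fig. 8d.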

Fig. 8: Quantitative evaluations of how input features affect the output of AlloyGPT.
figure 8

a–c The input saliency (between 0 and 100%) is calculated as the normalized derivatives of the predicted tokens (with thick underscores, “+” in (a), “6” in (b) and “2” in (c)) with respect to the preexisting information blocks using a back-propagation gradient method. d The evolution of the rankings of the predicted tokens (listed along the x axis) within the AlloyGPT model through the 36 multi-head attention layers (listed along the y axis). Blue rectangles highlight tokens that are only finalized in the last three layers, while the orange rectangle highlights those that are chosen in earlier layers and remain unchanged. e The attributions of the input features of alloying elements to the predicted phases and properties are calculated using a perturbation method for many input compositions. For each row of predictions, only the relative rankings of the attribution values of the alloying elements matter.

While gradient-based input saliency provides rich details about the influence of input features, it is often expensive to calculate and demands heavy computational resources (e.g., to calculate and store the gradients across the layers in LLMs) as the predicted sequence becomes longer. To collectively evaluate the influence of many compositions on the full predictions of the phases and properties, here we switch to an affordable perturbation method to calculate the LLM attributions75 for AlloyGPT in forward prediction tasks. The input attribution is defined as the change in the direct output of the AlloyGPT model (i.e., the probability of tokens) due to the ablation of the information blocks in the input prompts, and only the relative values matter. As shown in Fig. 8e, the attributions of the alloying element information blocks to the predictions of phases and properties are evaluated and averaged over 1000 randomly selected compositions from the learned composition domain. The observed attributions demonstrate intriguing patterns between inputs and outputs, which may suggest new clues about the composition-structure-property relationship in the alloys. For instance, for Al3Ni phases in both as-built and aged conditions, element Ni shows the highest attribution, which agrees well with physical intuition. For the printability indicators, freezing range and CSC, element Yb shows the strongest attribution, which matches well with the Pearson coefficient evaluation based purely on data analysis57. Thus, future alloy research may take this as a working hypothesis and plan to validate and understand the potential hidden mechanisms via physics-based modeling and experiments. It should be noted that it remains unclear whether or how AlloyGPT captures the meaning of the numerical texts correctly or similarly to how a human does. Future studies may combine the observations here on the interpretability of AlloyGPT to formulate affordable methods to systematically define and investigate the sensitivity of AlloyGPT or similar LLMs trained on alloy-specific language or other data-intensive languages.
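The perturbation-based attribution described above can be sketched as follows. This is a minimal illustration with a toy stand-in model, not the actual AlloyGPT interface; the function and block names are ours:

```python
def input_attribution(model_prob, blocks, ablate_value=""):
    """Perturbation-based attribution: ablate each input information block
    in turn and record the drop in the model's output token probability.
    Only the relative attribution values matter."""
    base = model_prob(blocks)
    return {key: base - model_prob({**blocks, key: ablate_value})
            for key in blocks}

# Toy stand-in "model": output probability driven mostly by the Ni block.
toy = lambda b: 0.9 * bool(b["Ni"]) + 0.1 * bool(b["Er"])
attrs = input_attribution(toy, {"Ni": "4 wt%", "Er": "1 wt%"})
```

For the toy model, the Ni block receives the largest attribution, mirroring the intuition that the dominant input feature dominates the attribution ranking.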

We continue our discussion by briefly comparing AlloyGPT with conventional methods. With the AlloyGPT model, our work aims to introduce a novel language-based modeling approach for alloy prediction and design and to highlight its unique advantages over conventional methods. It is well recognized that conventional methods, including classical machine learning models, can excel in many alloy prediction tasks. For example, we have constructed and optimized a decision tree model using the XGBoost (eXtreme Gradient Boosting) algorithm with multiple outputs to solve the C-to-SP tasks and compared its performance with our AlloyGPT model. As shown in Supplementary Table 2, after training with the equivalent dataset, the XGBoost method can achieve satisfying accuracy for predictions of both phases and properties (with R2 values ranging from 0.79 to 0.99), similar to the performance of AlloyGPT. Also, as a lighter model, the XGBoost method demands fewer computational resources for training (e.g., minutes of CPU time with XGBoost vs. tens of hours of GPU computing for AlloyGPT) and inference. However, with AlloyGPT, we observe some improvements in prediction accuracy, as shown in terms of the mean absolute errors (MAEs) in the last two columns of Supplementary Table 2.

More importantly, the AlloyGPT model presents a few novel features that conventional methods rarely show. (1) Overloading functionalities within one unified model. By treating both as sequence-completion tasks, AlloyGPT can solve forward predictions and inverse designs with one set of model weights, which is rarely achieved with conventional physics-based models and classical machine learning methods. Given the depth of attention layers in the model, we expect more alloy systems and tasks can be overloaded into the same AlloyGPT instance via continuous learning, which makes it a compelling candidate for a foundation model for alloy research. (2) Handling degenerate problems via its probabilistic nature. Facing degenerate problems with multiple solutions, conventional methods often require extra effort to recover different solutions, such as multiple search attempts with different starting points. In contrast, AlloyGPT learns the underlying probability distribution of the next token to be predicted and can naturally generate different solutions that possess comparable probabilities. These two unique features distinguish AlloyGPT and similar language-based models from conventional methodologies and open new possibilities in alloy and materials science research.

At the same time, it is acknowledged that there exist different pathways to train LLMs. In this work, we trained the AlloyGPT model from scratch so that it can serve as a clean benchmark to demonstrate its novel potential in alloy research. For other applications, one may prefer to teach the proposed alloy-specific language to LLMs that are pre-trained on natural languages (e.g., the open-sourced LLaMA25 and DeepSeek26 models) using various fine-tuning strategies77,78. We expect comparable performance, as well as possibly new functionalities, to emerge from those models and reserve the in-depth investigation for future study.

Finally, we discuss the potential expansion and further development of our AlloyGPT model. We have developed the AlloyGPT model with applications to additively manufacturable Al alloys in mind. Therefore, we have intentionally selected printability (i.e., crack-free additive manufacturing) indices as part of the properties of interest. It is well known that high-strength Al alloys (e.g., Al7075) are often prone to hot cracking during the rapid solidification in AM79,80. Thus, we include the freezing range, CSC and HCS in the database to evaluate the printability81 of various compositions. After training, the AlloyGPT model can be applied to screen or suggest compositions that are optimized for printability in AM. Built upon the validated reliability of AlloyGPT, one can start with compositions generated by AlloyGPT under a prompt with low values of HCS (e.g., points highlighted in the blue rectangle in Fig. 4d), down-select based on additional criteria, and then perform experimental validation. Going beyond, other parameters that are key to AM can be further integrated into future AlloyGPT versions. Given the learning capabilities of AlloyGPT models and the expressiveness of the alloy-specific language, we expect information on the AM processing82,83 (e.g., laser power, scanning speed, hatch spacing, etc.) can be further integrated into the “sentences” as a new block of processing information, letting the AlloyGPT model learn the hidden relationships between composition, process, structures and properties. Facing the inevitable noise and deviations in real experiments, we expect uncertainty quantification techniques58,84 can also be applied to gain a deeper understanding of the trained AlloyGPT model as well as the alloy system and to provide guidance for more robust designs that remain insensitive to various uncertainties in realistic engineering scenarios. We leave in-depth investigation along this direction for future study.

Our current prototype AlloyGPT model has been trained on Al-based alloys. Given the generality of the alloy-specific language, more numerical data of relevant physics and other alloy systems can be easily reformatted as new text corpora with little or no modification. The GPT architecture and models similar to our AlloyGPT have demonstrated promising scaling laws85,86 in learning from more training data and achieving better performance. Combining these two strengths, we expect our AlloyGPT model to be further improved by training with enlarged text corpora for different alloys and materials. Not only numerical data but also text statements or descriptions69 can be integrated at the same time in this language modeling process. This mixture may help to take full advantage of the available data in different formats and with various fidelities. Going beyond Al-based alloys, it will be interesting to let the AlloyGPT model learn multiple alloy systems and to investigate its extrapolation capabilities in prediction or design for novel hybrids of alloys as well as alloys with spatial gradients87, thus building towards a foundation model for alloys. We consider these studies as promising research directions in language-modeling-based materials research and design.

In summary, we introduce AlloyGPT, an autoregressive language model specifically tailored to alloy design and prediction tasks. By encoding comprehensive alloy data into an alloy-specific language, our approach bridges the gap between traditional numerical modeling and modern GAI techniques. AlloyGPT successfully learns the intricate C-S-P relationships within alloys, demonstrates high predictive accuracy and robust inverse design capabilities for Al-based alloys, and suggests promising clues about alloy physics via intriguing attention patterns. Beyond its ability to rediscover known compositions, the model excels at generating diverse alloy designs that achieve targeted properties with high accuracy, effectively addressing the degeneracy challenge in inverse design.

Our results highlight the versatility of AlloyGPT in navigating the vast compositional design space of additively manufacturable alloys, with gradual degradation in performance observed only when operating outside the learned domain. The probabilistic nature of the model allows multiple composition designs to achieve the given design goal. Through parameters such as prediction temperature, the “creativity” of the model can be further boosted by tuning the balance between design diversity and accuracy, thus providing a flexible tool for alloy discovery. These findings underscore the potential of language modeling as a unified framework for both forward prediction and inverse design in materials science.

While this work establishes a foundation for integrating GAI into alloy and material design, we expect the current model can incorporate more knowledge with more training tokens. With a size similar to that of the GPT-2 medium model, AlloyGPT v1.0 would ideally be trained on tens of billions of tokens88,89. However, the dataset we created here provides about 0.5 billion tokens, which is relatively small. It remains challenging to curate large, consistent and high-quality alloy-specific datasets. We plan to continuously grow our alloy dataset by including more information blocks (e.g., blocks that describe processing conditions) and covering more alloy systems. With such a growing dataset, we expect to systematically test the model performance with various combinations of model configurations and training data sizes, in the hope of understanding the scaling law of AlloyGPT models, triggering emergent capabilities90 and achieving optimal performance91. Therefore, future efforts may focus on expanding the training datasets to include diverse alloy systems and processing parameters, enhancing model architectures, and integrating textual and numerical data for a more comprehensive understanding. By leveraging the scalability of language models and the generalizability of the alloy-specific language, AlloyGPT is expected to serve as a cornerstone in the development of next-generation alloys, accelerating innovation and reducing resource-intensive experimental workflows.

Methods

The Al-based alloy database

As a prototype case study, in this work, we adopt a database on Al-based alloys with five alloying elements, including Ni (0–4%), Er (0–2%), Zr (0–2%), Y (0–1%) and Yb (0–1%), based on CALPHAD simulations using Thermo-Calc92. In a previous study57, this database has been used to design printable Al alloys with high strength, which have been validated via experiments. Focusing on AM applications, we document microstructures and properties for both as-built and fully aged conditions. The key microstructure features include mol% of L12 phase, ternary phase, Al3Ni phase, and Al3Zr phase, while the properties cover diffusion resistivity93, misfit strain, coarsening rate metric, freezing range, CSC, and HCS.

Diffusion resistivity is defined as

$$D=\mathop{\sum }\limits_{i=2}^{N}\frac{{\left({\bar{C}}_{i}^{\beta }-{\bar{C}}_{i}^{\alpha }\right)}^{2}}{{M}_{i}}$$
(1)

where \(i=2,\ldots ,N\) refers to the alloying elements, \({\bar{C}}_{i}^{\alpha }\) and \({\bar{C}}_{i}^{\beta }\) refer to the average concentrations of alloying element i in the matrix phase \(\alpha\) and the precipitate phase \(\beta\), respectively, and \({M}_{i}\) refers to the diagonal component of the mobility matrix for element i. The misfit, \(\bar{\epsilon }\), is one minus the ratio between the average lattice parameters of the precipitate phases and the matrix phase. The coarsening metric, CM, is defined as the ratio of the misfit strain to the diffusion resistivity:

$${CM}=\frac{\bar{\epsilon }}{{\sum }_{i=2}^{N}{\left({\bar{C}}_{i}^{\beta }-{\bar{C}}_{i}^{\alpha }\right)}^{2}/{M}_{i}}$$
(2)
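Eqs. (1) and (2) can be computed directly from the per-element concentration differences and mobilities; for example (a sketch with our own variable names):

```python
def diffusion_resistivity(dC, M):
    """Eq. (1): D = sum_i (Cbar_i^beta - Cbar_i^alpha)^2 / M_i, where dC[i]
    is the concentration difference of alloying element i between the
    precipitate and matrix phases and M[i] its mobility component."""
    return sum(d * d / m for d, m in zip(dC, M))

def coarsening_metric(misfit, dC, M):
    """Eq. (2): CM = misfit / D, the ratio of the misfit strain to the
    diffusion resistivity."""
    return misfit / diffusion_resistivity(dC, M)
```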

The freezing range, FR, is measured as the temperature difference between the state at which the alloy is completely liquid and that at which it is 99% solidified:

$${FR}={T}_{0}^{S}-{T}_{0.99}^{S}$$
(3)

where \({T}_{0}^{S}\) and \({T}_{0.99}^{S}\) are the lowest temperatures at which the system is fully liquid and 99% solid, respectively. The CSC, introduced in ref. 94, is defined as

$${CSC}=\frac{{t}_{v}}{{t}_{R}}$$
(4)

where \({t}_{v}\) is the normalized cracking-vulnerable time corresponding to a liquid volume fraction between 0.1 and 0.01 and \({t}_{R}\) is the normalized relaxation time corresponding to a liquid volume fraction between 0.6 and 0.1. The HCS is defined as

$${HCS}={FR}\times {CSC}$$
(5)
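Eqs. (3)-(5) amount to simple arithmetic on the quantities defined above; a minimal sketch (variable names are ours):

```python
def freezing_range(T_full_liquid, T_99_solid):
    """Eq. (3): FR = T_0^S - T_0.99^S."""
    return T_full_liquid - T_99_solid

def crack_susceptibility(t_v, t_R):
    """Eq. (4): CSC = t_v / t_R, the ratio of the cracking-vulnerable time
    to the relaxation time."""
    return t_v / t_R

def hot_cracking_susceptibility(fr, csc):
    """Eq. (5): HCS = FR * CSC."""
    return fr * csc
```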

To ensure printability, compositions that minimize FR, CSC or HCS are preferred. For these physical quantities with nontrivial units, we normalize their values with respect to those of the benchmark alloy, i.e., “Alloy 1” in ref. 59. The distributions of these quantities are shown in Supplementary Fig. 13. More details can be found in ref. 57.

The alloy-specific language

The alloy-specific language proposed in this work plays the key role in representing scientific data in alloy research. Different from many natural languages, materials science data are numerically intensive and carry physics-based causality. We design our alloy-specific language and grammar to be expressive, efficient and flexible. As exemplified in Fig. 1c, d, the key-value pairs can express many quantitative features in alloy research and manufacturing; the scientific notation adopted for the value part covers a broad range (i.e., from −9.999e+99 to +9.999e+99) with a flexible resolution (i.e., ranging from 0.001e-99 to 1.000e+99). The grouping of text into composition, structure, and property parts makes it straightforward to encode the logic flow of forward prediction and inverse design using the sequence order. The language is modular, and thus it is straightforward to include more features and details (e.g., text blocks on the processing condition, additional microstructural features, and more properties) while remaining readable for human researchers.
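The signed scientific notation and key-value grouping described above can be illustrated with a small encoder; the exact token layout below (sign, three decimal places, signed two-digit exponent, `key=value` pairs joined by semicolons) is our assumption for illustration, not the published grammar:

```python
def format_value(x):
    """Render a value in signed fixed-precision scientific notation
    (e.g., '+6.200e-01'), covering the broad range described in the text."""
    return f"{x:+.3e}"

def encode_blocks(pairs):
    """Encode key-value pairs as one alloy-language-style text fragment."""
    return "; ".join(f"{k}={format_value(v)}" for k, v in pairs.items())
```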

AlloyGPT model and training

AlloyGPT includes 36 MHA layers and ~400 M trainable parameters. The full structure of the model is shown in Supplementary Fig. 4a. We augment a character-level tokenizer to include words for all elements and signs used in the alloy-specific language. The alloy-specific language dataset includes sentences for both forward prediction and inverse design. To prepare it, we start with the 523,599 compositions with their structures and properties and randomly separate them into a training set (90%, with 471,239 data entries) and a test set (10%, with 52,360 entries). For training/testing, each data entry is converted into two sentences, one in composition-structure-property order and one in property-structure-composition order, using the alloy-specific language (as shown in Fig. 1c, d). Accordingly, 942,478 sentences corresponding to the 471,239 training entries are used to train the AlloyGPT model. We train the AlloyGPT model from scratch on the training set for 4 epochs with an AdamW optimizer95 using an NVIDIA A40 GPU. No overfitting has been observed (Supplementary Fig. 4b). We developed our code based on nanoGPT96.
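The 90/10 random split described above can be sketched as follows (a minimal illustration; the seed and function names are ours, and each surviving entry is later rendered as two sentences in the two task orders):

```python
import random

def split_dataset(entries, train_frac=0.9, seed=0):
    """Randomly split data entries into training and test sets
    (90%/10% here, matching the split used in the text)."""
    idx = list(range(len(entries)))
    random.Random(seed).shuffle(idx)
    cut = int(len(entries) * train_frac)
    return [entries[i] for i in idx[:cut]], [entries[i] for i in idx[cut:]]
```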

Design accuracy and diversity evaluation

We use relative L1 errors to measure the design accuracy for the properties.

$${L}_{1}^{{rela}}\left[x,\,y\right]=\frac{\left|y-x\right|}{\left|x\right|}$$
(6)

where x is the ground truth or input value, and y is the achieved value based on the designed composition.
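Eq. (6) translates directly into code (a one-line sketch with our own function name):

```python
def relative_l1(x, y):
    """Eq. (6): relative L1 error |y - x| / |x| between the target or
    ground-truth value x and the achieved value y."""
    return abs(y - x) / abs(x)
```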

For the repeated designs with the same input target and a fixed prediction temperature, we use the coefficient of variation, CV, to measure the diversity of the suggested compositions.

$${CV}=\frac{\sigma }{\mu }$$
(7)

where \(\sigma\) is the standard deviation and \(\mu\) is the mean of the designed mol% of the individual elements.
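Eq. (7) can be sketched as follows (we assume the population standard deviation here; whether the sample or population estimator was used is not specified in the text):

```python
import statistics

def coefficient_of_variation(values):
    """Eq. (7): CV = sigma / mu over the designed mol% values of one
    element across repeated designs (population standard deviation
    assumed)."""
    return statistics.pstdev(values) / statistics.mean(values)
```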