Main

In silico material discovery traditionally relies on high-fidelity methods like density functional theory1 and ab initio molecular dynamics2. However, prohibitive computational costs limit their scalability for high-throughput screening. Moreover, mechanistic understanding of many advanced materials remains elusive due to their complex compositions and phase instabilities. Consequently, breakthroughs in functional materials, such as correlated oxides3,4 and quantum materials5,6, have often been serendipitous rather than driven by theory. Achieving reliable, scalable and predictive design of materials requires a paradigm shift.

With the rise of AI in materials science, there has been a surge of methods aiming to overcome these limitations, ranging from surrogate models7,8 to MLIPs9,10,11,12,13 and generative models14,15. These models enable rapid predictions, accelerate large-scale simulations and facilitate the generation of novel materials. As a result, they have greatly advanced fields such as energy storage16, electronics17, catalysis18 and biomedical applications19. Among these promising ML approaches, graph-based models in materials science have become increasingly popular due to their versatile graph representation of atomistic systems in which each atom is represented as a node and chemical bonds to neighbouring atoms are represented as edges. Although these graph-based methods have shown success in accurately predicting material properties, they typically lack the capacity to handle tasks that require understanding scientific context, literature-based insights and domain-specific language20. In particular, these models do not support human–AI interaction through user prompts or textual descriptions, making it difficult to incorporate expert domain knowledge and user-specified requests to close the feedback loop.

This bottleneck has inspired exploration into large language models (LLMs). LLMs like BERT21, GPT22, Mistral23, Llama24 and DeepSeek25 have shown promise in scientific question answering26 and information retrieval27. Recent efforts have incorporated LLMs to solve materials problems28,29 by leveraging pretrained or multimodal architectures.

Recent benchmarks, including MatSci-NLP30, MaScQA31, HoneyBee32 and others33,34,35,36,37, provide valuable baselines for evaluating domain-specific reasoning. However, these methods primarily rely on text-based representations, such as chemical formulas28, SMILES strings29,38 and Crystallographic Information Files (CIF)39. Although informative, these textual inputs often fail to explicitly capture the complex 3D spatial relationships and local environments inherent in atomic structures. Consequently, they exhibit inferior property prediction performance compared with graph-based models40. Universal MLIPs11 now allow for the extraction of rich structural information from atomistic embeddings, offering a feasible pathway for multimodal integration.

In this work, we present MatterChat, a multimodal LLM for materials science. MatterChat utilizes a modular framework that bridges pretrained language and materials models. By freezing the weights of the LLM and the material encoder, our system enables plug-and-play flexibility with components like CHGNet41 or many-body atomic cluster expansion (MACE)11. This design preserves foundation model generalization and facilitates future extensions without retraining the entire architecture. MatterChat integrates structure data with textual queries, overcoming traditional LLM limitations in quantitative prediction. It maintains robust human–AI interaction and enables advanced reasoning for synthesis guidance. Embedding analysis confirms that MatterChat effectively preserves structure–property information, supporting a multimodal retrieval-augmented generation (RAG) approach to enhance inference robustness.

Results

Overview of MatterChat

Figure 1a presents the architecture of MatterChat, designed to process both material structures and user requests as inputs to generate text-based outputs for tasks such as material property prediction, structural analysis and descriptive language generation. MatterChat consists of three core components: the material processing branch, the language processing branch and the bridge model. The material processing branch extracts atomic-level embeddings from material structures represented as graphs. These embeddings are then processed by the bridge model, which uses trainable queries to produce language model-compatible embeddings. Finally, the language processing branch processes the user’s text-based prompt (for example, ‘What is the formation energy of the material?’) into language embeddings. These embeddings are then combined with the query embeddings generated by the bridge model and fed into the LLM to produce the final output in text format. Below, we provide the details of each component.

Fig. 1: Overview of MatterChat: a modular multimodal LLM for material-based question answering.

a, MatterChat architecture: the system includes a material encoder that generates atom embeddings and an LLM that processes language data. These components are connected by a trainable bridge model, which aligns material structure with natural language to support tasks such as material description and property prediction. b, Elemental distribution across 142,899 compositions, representing the dataset’s compositional diversity. c, Dataset distribution shown by space groups (outer ring) and crystal systems (inner ring), illustrating structural variation within the dataset.


Material processing branch

The material processing branch encodes material structures as graphs that capture the atomic local environment. We specifically utilize the encoder modules of state-of-the-art graph-based universal MLIP models, such as CHGNet41 and MACE11, as feature extractors to process these graphs. These encoders are pretrained on a diverse dataset of materials, encompassing a wide range of symmetries, compositions and bonding types, enabling them to effectively model complex atomic interactions and structural details. By capturing essential compositional features, such as atomic types and chemical bonds, along with spatial features like bond angles, these pretrained encoders generate high-quality atom embeddings that are both physically meaningful and well suited for downstream tasks.

Language processing branch

The language processing branch is used to process the user’s text-based prompts, such as requests for property predictions, chemical formulas, space group information or other material characteristics. We use the Mistral 7B LLM23, one of the latest open-source LLMs, chosen for its exceptional performance across a wide range of scientific and non-scientific tasks. This branch processes each prompt, transforming it into dense embeddings that capture the semantic content of the enquiry. These embeddings are then combined with the query embeddings processed by the bridge model using a structured fusion approach, allowing the model to effectively incorporate both textual and material information. This integration enables the LLM to generate precise and contextually relevant responses tailored to the user’s specific material-related prompts.
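The fusion step described above can be sketched as combining the bridge model's query embeddings with the tokenized prompt embeddings before they enter the LLM. The following is a minimal numpy sketch; the toy sequence lengths and the prepend-style fusion are illustrative assumptions, not the exact implementation.

```python
import numpy as np

# Assumed shapes: 32 query embeddings from the bridge model and a short
# prompt, both projected into the LLM's hidden dimension (for example, 4096).
hidden_dim = 4096
query_embeddings = np.random.randn(32, hidden_dim)   # from the bridge model
prompt_embeddings = np.random.randn(12, hidden_dim)  # from the LLM's embedding layer

# Structured fusion sketched here as prepending the material queries to the
# text sequence, so the LLM attends jointly over both modalities.
llm_input = np.concatenate([query_embeddings, prompt_embeddings], axis=0)
print(llm_input.shape)  # (44, 4096)
```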

Bridge model

To facilitate the integration between atom embeddings and the language processing branch, we developed a bridge model inspired by the BLIP2 architecture42 based on a multilayer transformer framework. This bridge model includes 32 trainable query vectors that interact with atom embeddings using an alternating attention mechanism. Cross-attention in even-numbered layers extracts key features from the atom embeddings, whereas self-attention in odd-numbered layers enhances representational depth. This approach refines the atom embeddings into query embeddings that are most connected to text (Fig. 1a). Finally, these refined representations are mapped to LLM-compatible embeddings via a linear projection layer.
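The alternating attention pattern can be sketched as follows. This is a minimal numpy illustration of the even/odd cross-attention and self-attention scheme; the toy dimensions, the single-head weight-free attention and the omission of feed-forward sublayers are simplifications, not the actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d = 64                               # toy hidden size (the real model is larger)
queries = rng.normal(size=(32, d))   # the 32 trainable query vectors
atoms = rng.normal(size=(10, d))     # atom embeddings from the frozen encoder

# Alternating attention: cross-attention to atom embeddings in even layers,
# self-attention among the queries in odd layers, with residual connections.
x = queries
for layer in range(4):
    if layer % 2 == 0:
        x = x + attention(x, atoms, atoms)   # cross-attention extracts atom features
    else:
        x = x + attention(x, x, x)           # self-attention deepens representations

# Final linear projection maps the refined queries to LLM-compatible embeddings.
W = rng.normal(size=(d, 128))
llm_tokens = x @ W
print(llm_tokens.shape)  # (32, 128)
```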

Figure 1b,c provides an overview of the dataset of crystalline structures used in our training set. Figure 1b visualizes the material distribution on the periodic table, highlighting that the dataset evenly spans a diverse range of elements up to plutonium. Figure 1c depicts the distribution of crystalline structures by space group across the dataset. The dataset was curated from the Materials Project43 and contains 142,899 material structures. For each structure, we generated a corresponding text-based dataset encompassing 12 tasks: three descriptive tasks (chemical formula, space group and crystal system) and nine property prediction tasks. These property prediction tasks include metallicity, direct bandgap, stability, experimental observation, magnetic status, magnetic order, formation energy, energy above the hull and bandgap (Fig. 1a). Further details regarding the training scheme, hyperparameters and dataset curation are provided in Methods.

Figure 2 illustrates examples of a human–AI interaction with MatterChat across a diverse range of material property prediction and analysis tasks. It shows MatterChat’s ability to effectively address a broad spectrum of user prompts ranging from fundamental material attributes (for example, chemical formulas, space groups and crystal system) to complex material properties (for example, thermal stability, bandgaps, formation energies and energy above the hull). Figure 2a shows three interactive examples of material property prompts from randomly selected materials from the Materials Project database. The top left panel presents a human–AI query interface with MatterChat for the material with an mp-id of mp-1001021. It provides a detailed profile including the chemical formula Y2Zn4Se2, its crystalline structure denoted by the space group Fd-3m, and electronic properties such as a bandgap of 0.23870 eV. The interface also addresses the material’s lack of thermal stability. The top middle panel shows the interaction example with the material with an mp-id of mp-1028281. It provides a comprehensive breakdown of the material’s composition attributes, including its chemical formula (Mg14VSb) and its space group (Amm2). The interaction further predicts that the material is both magnetic and metallic, and its formation energy is estimated at 0.07219 eV per atom. The top right panel provides an interaction example with MatterChat of the material with an mp-id of mp-10198. This panel addresses the user’s query about the chemical composition Mn3PdN and its cubic crystal structure, with the space group classified as Pm-3m. Additionally, it estimates that the material possesses an indirect bandgap, which is an important characteristic for applications in electronics. MatterChat also accurately predicts the ferromagnetic behaviour that the material exhibits, and it mentions its energy above hull value at 0.01357 eV per atom.
In the bottom panel, we present a comparative evaluation of MatterChat’s performance on formation energy evaluation tasks for newly discovered materials from GNoME44. The model was compared against commercial LLMs such as Gemini45, GPT-4o46 and DeepSeek25. The results show MatterChat’s superior accuracy in estimating formation energies, consistently delivering predictions closer to the ground truths. For example, MatterChat’s formation energy predictions for mp-3202380 and mp-3206774 show a remarkable alignment with the ground-truth values. These results demonstrate MatterChat’s ability to integrate structural and textual data seamlessly for a wide range of material property tasks.

Fig. 2: MatterChat accurately predicts material properties and outperforms state-of-the-art LLMs.

a, Illustration of multimodal material property queries using MatterChat. The model accurately interprets user prompts to predict chemical formulas, crystallographic properties, stability, electronic bandgap, magnetic order and energy metrics of materials. The three panels demonstrate the framework’s ability to address diverse materials science enquiries, showing its alignment of graph-based and textual embeddings for precise question answering. b, Comparative evaluation of formation energy predictions for newly discovered materials from GNoME44. Predictions from MatterChat are compared against the ground-truth values along with evaluations from commercial LLMs (Gemini45, GPT-4o46 and DeepSeek25). The results show the accuracy and stability of MatterChat in quantitative material evaluation tasks, closely aligning with the ground truth and demonstrating its ability to integrate material graph embeddings for precise property prediction.


Figure 3 demonstrates MatterChat’s advanced reasoning capabilities, showing how it leverages the comprehensive knowledge base of LLMs to address complex materials science challenges. By using a multimodal query system, MatterChat effectively combines material structure data with textual reasoning. This integration facilitates a working memory scheme47, which enables the model to provide domain-specific reasoning, detailed synthesis procedures and explanations that are deeply grounded in the structural properties of materials. Figure 3a presents the chat log for silicon with the space group Cmcm. MatterChat not only retrieves the chemical formula and the correct space group but also provides a rationale for the structural instability of this silicon phase. The model explains that the Cmcm space group exhibits a higher energy per unit cell compared with the thermodynamically stable cubic diamond structure of silicon, making it less likely to occur under standard conditions. Figure 3b illustrates an interaction regarding a popular semiconductor material, gallium nitride (GaN). Here MatterChat accurately identifies the chemical formula and space group (P63mc), and generates a detailed metal–organic chemical vapour deposition synthesis protocol that aligns with established experimental standards. Specifically, the model identifies trimethylgallium and ammonia as precursors within an 800–1,000 °C temperature window, directly matching landmark methods such as those reported elsewhere48,49. This demonstrates the model’s ability to leverage inherited knowledge to provide practical, grounded and experimentally viable scientific reasoning. Figure 3c explores an interaction for a widely used ferrite material, yttrium iron garnet. MatterChat is able to take the structure and generate detailed text descriptions. Additionally, MatterChat can further generate a synthesis protocol for YIG that aligns with established experimental procedures50.
By identifying the correct 3:5 mixing ratio of Y2O3 and Fe2O3 and specifying critical parameters like the 5 °C min−1 thermal ramp rate, the model demonstrates its capability to apply domain-specific knowledge in accordance with standard practices and characterization techniques like X-ray diffraction and scanning electron microscopy50. MatterChat generates synthesis guidance via a modular two-stage process without task-specific supervision. First, structural attributes, including formula, space group and crystal system, are extracted via a frozen encoder and tokenized to form a persistent working memory. Second, the LLM generates responses conditioned on this context, aligning with a symbolic memory framework47 in which the inferred material facts anchor reasoning. By combining the LLM’s inherited knowledge with explicit structural signals, MatterChat produces physically plausible, literature-aligned synthesis outputs. This modularity ensures a clear boundary between material perception and linguistic reasoning, enhancing both interpretability and structure-conditioned generation.

Fig. 3: MatterChat has the ability to solve more sophisticated tasks inherited from the pretrained LLM.

a, Material property query for silicon (Si), including its chemical formula, space group, stability, and the reasoning for why it is not stable under standard conditions. b, Highlights a material query for GaN, providing its chemical formula, space group, and a step-by-step synthesis procedure using methods like hydride vapour phase epitaxy, metal–organic chemical vapour deposition and molecular-beam epitaxy. c, Material query interaction, yttrium iron garnet (YIG; Y3Fe5O12), detailing its chemical formula, space group and a simplified step-by-step synthesis procedure using the solid-state reaction method.

MatterChat-extracted embeddings contain structural and property information

We further explore MatterChat’s ability to leverage material structural information by providing a detailed visualization and clustering analysis with the uniform manifold approximation and projection (UMAP) dimension reduction technique51. Figure 4a–e shows comprehensive visualizations of embeddings processed by the bridge model, with all material samples that contain silicon (Si), carbon (C) and their composite compounds (for example, SiC and SixCy) from the Materials Project database52. UMAP was used to reduce the embeddings from an original 4,096 dimensions to two dimensions, with the x and y axes corresponding to the first and second reduced dimensions, respectively.

Fig. 4: UMAP visualization of structural embeddings extracted from the bridge model.

a, Visualization of samples containing Si and C elements from the Materials Project database, showing how materials cluster based on their structural embeddings extracted from the bridge model. The value indicates the structural similarity calculated using the SOAP descriptor in combination with the REMatch kernel (Methods). b,c, Visualizations of the SiC subgroup colour coded by structural similarity (b) and formation energy (c). The two clusters exhibit high structural similarity, with formation energy further assisting in distinguishing between them. d,e, Visualizations of Si subgroup colour coded by structural similarity (d) and formation energy (e). The two clusters demonstrate a smooth transition in both structural similarity and formation energy, indicating that both factors captured by the structural embeddings contribute to the observed clustering. f, Proposed multimodal RAG for robust prediction.


Figure 4a presents the visualizations containing all the selected materials; each sample is colour coded with a structure similarity score53. The clustering generally follows distinctions in chemical compositions. Additionally, materials with the same atomic composition are grouped into separate clusters based on crystalline structural differences (for example, carbon with diamond versus graphite crystalline structure). Figure 4b,d shows the zoomed-in visualizations of clustering results for materials consisting exclusively of Si and SiC compositions. Figure 4d shows the gradient of structure similarity scores, ranging from blue (low similarity) to red (high similarity), demonstrating how closely related structural features result in spatial proximity within the embedding space. However, an interesting exception is observed with SiC (Fig. 4b): despite its identical composition and similar structural phases, two distinct clusters of SiC emerge, suggesting that factors beyond composition and structure alone influence their separation. To further explore factors that influence clustering, we labelled the samples according to their formation energy, with results displayed for SiC (Fig. 4c) and Si (Fig. 4e). These figures clearly show a trend from low to high formation energy. This analysis reveals that clusters grouped by structural similarity also align closely in terms of formation energy. Such findings indicate the model’s ability to produce embeddings that not only differentiate structural characteristics but also correlate with key material properties. To evaluate the generalization ability of MatterChat across a broader chemical space, we extended the structural embedding analysis beyond the initial silicon–carbon system to diverse material families (Supplementary Figs. 1–4). These include various iron-based compounds (oxides, sulfides, nitrides and carbides), as well as transition metal oxides containing iron, copper, cobalt and molybdenum.
Similar trends are observed. The UMAP visualizations of the learned embeddings demonstrate that the model effectively captures the distinctive characteristics of different inorganic compounds. Distinct compound types form well-separated clusters in terms of both average structural similarity and formation energy similarity, whereas smooth transitions are observed within individual clusters. These findings suggest that both structural and property-related information are encoded in the learned representations, which is consistent with the property-supervised training of the model. Overall, the results indicate that the representations learned by the bridge model are robust and exhibit strong discriminative power across diverse material classes. Given that the embeddings derived from the bridge model preserve both material structure and property-relevant information, we implemented a multimodal RAG mechanism during inference (Fig. 4f). Instead of relying solely on a single output from MatterChat for each query–sample pair, we retrieve two additional samples from the material pool (training set). This retrieval is based on the L2 similarity between the embeddings of the sample material and those in the pool. We then aggregate all three results into the final output by applying a majority-voting strategy for classification tasks and averaging for quantitative tasks. Such a method further enhances the overall robustness of MatterChat across different tasks. The details of the visualization method are provided in Methods.
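The retrieve-and-aggregate step can be sketched concretely. The following is a minimal illustration of L2-nearest retrieval with majority voting for classification and averaging for regression; the function name, signature and toy data are illustrative, not the paper's API.

```python
import numpy as np
from collections import Counter

def rag_predict(sample_emb, pool_embs, pool_preds, own_pred, task="classification", k=2):
    """Retrieve the k nearest pool samples by L2 distance and aggregate.

    Majority voting over the three outputs for classification tasks,
    averaging for quantitative tasks, as described in the text.
    """
    dists = np.linalg.norm(pool_embs - sample_emb, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    results = [own_pred] + [pool_preds[i] for i in nearest]
    if task == "classification":
        return Counter(results).most_common(1)[0][0]  # majority vote
    return float(np.mean(results))                    # average for regression

# Toy usage with a pool of five stored training-set embeddings.
pool = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0], [6.0, 6.0]])
labels = ["stable", "stable", "unstable", "unstable", "unstable"]
print(rag_predict(np.array([0.1, 0.1]), pool, labels, "unstable"))  # 'stable'
```

Here the model's own (wrong) "unstable" output is overruled by the two nearest retrieved neighbours, illustrating how retrieval can improve robustness.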

Comprehensive quantitative analysis for all material tasks

To evaluate MatterChat, we benchmarked its performance across nine tasks on the evaluation set (14,290 samples) against open-source LLMs (Vicuna54 and Mistral23) and pretrained physical ML models (SchNet55, CHGNet41 and MACE11). For LLM baselines, material structures were serialized as CIF-derived text within identical prompt structures (Methods).

In classification (Fig. 5a–f), including metallicity, stability and magnetism, MatterChat consistently outperformed all baselines. In particular, it achieved higher accuracy than specialized physical models like CHGNet, demonstrating that integrating graph-based data with natural language reasoning provides a more holistic representation of material chemistry.

Fig. 5: Performance comparison of MatterChat, open-source LLMs and physical pretrained models across nine material property tasks.

a–f, Classification task accuracies for predicting whether a material is metallic (a), has a direct bandgap (b), is thermodynamically stable (c), is experimentally observed (d), is magnetic (e) and its magnetic ordering type (f), in which MatterChat consistently outperforms other models. g–i, RMSE results for numerical property predictions, demonstrating MatterChat’s superior precision in bandgap (g), formation energy (h) and energy above the hull (i) tasks. j–l, Parity plots for bandgap (j), energy above the hull (k) and formation energy (l), illustrating the alignment between predicted values from MatterChat (with both CHGNet and MACE encoders) and ground-truth values.


For numerical property prediction (Fig. 5g–i), including formation energy, energy above hull and bandgap, MatterChat yielded the lowest root mean squared error (RMSE), whereas pure LLMs were excluded from comparison due to inherent limitations in quantitative precision56. The framework’s robustness was further validated through fivefold cross-validation (Supplementary Figs. 7 and 8). Although the raw performance values of cross-validation decreased slightly across folds due to reduced training data, results remained consistent with the original train/test data split. These findings demonstrate that MatterChat effectively bridges qualitative scientific reasoning with quantitative atomistic characterization across diverse material domains.
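The regression metric used above can be stated concretely. This is a minimal sketch of the RMSE computation for the numerical property tasks; the toy formation-energy values are illustrative, not results from the paper.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error, the metric reported for the numerical tasks.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy formation-energy predictions (eV per atom); values are illustrative.
print(rmse([-1.20, -0.45, 0.10], [-1.10, -0.50, 0.05]))  # ≈ 0.071
```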

Comparative study and visual attention analysis

To evaluate MatterChat’s architectural effectiveness, we compared it against established baseline strategies across all material property tasks (Extended Data Table 1). Our multimodal bootstrapping approach42 notably outperforms both the Simple Adapter57,58 and pure LLM baselines, achieving superior accuracy while maintaining the efficiency of frozen pretrained components. Extensive ablation studies on bridge configurations, encoder selection and pretraining strategies further confirm that optimal cross-attention frequency and bridge pretraining are critical for model convergence and predictive precision (Methods). Ablation studies across different LLM backbones (for example, Llama 3 and DeepSeek R1) and GNN encoders further demonstrate the architectural flexibility of MatterChat (Supplementary Table 3). Integrating a multimodal RAG module further enhances performance, reducing regression RMSE by ~12% and improving classification accuracy by ~0.6%. This improvement is achieved with negligible computational overhead (latency, ~0.7%), demonstrating a favourable speed–accuracy trade-off for large-scale screening. Unless otherwise stated, baseline figures (for example, Figs. 2 and 3) reflect performance without RAG.

To assess cross-dataset generalization, we evaluated MatterChat on an external resource from the GNoME project44. Despite considerable distributional shifts in target properties relative to our training data (Fig. 6d–f), MatterChat—particularly the MACE-based variant—demonstrates robust transferability, achieving superior accuracy across all tasks without additional fine tuning (Extended Data Table 2). These results indicate that equivariant structural representations generalize more effectively across diverse data sources. Furthermore, these gains underscore the advantage of MatterChat’s modular framework, which enables strong performance on external benchmarks without full-model retraining.

Fig. 6: Visualization of structure–text alignment in MatterChat’s bridge model.

a, Cosine similarity matrix between 24 material query embeddings and 24 text token embeddings, showing structured alignment patterns across different modalities. A complete list of the materials corresponding to indices 1–24, along with their text token embeddings, is provided in Supplementary Table 4. b, Material queries activated during stability classification (across 20 randomly selected stable and 20 unstable material examples). A query is defined as activated if it ranks among the top-5 (k = 5) most-attended embeddings for key linguistic tokens. The union of these activations across each class reveals that although foundational structural features are concentrated in indices 0–5 and 9, indices 25 and 31 are selectively utilized for stable materials. c, Detailed attention distribution values of the ‘stable’ and ‘not’ tokens across material query indices (n = 20 per material class). Both tokens prioritize indices 0–4 as core structural descriptors. An asymmetric pattern emerges: ‘stable’ exhibits distinct attention to indices 25 and 31, whereas ‘not’ shows elevated intensity at index 9. d–f, Distribution comparisons between the MPtrj test dataset and the GNoME44 out-of-distribution dataset for three key properties: formation energy (d), bandgap (e) and energy above hull (f) (log scaled). These histograms show clear distributional differences between the MPtrj test set and GNoME datasets across all three properties.


To further investigate the interpretability of structure–text alignment, we analysed both the similarity matrix between material and text embeddings and the attention behaviour of the bridge model. We randomly selected 35 materials and computed the cosine similarity between the 24 structure embeddings (queries) and 24 token embeddings from the paired textual descriptions (chemical formula, space group and crystal system). This reveals consistent diagonal alignment in the embedding space (Fig. 6a), suggesting that specific structural slots are consistently linked with semantically meaningful linguistic features. The structural embeddings (indices 1–24) represent the graph-based representations of the materials listed in Supplementary Table 4, whereas the corresponding text embeddings represent their linguistic descriptors comprising chemical formula, space group and crystal system.
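The similarity analysis above amounts to a pairwise cosine matrix between the two embedding sets. The following is a minimal numpy sketch; the random, artificially correlated embeddings are stand-ins, not the paper's data, and serve only to reproduce the diagonal-dominance pattern.

```python
import numpy as np

def cosine_similarity_matrix(A, B):
    # Row-normalize each set, then S[i, j] = cos(A[i], B[j]) via inner product.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

rng = np.random.default_rng(1)
structure_embs = rng.normal(size=(24, 64))  # 24 material query embeddings (toy)
# Toy paired text embeddings: correlated with their structure counterparts so
# that the diagonal dominates, mimicking the alignment pattern in Fig. 6a.
text_embs = structure_embs + 0.1 * rng.normal(size=(24, 64))

S = cosine_similarity_matrix(structure_embs, text_embs)
# Paired (diagonal) entries exceed the matrix-wide mean similarity.
print(S.shape, np.diag(S).mean() > S.mean())
```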

Beyond the diagonal alignment shown in Fig. 6a, off-diagonal patterns reveal a structured embedding space. Indices 16–23 show that complex multicomponent systems (for example, Li5La4TiNb7O28) cluster through shared coarse-grained characteristics rather than strictly element-specific distinctions, though index 19 remains distinct, preserving compositional specificity. Similarly, strong mutual similarities for indices 13 and 14 (cubic, Fm-3m) and 20 and 21 (monoclinic, 2/m) reflect the influence of shared structural symmetry on the joint representation. Although supporting physically meaningful clustering, these patterns identify a resolution limit for subtle intra-class variations, indicating enhanced structural resolution as a priority for future refinement.

To investigate the model’s internal inference mechanism, we examined the attention distributions across material query indices for 20 randomly sampled stable and 20 unstable samples (Fig. 6b,c). Although foundational structural features are consistently captured in indices 0–4 and 9, distinct class-specific markers emerge that guide the model’s thermodynamic predictions. Specifically, stable materials uniquely activate indices 25 and 31, suggesting that these embeddings encode key structural features associated with stability. Conversely, index 9 appears to function as a marker for instability; although it is used for both classes, its intensity is notably higher for unstable materials, suggesting it identifies energetically unfavourable atomic arrangements. These distinct patterns of query selection and attention intensity demonstrate that MatterChat does not merely recall data but effectively maps linguistic concepts onto physically relevant structural descriptors during inference.
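The top-k activation criterion used in this analysis can be sketched directly. The toy attention values below are illustrative, not measured data; only the selection rule (a query counts as activated if it ranks in the top-5 most-attended slots for a token) follows the text.

```python
import numpy as np

def activated_queries(attention_row, k=5):
    """Indices of the top-k most-attended material query slots for one token.

    Mirrors the activation criterion described for Fig. 6b.
    """
    return set(np.argsort(attention_row)[::-1][:k])

# Toy attention from the token 'stable' over the 32 material query slots.
attn = np.zeros(32)
attn[[0, 1, 2, 3, 4]] = [0.9, 0.8, 0.7, 0.6, 0.5]  # core structural descriptors
attn[25], attn[31] = 0.4, 0.3                       # weaker class-specific markers

top5 = activated_queries(attn, k=5)
print(sorted(top5))  # [0, 1, 2, 3, 4]
```

Taking the union of such sets across all examples of a class yields the per-class activation maps discussed above.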

Discussion

In this study, we present MatterChat, a multimodal framework that achieves superior performance in material property prediction and scientific reasoning tasks by leveraging a more effective representation of materials. A key innovation of MatterChat is its ability to leverage existing advancements in both materials science and language modelling by integrating a pretrained material foundation encoder with a pretrained LLM. Rather than training an entire model from scratch, MatterChat achieves strong performance by training only a lightweight bridge model, efficiently aligning material structure representations with textual understanding and maintaining high accuracy across diverse materials science tasks. Moreover, MatterChat is designed for multitask learning, enabling it to handle both classification and numerical property prediction. This capability allows the framework to tackle a diverse range of materials science tasks within a unified model. Another advantage of our approach is the use of graph-based structural embeddings instead of relying solely on a .cif text input. Although CIF files encode atomic structures, their text-based format relies entirely on attention mechanisms, which can struggle to explicitly capture geometric symmetries and increase computational overhead due to lengthy tokenization. By directly processing atomic graphs, MatterChat effectively preserves material symmetry and spatial relationships, leading to more accurate structure–property learning while maintaining computational efficiency. Furthermore, we evaluated its performance on derived properties that require structural internalization, such as atom counts and density (Supplementary Table 5). Although the model accurately retrieves discrete identifiers like the number of atoms per unit cell (28 atoms), it exhibits a ‘resolution gap’ in predicting continuous numerical properties such as volume and density.
This shows a common limitation of LLMs in high-precision zero-shot numerical regression, despite their success in structural reasoning tasks.

Limitations and future work

(1) Alignment and interpretation: MatterChat’s behavioural success on property tasks may reflect learned correlations rather than deep semantic internalization of graph-based structural semantics. This limits interpretability and compositional reasoning involving structural concepts. Addressing this requires explicit representation-level alignment objectives—such as contrastive losses, modality matching or shared embedding projections—to ensure the LLM fully grounds language in atomic representations59,60,61,62.

(2) Data and reasoning: current training relies on single-turn question–answer pairs, lacking the multistep reasoning and cross-modal inference chaining essential for expert enquiry63,64,65. Future developments should transition towards multiturn, multimodal dialogue trajectories. Techniques like phased instruction tuning66,67,68 and least-to-most prompting69 offer promising pathways for stepwise scientific problem-solving grounded in material structures.

(3) Hallucination and reliability: frozen LLM backbones are susceptible to hallucinations in which language priors dominate structural information70,71,72. Although RAG provides initial contextual grounding73, future modular enhancements are necessary. These include multimodal fusion techniques (for example, mixture of features)74, domain-adaptive fine tuning on expert corpora75,76 and hallucination-aware training objectives25,77,78,79. Finally, post hoc correction frameworks—including fact-checking and self-revision loops—can further enhance the reliability of open-ended scientific responses80,81.

Finally, although the current work prioritizes structure-informed reasoning, MatterChat’s modular architecture is designed for future extensibility to text-only materials benchmarks33,35,36,37. Its interchangeable components provide a flexible framework for potential systematic evaluation on tasks like synthesis question–answer classification from abstracts in future studies, offering a pathway to further bridge the gap between linguistic and structure-aware understanding.

Methods

Dataset curation

In this work, we curated a comprehensive dataset from the Materials Project Trajectory (MPtrj) dataset52, focusing specifically on relaxed samples. By selecting these stable configurations rather than complete trajectory data, we ensure that the dataset captures the equilibrium states of materials, which are more relevant for downstream tasks such as material property prediction. The final dataset consists of 142,899 high-quality samples, offering a rich and diverse representation of inorganic materials.

To facilitate effective model training and evaluation, we randomly shuffled the dataset and partitioned it into training and testing subsets using a 9:1 split ratio. This ensures that a substantial portion of the data is available for learning while maintaining a dedicated portion for rigorous performance validation, allowing us to assess the generalization capability of the model.
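The shuffle-and-split step can be sketched as follows (a minimal illustration; the exact shuffling procedure and seed used in our pipeline are not prescribed here):

```python
import random

def split_dataset(sample_ids, train_frac=0.9, seed=0):
    """Shuffle sample ids and split them into train/test subsets (9:1)."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)   # deterministic shuffle for a fixed seed
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

# Applied to the 142,899 curated samples this yields
# 128,609 training and 14,290 test samples.
train, test = split_dataset(range(142_899))
```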

In addition to the relaxed structural data, we retrieved detailed material property information using the Materials Project API43. Each material is retrieved by its unique mp-id and is enriched with a variety of key descriptors that span both structural and electronic properties. These include

  • Structure: the full atomic structure of the material, detailing atomic positions and bonding.

  • Chemical formula: the overall chemical composition.

  • Space group: the crystallographic space group of the material, reflecting its symmetry properties.

  • Crystal system: the broader classification of the material’s crystal structure.

  • Metallicity: an indicator of whether the material is metallic or insulating.

  • Magnetic properties: whether the material is magnetic and its magnetic ordering (for example, ferromagnetic or antiferromagnetic).

  • Experimental observables: properties that can be compared directly with experimental data.

  • Direct bandgap: the direct bandgap energy, a key property for semiconductors.

  • Stability: whether the material is thermodynamically stable.

  • Energy above hull: a measure of how stable the material is compared with other phases.

  • Bandgap: the electronic bandgap, an important factor in determining a material’s electronic properties.

  • Formation energy: energy required to form the material from its constituent elements.

These attributes offer a comprehensive view of each material, encompassing both its structural arrangement and electronic behaviour. By integrating this wealth of data, our model is capable of capturing complex material property relationships, supporting tasks such as bandgap prediction, stability analysis and metallicity determination. This dataset not only provides a robust foundation for training ML models but also contributes to broader efforts in materials discovery and property optimization.
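One curated record combining these descriptors might be organized as below. The field names and the silicon example values are illustrative only; they are not necessarily the exact keys or values returned by the Materials Project API:

```python
# Illustrative schema for one curated record; field names are our own,
# not necessarily the exact keys returned by the Materials Project API.
record = {
    "material_id": "mp-149",            # unique mp-id (silicon, as an example)
    "formula": "Si",
    "space_group": "Fd-3m",
    "crystal_system": "cubic",
    "is_metal": False,                  # metallicity indicator
    "is_magnetic": False,
    "magnetic_ordering": "NM",          # e.g. FM, AFM, NM
    "band_gap": 0.6,                    # eV (illustrative value)
    "is_gap_direct": False,
    "is_stable": True,
    "energy_above_hull": 0.0,           # eV per atom
    "formation_energy_per_atom": 0.0,   # eV per atom
}
```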

Training details

MatterChat uses a bootstrapping strategy commonly used in multimodal learning for vision–language tasks, adapted here for materials science applications. The training process consists of two main stages: pretraining to align material structures with descriptive text, and fine tuning for both descriptive and property prediction tasks with the LLM module integrated (Supplementary Fig. 2). The pretraining phase aims to establish a foundational alignment between material structures and descriptive text. In this stage, the model connects a frozen graph encoder with pairs of graph data and the corresponding textual descriptions, without attaching the LLM module. Here the bridge model acts as a text generator, learning to extract descriptive graph representations that effectively capture structural information relevant to the text data. MatterChat utilizes pretrained checkpoints for graph encoders. For the invariant encoder, we use the CHGNet model pretrained on the MPtrj dataset of Materials Project structures and energies41,52. For the equivariant encoder, we use the publicly released MACE-MP-0 (large) model11. These encoders provide chemically meaningful atom-level representations and are integrated into MatterChat without additional pretraining of the graph encoder.

This stage consists of three core optimization targets, each with distinct interaction mechanisms between graph embeddings and text, while maintaining a consistent input format:

  1. Graph–text correlation learning (contrastive loss). This task aligns graph and text representations by maximizing the similarity between matched graph–text pairs and minimizing it for mismatched pairs. A contrastive loss is used:

    $${{\mathcal{L}}}_{{\text{correlation}}}=-\mathop{\sum }\limits_{i=1}^{N}\log \left[\frac{\exp ({\text{sim}}({q}_{i},{t}_{i})/\tau )}{{\sum }_{j=1}^{N}\exp ({\text{sim}}({q}_{i},{t}_{j})/\tau )}\right],$$
    (1)

    where qi and ti represent the graph and text embeddings, respectively, and τ is the temperature parameter controlling the distribution’s sharpness.

  2. Graph-driven text prediction (conditional language modelling loss). The bridge model generates descriptive text based on graph data, conditioned through attention mechanisms. The loss function is defined as

    $${{\mathcal{L}}}_{{\text{prediction}}}=-\mathop{\sum }\limits_{t=1}^{T}{\text{log}}[P({y}_{t}|{y}_{ < t},Q)],$$
    (2)

    where Q represents the graph query features, and yt is the token at position t in the output sequence.

  3. Graph–text association (binary cross-entropy loss). This task predicts whether each graph–text pair is correctly matched. A binary cross-entropy loss with hard negative sampling is applied:

    $${{\mathcal{L}}}_{{\text{association}}}=-\mathop{\sum }\limits_{i=1}^{N}({y}_{i}{\text{log}}[{s}_{i}]+(1-{y}_{i}){\text{log}}[1-{s}_{i}]),$$
    (3)

    where si is the model’s prediction score and yi indicates whether the pair is matched (1) or not (0).

The total pretraining loss is the sum of the individual task losses:

$${{\mathcal{L}}}_{{\rm{total}}}={{\mathcal{L}}}_{{\rm{correlation}}}+{{\mathcal{L}}}_{{\rm{prediction}}}+{{\mathcal{L}}}_{{\rm{association}}}.$$
(4)
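The three pretraining objectives and their sum (equations (1)–(4)) can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: it averages over the batch where the equations write sums, and the inputs are dummy arrays standing in for bridge-model outputs:

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def correlation_loss(q, t, tau=0.07):
    """Eq. (1): contrastive graph-text alignment (InfoNCE-style).
    q, t: (N, d) graph and text embeddings; matched pairs share an index."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sim = q @ t.T / tau                      # (N, N) cosine similarities
    return -np.mean(np.diag(log_softmax(sim, axis=1)))

def prediction_loss(logits, targets):
    """Eq. (2): graph-conditioned language modelling via teacher forcing.
    logits: (T, V) per-position token logits; targets: (T,) token ids."""
    logp = log_softmax(logits, axis=-1)
    return -np.mean(logp[np.arange(len(targets)), targets])

def association_loss(scores, labels):
    """Eq. (3): binary cross-entropy for graph-text matching.
    scores: predicted match probabilities s_i; labels: y_i in {0, 1}."""
    s = np.clip(scores, 1e-7, 1 - 1e-7)      # avoid log(0)
    return -np.mean(labels * np.log(s) + (1 - labels) * np.log(1 - s))

def total_loss(q, t, logits, targets, scores, labels):
    """Eq. (4): plain sum of the three task losses."""
    return (correlation_loss(q, t)
            + prediction_loss(logits, targets)
            + association_loss(scores, labels))
```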

After pretraining, the model undergoes instruction fine tuning to optimize its performance on both descriptive and property prediction tasks. In this stage, the pretrained bridge model is integrated with the LLM to enhance multimodal learning. A fully connected layer is introduced between the bridge model’s output and the LLM’s input. The fine-tuning phase includes 12 multimodal subtasks: three material description tasks and nine property prediction tasks. Description tasks refine the model’s ability to link structural features with detailed textual explanations, whereas property prediction tasks focus on improving quantitative accuracy in material property estimation. Fine tuning is guided by a supervised cross-entropy loss defined as

$${{\mathcal{L}}}_{\mathrm{fine}\,\mathrm{tune}}=-\mathop{\sum }\limits_{i=1}^{N}\mathop{\sum }\limits_{j=1}^{T}{y}_{i,j}\log [P({y}_{i,j}|{x}_{i})],$$
(5)

where yi,j represents the ground-truth token for the jth position of the ith sample, and P(yi,j|xi) is the model’s predicted probability of the correct token given the multimodal input xi.

In the pretraining stage, the model is trained using the AdamW optimizer with a learning rate of 2 × 10−4, with a cosine decay scheduler and linear warm-up starting from 1 × 10−6. A weight decay of 0.05 is applied to regularize the model, with a batch size of 32 and gradient accumulation over five steps to manage computational efficiency. Mixed-precision training is enabled to improve performance and reduce memory usage. The model is trained for ~25 epochs, with checkpoints saved every 2,000 iterations. During the fine-tuning stage, the AdamW optimizer is again used with a learning rate of 2 × 10−4, featuring a warm-up phase to 1 × 10−4 followed by decay to 1 × 10−5. The batch size is set to 8, with gradient accumulation over 16 batches to effectively increase the batch size. Fine tuning runs for around 20 epochs, with checkpoints saved every 300 steps and at the end of each epoch. Additionally, distributed training is implemented using four A100 GPUs per node across eight nodes, leveraging the distributed data parallel strategy to enhance training efficiency and scalability. Training takes around 48 h to complete.

We have also summarized the training hyperparameters used across all baseline GCN models to ensure consistent evaluation. The SchNet model was trained using consistent hyperparameters across classification and regression tasks to ensure a fair comparison. We used the Adam optimizer with a learning rate of 1 × 10−5 and weight decay of 1 × 10−4, along with a StepLR scheduler (step size of 20, γ = 0.5). Models were trained for 50 epochs with a batch size of 16. Cross-entropy loss was used for classification and mean squared error loss was used for regression. All the CHGNet models were trained using the SchNet-style optimizer and scheduler with a learning rate of 1 × 10−5 for classification and 1 × 10−3 for regression. All models were trained for 50 epochs with a batch size of 16. Cross-entropy loss was used for classification and mean squared error loss was used for regression. These settings were applied uniformly to both pretrained and non-pretrained CHGNet variants. The MACE model was trained using consistent hyperparameters across classification and regression tasks. We used the Adam optimizer with a learning rate of 1 × 10−4 and a weight decay of 1 × 10−4, together with a StepLR scheduler (step size of 20, γ = 0.5). All models were trained for 100–200 epochs with a batch size of 256. Cross-entropy loss was used for classification tasks and the mean squared error loss was used for regression tasks.
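The pretraining learning-rate schedule (linear warm-up from 1 × 10−6 to the 2 × 10−4 peak, then cosine decay) can be written as a pure function of the training step; warmup_steps and total_steps are illustrative parameters, not values from our runs:

```python
import math

def pretrain_lr(step, warmup_steps, total_steps,
                lr_init=1e-6, lr_peak=2e-4, lr_min=0.0):
    """Linear warm-up from lr_init to lr_peak, then cosine decay to lr_min."""
    if step < warmup_steps:
        frac = step / max(1, warmup_steps)
        return lr_init + frac * (lr_peak - lr_init)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_peak - lr_min) * (1 + math.cos(math.pi * progress))
```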

Embedding visualization

The visualization leverages UMAP to reveal, in a lower-dimensional space, the chemical insights encoded in the material embeddings extracted from the bridge model. To prepare the data, each high-dimensional embedding, originally structured as (32, 4,096), is first flattened into a single vector capturing the essential features of the material. UMAP is then applied to this set of vectors with the number of components set to 2, reducing the data to two dimensions for visual interpretation, and with the random state set to 1 to ensure a consistent layout across runs.
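A minimal sketch of this preparation step, using dummy arrays in place of the real bridge-model outputs (the projection itself, shown in comments, assumes the umap-learn package):

```python
import numpy as np

# Each bridge-model embedding has shape (32, 4096); here we use 10 dummy samples.
embeddings = np.random.rand(10, 32, 4096).astype(np.float32)

# Flatten each (32, 4096) embedding into a single 131,072-dimensional vector.
flat = embeddings.reshape(len(embeddings), -1)

# The 2-D projection (umap-learn, assumed installed) would then be:
# import umap
# coords = umap.UMAP(n_components=2, random_state=1).fit_transform(flat)
```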

Structural similarity scores are computed using the smooth overlap of atomic positions (SOAP) descriptor82, combined with the regularized entropy match kernel (REMatch)83,84 to capture the structural characteristics within material embeddings. SOAP is a local atomic environment descriptor that encodes atomic geometries by expanding a Gaussian-smeared atomic density locally, using orthonormal functions derived from spherical harmonics and radial basis functions. To go from local descriptors to structure matching, we use the REMatch kernel on top of the SOAP descriptor. The REMatch kernel considers the best matching of local environments and uses an averaging strategy to enhance structural comparison. For SOAP construction, we consider periodic boundary conditions. The cut-off radius for the local region (rcut), the number of radial basis functions (nmax) and the maximum degree of spherical harmonics (lmax) are set to 6 Å, 8 and 6, respectively. For the REMatch kernel, the entropic penalty (α) is set to 1, and the convergence threshold is set to 1 × 10−6. A linear pairwise metric is used for the local similarity calculation.
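As a simplified illustration of the REMatch step, the α-regularized best-match score between two structures can be computed from a matrix of pairwise local-environment similarities via Sinkhorn iterations. This sketch only mirrors the regularized matching idea; in practice the SOAP descriptors and the REMatch kernel are computed with established descriptor libraries:

```python
import numpy as np

def rematch_score(C, alpha=1.0, tol=1e-6, max_iter=1000):
    """REMatch-style score from a (n, m) matrix C of local-environment
    similarities (e.g. linear SOAP kernels between atomic environments).
    alpha is the entropic penalty; larger alpha approaches the average kernel."""
    n, m = C.shape
    K = np.exp(C / alpha)                 # Gibbs kernel of the similarities
    u = np.ones(n) / n
    v = np.ones(m) / m
    for _ in range(max_iter):             # Sinkhorn scaling to uniform marginals
        u_new = (1.0 / n) / (K @ v)
        v_new = (1.0 / m) / (K.T @ u_new)
        converged = (np.max(np.abs(u_new - u)) < tol
                     and np.max(np.abs(v_new - v)) < tol)
        u, v = u_new, v_new
        if converged:
            break
    P = np.diag(u) @ K @ np.diag(v)       # regularized matching plan
    return float(np.sum(P * C))           # score = sum_ij P_ij * C_ij
```

For two identical structures whose environments all match perfectly (C of all ones), the score is 1.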

Baseline and RAG configurations for comparative study

We assessed MatterChat against two primary baselines: (1) a multimodal LLM using a Simple Adapter with low-rank adaptation fine tuning57,58, updating lightweight adapter layers and the Mistral 7B backbone; and (2) a pure LLM baseline fine tuned on serialized CIF content. Our bootstrapping strategy42 trains only the bridge module, avoiding extensive fine tuning of the frozen graph encoder and LLM. Ablation studies (Supplementary Tables 1–3 and Fig. 6) covered variations in query token length, cross-attention frequency and pretraining strategies. Results indicate that cross-attention every two layers and query lengths as low as eight tokens maintain a strong balance between efficiency and multimodal alignment. The RAG module utilizes a Faiss (Facebook AI Similarity Search)-based85 batched cosine similarity search over ~142,000 structural embeddings. We measured an average retrieval latency of ~12 ms per query on CPU (<3 min in total for 14,290 queries). Compared with the baseline inference time of ~1.65 s per sample, this introduces only ~0.7% additional latency.
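The retrieval step can be illustrated with a plain NumPy cosine-similarity search; Faiss's IndexFlatIP over L2-normalized vectors computes the same quantity, only faster, and the function names here are illustrative:

```python
import numpy as np

def build_index(embeddings):
    """L2-normalize the corpus so that inner product equals cosine similarity
    (the quantity faiss.IndexFlatIP returns for normalized vectors)."""
    e = np.asarray(embeddings, dtype=np.float32)
    return e / np.linalg.norm(e, axis=1, keepdims=True)

def retrieve(index, queries, k=1):
    """Batched cosine-similarity search; returns top-k corpus ids per query."""
    q = np.asarray(queries, dtype=np.float32)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    sims = q @ index.T                     # (n_queries, n_corpus) similarities
    return np.argsort(-sims, axis=1)[:, :k]
```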

Cross-dataset evaluation

We curated an external test set of ~15,000 materials from the GNoME database44 to evaluate the transferability of our model. This subset includes available density-functional-theory-computed values for bandgap, formation energy and energy above hull, providing a benchmark comparable in scale to our original test split. Distributional differences between this external set and the MPtrj training distribution52 were characterized via property histograms (Fig. 6c–e).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.