Abstract
Artificial intelligence has achieved remarkable success in materials science, accelerating novel material design. However, real-world material systems exhibit multiscale complexity—spanning composition, processing, structure, and properties—posing significant challenges for modeling. While some approaches fuse multiscale features to improve prediction, important modalities such as microstructure are often missing due to high acquisition costs. Existing methods struggle with incomplete data and lack a framework to bridge multiscale material knowledge. To address this, we propose MatMCL, a structure-guided multimodal learning framework that jointly analyzes multiscale material information and enables robust property prediction with incomplete modalities. Using a self-constructed multimodal dataset of electrospun nanofibers, we demonstrate that MatMCL improves mechanical property prediction without structural information, generates microstructures from processing parameters, and enables cross-modal retrieval. We further extend it via multi-stage learning and apply it to nanofiber-reinforced composite design. MatMCL uncovers processing-structure-property relationships, suggesting its promise as a generalizable approach for AI-driven material design.
Introduction
Artificial intelligence (AI) has had a transformative impact on the material development process due to its ability to model complex and diverse material systems1,2,3. Recently, major AI-driven breakthroughs in materials science have been achieved, including crystal structure discovery4,5,6, alloy development7,8,9, nanomaterial formulation optimization10,11,12, and polymer material design13,14,15,16,17,18, which have greatly accelerated the design of new materials.
Despite these advancements, AI still faces significant obstacles when tackling complex problems in materials science19. One key difficulty stems from the inherent complexity and hierarchical nature of materials, which are characterized by multiple scales of information and heterogeneous data types—including chemical composition, microstructure, macroscopic morphology, and spectral characteristics—that are often correlated or complementary. Consequently, capturing and integrating these multiscale features is crucial for accurately representing material systems and enhancing model generalization, yet it remains a considerable challenge for AI in material modeling. Moreover, due to the high cost and complexity of material synthesis and characterization, the amount of available data in materials science remains severely limited, creating substantial barriers to model training and reducing predictive reliability20.
To address these challenges, we propose the adoption of multimodal learning (MML) in the field of materials science. MML aims to integrate and process multiple types of data, referred to as modalities, including text, images, audio, and video, and has achieved significant success in domains such as natural language processing and computer vision21. In recent years, several studies have extended MML to material research by integrating diverse material data obtained through multiple characterization techniques22,23,24,25,26,27. These approaches enhance the model’s understanding of complex material systems and mitigate data scarcity, ultimately improving predictive performance. However, these MML approaches remain limited in two key aspects: (1) Material datasets are frequently incomplete due to experimental constraints and the high cost of acquiring certain measurements. For instance, synthesis parameters are often readily available, whereas microstructural data, such as those obtained from SEM or XRD, are more expensive and difficult to obtain, resulting in missing modalities. Conventional MML models rely on complete modality availability, and their performance deteriorates significantly when modalities are missing, thereby limiting their practical applicability28. (2) Furthermore, existing methods lack efficient cross-modal alignment and typically do not provide a systematic framework for modality transformation or mapping mechanisms. These limitations pose obstacles to the broader application of AI in materials science, particularly for complex material systems where multimodal data and incomplete characterizations are prevalent.
Recent advances in MML, such as CLIP29, ImageBind30, and DALLE-231 have demonstrated impressive cross-modal understanding capabilities. Inspired by these developments, we propose MatMCL, a versatile MML framework tailored to materials science that flexibly handles missing modalities and facilitates effective interaction and transformation across diverse and multiscale material features. As a case study to validate the effectiveness of the proposed framework, we construct a multimodal dataset of electrospun nanofibers through laboratory preparation and characterization. A geometric multimodal contrastive learning strategy, named structure-guided pre-training (SGPT), is employed to align modality-specific and fused representations. By guiding the model to capture structural features, this approach enhances representation learning and mitigates the impact of missing modalities, ultimately boosting material property prediction performance. To further enhance cross-modal understanding, MatMCL incorporates a retrieval module for knowledge extraction across modalities and a conditional generation module that enables the generation of structures according to given conditions. Additionally, we introduce a multi-stage learning strategy (MSL) to extend the applicability of the framework, as demonstrated by guiding the design of nanofiber-reinforced composites. In summary, MatMCL provides an effective framework for modeling the relationships among processing conditions, microstructure, and properties in electrospun nanofiber materials, even under limited or incomplete data. Its flexibility and robustness highlight its promise for broader applications in multiscale, multimodal materials modeling.
Results
In this work, we propose a versatile multimodal learning framework for material design, referred to as MatMCL. We evaluate MatMCL using a multimodal dataset of electrospun nanofibers. The framework includes four modules: (1) A structure-guided pre-training module that guides the model to learn structural knowledge through self-supervised learning; and three downstream modules for (2) property prediction under missing structure, (3) cross-modal retrieval, and (4) conditional structure generation.
Multimodal dataset construction
To verify the feasibility of MatMCL, we first construct a multimodal benchmark dataset through laboratory preparation and characterization. Nanofibers have received extensive attention due to their high surface area, high porosity, and considerable mechanical strength32. Electrospinning is one of the most widely used methods for fabricating nanofibers. The morphology and arrangement of fibers can be flexibly regulated by adjusting the processing conditions, thus exhibiting multimodal features33. Therefore, we use electrospun nanofibers to create a benchmark dataset as a case study.
During the preparation process, we controlled the morphology and arrangement of the nanofibers by adjusting various combinations of flow rate, concentration, voltage, rotation speed, and ambient temperature and humidity (Fig. 1a). The microstructure was characterized using scanning electron microscopy (SEM). Subsequently, we tested the mechanical properties of the electrospun films in both the longitudinal and transverse directions using tensile tests, measuring fracture strength, yield strength, elastic modulus, tangent modulus, and fracture elongation. To facilitate representation, a binary indicator was added to the processing conditions to specify the tensile direction. For more details on dataset construction, refer to the Methods section.
a Multimodal dataset preparation. Nanofibers were fabricated via electrospinning with tunable fiber structures by adjusting environmental, solution, and electrospinning variables. The structure of each material was characterized using SEM. Tensile tests were conducted to evaluate the mechanical properties in both transverse and longitudinal directions. Each data point includes processing conditions, microstructure, and mechanical properties. b Structure-guided pre-training. The processing conditions, nanofiber structures, and fused inputs are encoded by the table encoder, vision encoder, and multimodal fusion encoder, respectively, and projected into a shared latent space. These modules are trained jointly using a contrastive loss that aligns representations from the same material as positive pairs and treats those from different materials as negatives. c Downstream applications. After pre-training, the pre-trained weights are loaded to support three downstream tasks. For property prediction, a prediction head is added to predict mechanical properties. For cross-modal retrieval, the pre-trained encoders are used to embed both queries (processing conditions or microstructures) and gallery samples into a shared latent space for similarity-based retrieval. For conditional structure generation, a prior model maps condition embeddings to structure embeddings, which are then decoded into microstructures using a diffusion-based decoder. d Symbols and color codes used in the figure.
Structure-guided pre-training
In this study, we propose the use of multimodal learning to capture the complex, multi-level features of materials and alleviate the issue of data scarcity. However, as mentioned above, existing approaches may suffer from significant limitations in cross-modal understanding and are not designed to handle missing data, which is a common challenge in materials science.
Inspired by contrastive learning and multimodal learning in natural language processing29,34 and computer vision35, we propose a structure-guided pre-training (SGPT) strategy to align processing and structural modalities via a fused material representation (Fig. 1b). A table encoder models the nonlinear effects of processing parameters known to influence fiber formation in electrospinning. A vision encoder learns rich microstructural features of materials directly from raw SEM images in an end-to-end manner, capturing complex morphologies such as fiber alignment, diameter distribution, and porosity. A multimodal encoder integrates processing and structural information to construct a fused embedding representing the material system. For each sample, this fused representation serves as the anchor in contrastive learning, which is aligned with its corresponding unimodal embeddings (processing conditions and structures) as positive pairs, while embeddings from other samples serve as negatives. All embeddings are projected into a joint latent space via a projector head. This approach provides an efficient mechanism for improving the model’s robustness during inference, especially in scenarios where certain modalities (e.g., microstructural images) are missing. More importantly, SGPT enables the model to uncover potential correlations between the multiscale information of materials, thereby facilitating various downstream tasks (Fig. 1c).
Specifically, given a batch containing N samples, the processing conditions \({\{{{\bf{x}}}_{i}^{{\rm{t}}}\}}_{i=1}^{N}\), microstructure \({\{{{\bf{x}}}_{i}^{{\rm{v}}}\}}_{i=1}^{N}\), and fused inputs \({\{{{\bf{x}}}_{i}^{{\rm{t}}},{{\bf{x}}}_{i}^{{\rm{v}}}\}}_{i=1}^{N}\) are processed by a table encoder \({f}_{{\rm{t}}}(\cdot )\), a vision encoder \({f}_{{\rm{v}}}(\cdot )\), and a multimodal encoder \({f}_{{\rm{m}}}(\cdot )\), respectively, resulting in \({\{{{\bf{h}}}_{i}^{{\rm{t}}}\}}_{i=1}^{N},{\{{{\bf{h}}}_{i}^{{\rm{v}}}\}}_{i=1}^{N},{\{{{\bf{h}}}_{i}^{{\rm{m}}}\}}_{i=1}^{N}\), which correspond to the learned representations from the table, vision, and multimodal encoders. Next, a shared projector \(g(\cdot )\) is employed to map the encoded representations into a joint space for multimodal contrastive learning, resulting in three sets of representations \({\{{{\bf{z}}}_{i}^{{\rm{t}}}\}}_{i=1}^{N},{\{{{\bf{z}}}_{i}^{{\rm{v}}}\}}_{i=1}^{N},{\{{{\bf{z}}}_{i}^{{\rm{m}}}\}}_{i=1}^{N}\), corresponding to the table, vision, and multimodal modalities, respectively. The fused representations \({\{{{\bf{z}}}_{i}^{{\rm{m}}}\}}_{i=1}^{N}\) are used as anchors to align information from other modalities. To construct contrastive pairs, embeddings derived from the same material, such as \({{\bf{z}}}_{i}^{{\rm{t}}}\) and \({{\bf{z}}}_{i}^{{\rm{m}}}\), are treated as positive pairs, while the remaining embeddings are considered negative pairs. A contrastive loss is applied to these latent vectors to jointly train the encoders and projector by maximizing the agreement between positive pairs while minimizing it for negative pairs. More details on SGPT can be found in the Methods section.
To illustrate the generality of MatMCL, we implemented two network architectures in this work. The first utilizes a Multilayer Perceptron (MLP) to extract features of processing conditions and a Convolutional Neural Network (CNN) to extract microstructural features (Supplementary Fig. S1). The features from each modality are then concatenated to obtain a fused multimodal representation. Additionally, we implemented a Transformer-based MatMCL model, where an FT-Transformer36 serves as the table encoder and a Vision Transformer (ViT)37 acts as the vision encoder (Supplementary Fig. S2–S4). We incorporated a multimodal Transformer with cross-attention38 to effectively capture interactions between processing conditions and structures, further improving multimodal representation learning. A consistent decrease in multimodal contrastive loss is observed during training (Supplementary Fig. S5), indicating that the model progressively learns the underlying correlations between processing conditions and microstructures of nanofibers.
Mechanical property prediction
After SGPT, we first utilized the joint latent space for property prediction. In this stage, the pre-trained encoder and projector were loaded and kept frozen, while a trainable multi-task predictor was added to predict mechanical properties (Fig. 2a). Notably, while multimodal interactions enable models to capture multiscale information of materials during training, certain modalities such as spectroscopic and structural characterization data are often missing at inference time due to experimental constraints and acquisition costs. In the case of nanofiber materials, processing conditions are readily accessible and can even be arbitrarily selected within a certain range without the need to fabricate the corresponding materials. However, the acquisition of microstructures is a laborious process, requiring the preparation of materials and subsequent characterization using expensive imaging techniques. Therefore, model performance with missing structures is more important for virtual screening.
a Model training process. The table encoder and projector are loaded from the structure-guided pre-training stage and remain frozen during this phase. A new multi-task predictor head is added to simultaneously predict five mechanical properties. Modules marked with a snowflake symbol are frozen, while those marked with a flame symbol are trainable. b, c Test performance of MatMCL on five mechanical property prediction tasks using MLP (CNN) and FT-Transformer (ViT). Higher R2 and lower RMSE values indicate better model performance. d Similarity matrix of learned embeddings. Diagonal elements represent pairs of embeddings derived from the same material; off-diagonal elements correspond to different materials. e Alignment evaluation of learned embeddings. A higher IPR value indicates better alignment. f The distribution of the material representations. Material representations were visualized using t-SNE, and each point was colored according to its structural features. The Pearson correlation between the generated representations and structural features is shown in each subplot. g Test performance of the generated representations using KNN regression.
We evaluated the test performance of MatMCL, implemented with both architectures, without prior exposure to the structural information of the test samples. MatMCL (fusion) consistently achieves the highest performance across all mechanical properties, as expected (Fig. 2b, c). This result reinforces the idea that the processing conditions and microstructure of nanofiber materials contain complementary information, thereby enhancing feature representations. Moreover, compared to conventional models without SGPT, MatMCL (conditions) achieves a significant improvement in R2 and root-mean-square error (RMSE) for most mechanical properties on the test set. This finding suggests that the proposed pre-training strategy enables the model to learn higher-quality representations with stronger generalization capabilities. Additionally, as anticipated, MatMCL (conditions) achieves performance close, or even equivalent, to that of MatMCL (fusion) for most mechanical properties, further validating the effectiveness of the learned representations. We also conducted an ablation study (Supplementary Fig. S6), which confirms the importance of the multimodal encoder in improving prediction performance.
To validate the alignment between the processing conditions of nanofibers and the fused information, we computed the cosine similarity between the condition embeddings and the fused embeddings for all samples. The similarity matrix of MatMCL exhibits a distinct diagonal pattern, whereas the similarity matrices of models without alignment do not (Fig. 2d). This indicates that embeddings from the same material exhibit significantly higher similarity than those from different materials. We further quantified the alignment of the model using Improved Precision and Recall (IPR)39, a metric designed to assess the geometric and topological properties of an evaluation set of representations in comparison to a reference set. A higher IPR value indicates greater similarity in both global and local structures between the two sets, i.e., better alignment. As shown in Fig. 2e, MatMCL exhibits strong alignment in both conditions and structures with multimodal representations, which is consistent with the training objective. Notably, although no explicit constraint is imposed between conditions and structures during training, an indirect correlation emerges through the fused representations as a medium. In addition, an interpretability analysis using SHapley Additive exPlanations (SHAP) and Grad-CAM was also performed (Supplementary Fig. S7–S12).
We further analyzed the latent space to investigate why MatMCL achieves superior performance in predicting mechanical properties. We visualized the material representations generated by different models using t-SNE (Fig. 2f). Each sample point was colored based on three manually defined structural features, namely degree of orientation, nanofiber diameter, and pore size, extracted from the microstructure (Supplementary Table S1). Pearson correlation coefficients between representations and structural features were calculated. The results indicate that the original processing conditions exhibit the weakest correlation with structural features. In contrast, MatMCL (fusion) achieves the highest correlation due to the integration of microstructural information. Interestingly, MatMCL (conditions) shows a correlation comparable to that of MatMCL (fusion). Moreover, we employed the K-nearest neighbors (KNN) algorithm to quantify this correlation, following the approach of previous studies40. Specifically, 80% of the dataset was used to train a KNN regression model to predict the structural features of materials from the representations generated by different models, while the remaining 20% was used to evaluate performance. The predictive capability of KNN relies on local similarity in the feature space, so higher performance indicates that similar representations correspond to similar microstructures. As shown in Fig. 2g, the representations of MatMCL (fusion) achieve the highest performance, followed by MatMCL (conditions), while the original processing conditions exhibit the worst performance. These results show that MatMCL (conditions) captures information related to the structural features of nanofibers better than the original processing conditions do. Therefore, we conclude that the proposed method guides the model to learn structure-related information, thereby enhancing its ability to predict mechanical properties.
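As a concrete illustration of this probe, the KNN evaluation can be written in a few lines of scikit-learn. This is a minimal sketch under the 80/20 protocol described above; the function and variable names are illustrative, not taken from the MatMCL codebase.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

def knn_probe(reps, feats, k=5):
    """reps: (N, d) learned representations; feats: (N, 3) structural
    features (orientation degree, fiber diameter, pore size)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        reps, feats, test_size=0.2, random_state=0)
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    # High R2 means nearby representations share similar microstructures.
    return r2_score(y_te, knn.predict(X_te))
```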
Cross-modal retrieval
After achieving property prediction, establishing an efficient retrieval mechanism can also help researchers optimize processing conditions and analyze structural characteristics. Furthermore, with advancements in experimental and computational methods, the variety and amount of material data are rapidly increasing. As a result, cross-modal retrieval is essential for the construction and application of large-scale multimodal materials databases41. Traditional retrieval methods primarily rely on numerical indexing or similarity matching, which limits the understanding and generalization of complex material systems. In the case of nanofibers presented in this study, our goal is to retrieve the corresponding microstructure given specific conditions, or to reverse-identify the conditions associated with a given microstructure.
Figure 2e demonstrates that, by binding each modality to the fused anchor, the condition and structure embeddings of MatMCL are well aligned. Therefore, similar to the implementation of CLIP, we utilized the trained table encoder \({f}_{{\rm{t}}}(\cdot )\) and the vision encoder \({f}_{{\rm{v}}}(\cdot )\) to encode the query and gallery data into a joint space. The cosine similarity between the query and each sample in the gallery is calculated and ranked, with the top-ranked samples serving as the retrieval results, which is consistent with the objective of pre-training (Fig. 3a, b).
To better evaluate retrieval performance, we employed random retrieval and a similarity-based method as baselines. For the similarity-based method, given a condition \({\bf{q}}\), we identify the closest condition \({{\bf{q}}}^{\prime}\) in the training data based on Euclidean distance42. The corresponding structure \({{\bf{p}}}^{\prime}\) of \({{\bf{q}}}^{\prime}\) is then used to retrieve the most similar structure \({\bf{p}}\) from the gallery, which serves as the final retrieval result. The data used for retrieval were held out from training. MatMCL achieves significantly higher retrieval accuracy than the baselines, while the specific model architecture has little impact on the results (Fig. 3c, d). Therefore, the MLP (CNN) architecture is adopted in subsequent experiments for computational efficiency. It is important to note that there is not a strict one-to-one correspondence between the processing conditions and structures of nanofibers; that is, very similar structures may be obtained from markedly different conditions. We present six retrieval examples, in which most cases successfully identify the true matching samples within the top-ranked retrieved results (highlighted by the blue line) (Fig. 3e, f). In the first and third cases of structure retrieval, although the top-1 result is not the exact ground-truth match, its structural features exhibit high visual similarity to the actual matching structure. These findings suggest that the model can effectively capture key structural patterns of nanofibers and retrieve samples with high similarity to the target structure. In conclusion, MatMCL demonstrates a strong capability to uncover complex correlations within multimodal materials data, highlighting its potential in advancing materials understanding and design.
Conditional structure generation
Although cross-modal retrieval enables researchers to quickly locate material samples that meet specific criteria within existing datasets, it is inherently limited by the coverage of the database. When relevant experimental data is absent, retrieval fails to yield meaningful results. Thus, cross-modal generation holds promise for overcoming this limitation. In this study, we aim to generate possible microstructures directly based on the given processing conditions.
Inspired by the architecture of DALLE-2, a novel text-to-image generation model, we developed a cross-modal generation module for materials science based on a prior model and a decoder. In this pipeline, we first loaded the pre-trained encoder from SGPT to encode processing conditions and structures into the joint latent space (Fig. 4a). Next, a prior model was trained to map these condition embeddings \({{\bf{z}}}_{i}^{{\rm{t}}}\) to structure embeddings \({{\bf{z}}}_{i}^{{\rm{v}}}\), bridging the distribution gap between the two modalities. Separately, a diffusion-based decoder was trained to generate microstructures from structure embeddings \({{\bf{z}}}_{i}^{{\rm{v}}}\)43,44. During inference, the generation process began with random noise and progressively denoised it under the guidance of structure embeddings predicted from the given processing conditions, ultimately producing the target microstructure. Notably, the prior model and decoder were trained independently.
a Training process of the generation module. A prior model was trained to map condition embeddings to structure embeddings. A diffusion decoder was trained to generate structures by denoising random noise, guided by the predicted structure embeddings. Additionally, a proxy model with a trainable prediction head is introduced to estimate structural features from processing conditions for inverse design. The prior, decoder, and proxy model are trained independently, without a strict sequential order. b Distributional differences between real and generated structures. Diagonal elements represent the FID between real and generated structures of the same material, while other elements correspond to different materials. A lower FID value indicates a more similar distribution. c Plots of real versus generated structural features. d Analysis of instances. The displayed real structures are randomly selected regions from SEM images. Processing conditions and structural features are normalized using min-max scaling. e Material inverse design process. The processing conditions are optimized using the gradient descent algorithm to reach the target structural features. The initial condition state is randomly generated. f Time-varying sequences of processing conditions, structural features, and generated structures. The red dashed lines indicate the target structural features.
To evaluate the quality of the generated structures and their alignment with the given conditions, we employed the Fréchet Inception Distance (FID), a commonly used metric for measuring the distributional difference between generated and real images45. A lower FID value indicates a higher similarity between distributions. For each condition in the dataset, ten structures were generated, and their features were extracted using the trained vision encoder \({f}_{{\rm{v}}}(\cdot )\). The FID between the real and generated structures was then computed (Fig. 4b). For conditions seen by the model, the FID values along the diagonal of the heatmap are significantly lower than those in the surrounding areas. This result indicates that the given prompts effectively guide the model in generating the corresponding structures, and the generated results are highly consistent with real structures. For unseen conditions, the FID values remain low despite the increased challenge, suggesting that the model exhibits strong generalization capabilities even for previously unencountered conditions. We then computed the structural features (orientation degree, fiber diameter, and pore size) to more intuitively assess the reliability of the generated structures (Fig. 4c). The generated structures exhibit a strong correlation with the real structures in all three features, regardless of whether the conditions were seen or unseen. We also present eight sets of unseen instances (Fig. 4d). In each set, we displayed the input conditions, the corresponding real structure, and the generated structure along with its features. It can be observed that the generated structures exhibit similar features to the real ones, making them nearly indistinguishable to the naked eye. Some generated structures exhibit slightly curved or thick fibers, which are also observed in the training data due to experimental effects such as jet splitting46, hygroscopicity47, and local concentration variations48,49. These deviations may also be partly influenced by the absence of explicit physics priors in the generative model. These results indicate that MatMCL not only accurately learns the distribution of seen data but also infers and generates structures that conform to physical and statistical principles under specified unseen conditions.
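For reference, the FID reduces to the Fréchet distance between Gaussian fits of two feature sets. The sketch below, assuming features have already been extracted by the vision encoder, follows the standard formula; the names are illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to (n, d) feature arrays."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerics
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))
```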
To generate structures with specified features and enable inverse material design, we further developed a gradient-based material structure optimization process. To improve optimization efficiency, we first trained a proxy to rapidly predict the three structural features corresponding to given processing conditions. Researchers can specify desired structural features (in this example, an orientation degree of 0.75, a fiber diameter of 0.15 μm, and a pore size of 0.48 μm) and then employ the gradient descent algorithm to optimize processing conditions based on the proxy (Fig. 4e). Finally, the optimized conditions and the generated structure are returned. As the conditions are optimized, the material structure progressively approaches the target features (Fig. 4f). The generated structure changes from being randomly oriented to exhibiting a distinct orientation. To further demonstrate the flexibility of the inverse design pipeline, we performed a mechanical property-driven variant under fixed environmental constraints (temperature and humidity), optimizing conditions and generating corresponding structures that satisfy the target mechanical properties (Supplementary Fig. S7). These results demonstrate a strong capability for cross-modal understanding and structure generation, supporting intelligent material design beyond database limitations.
Nanofiber-reinforced composite material design
Nanofiber-reinforced composites exhibit excellent mechanical properties, good environmental resistance, and tunable functionality, making them highly promising for applications in biomedical engineering, flexible electronics, and other fields50. However, the fabrication of composite materials involves numerous process parameters, complex manufacturing routes, and high experimental costs with long verification cycles, significantly limiting the rapid development and practical application of novel composites. Moreover, the scarcity of high-quality experimental data further increases the difficulty of data-driven material design. After successfully developing the functional modules of MatMCL, we applied it to the design of nanofiber-reinforced composites to further validate its powerful capabilities and broad applicability.
As depicted in Fig. 5a, electrospun nanofibers were immersed in a silicone-based polyurethane solution and subsequently dried to obtain nanofiber-reinforced composite films. Due to the laborious preparation process, we only collected 40 samples, posing a considerable obstacle for AI modeling. To address the issue of extreme data scarcity in composite materials, we employ a multi-stage learning (MSL) strategy51. As demonstrated in Fig. 2f, g, the table encoder \({f}_{{\rm{t}}}(\cdot )\) of MatMCL acquires structural knowledge of nanofibers through SGPT, which is essential for understanding material properties. The pre-trained encoder is then fine-tuned to learn the mechanical properties of nanofibers. Notably, during composite fabrication, the morphology of the nanofibers remains largely unchanged; ideally, only the air among fibers is replaced by the polymer matrix. Therefore, we hypothesize that the learned structural and mechanical properties of nanofibers can also contribute to predicting the mechanical properties of composites. Finally, the model was fine-tuned using composite material data to adapt it to the new task (Fig. 5b). The performance of MSL was evaluated on the test set (Fig. 5c, d). Models trained from random initialization fail to converge due to the extreme scarcity of data, making them impractical for real-world applications. In contrast, MSL achieves a much lower prediction error compared to other baselines, demonstrating its ability to generalize well despite the limited dataset. MSL simulates the hierarchical evolution of composite materials from fabrication to structure and ultimately to properties, by progressively learning the structural features of nanofibers, their mechanical properties, and the mechanical properties of the resulting composites, thereby enhancing its predictive performance on composite materials. Moreover, the ablation study in Fig. 5c, d further confirms the importance of each pre-training stage, as removing any stage leads to a noticeable drop in performance.
a The fabrication process of nanofiber-reinforced composite materials. b Multi-stage learning for composite material prediction. The model originally trained for predicting the mechanical properties of nanofibers was further fine-tuned using a layer-wise learning rate decay strategy to predict the mechanical properties of nanofiber-reinforced composites. c Test performance of different training strategies. Lower RMSE indicates better performance. d Plots of measured versus predicted tensile strength. The dashed line represents y = x. The gray zone represents predictions with absolute errors in the range of [−5, 5]. Color blocks represent training stages. CP, FP, and S denote training with composite material data, nanofiber data, and microstructure data, respectively. e, f Macroscopic and microscopic morphology of nanofibers and composites. g Longitudinal and transverse stress-strain curves of nanofibers and composites. Each experiment was independently repeated three times. h, i Measured and predicted fracture strength of nanofibers and composites.
Finally, we employed the trained model to screen composite materials. To evaluate its reliability, we selected two sets of randomly generated conditions for validation. Additionally, we identified three sets of conditions predicted to possess high mechanical anisotropy, which holds potential for applications in tissue engineering52. We further fabricated and characterized these five samples, with one representative high-anisotropy instance shown here (Fig. 5e). The gaps among the nanofibers were completely filled with the polymer matrix, resulting in a smooth surface with no exposed fibers (Fig. 5f). The fracture strength in the longitudinal and transverse directions displayed pronounced anisotropy (Fig. 5g). Notably, the predicted fracture strengths for both the nanofibers and the composite materials closely match the measured values (Fig. 5h). Furthermore, the prediction errors for all samples remained within a narrow range, confirming the model's high accuracy and robustness (Fig. 5i). These findings indicate that the proposed method can effectively guide the design of nanofibers and nanofiber-reinforced composites. In summary, MatMCL has strong scalability and can be extended to more complex material systems.
Discussion
In this study, we focused on electrospun nanofiber materials as a representative system to address a critical challenge in AI-driven materials design: the integration of multiscale and multimodal information under data scarcity and missing modalities. To this end, we proposed MatMCL, a versatile framework that leverages multimodal contrastive learning to bridge diverse material knowledge sources. Rather than relying on complete inputs, MatMCL demonstrates strong robustness to missing modalities, while enabling accurate property prediction, flexible cross-modal retrieval, and structure generation. Notably, MatMCL is capable of aligning an arbitrary number of modalities, not limited to just two, indicating strong scalability and adaptability.
Our current evaluation is limited to electrospun nanofibers, which were selected for their relatively facile fabrication process, tunable morphology, and inherent multimodal characteristics. Nevertheless, MatMCL is designed to be material-agnostic and future efforts will focus on extending and validating the framework across a broader range of material systems. Notably, multimodal data are pervasive in the field of materials science, encompassing chemical composition, processing conditions, structures, and spectroscopic features. This makes MatMCL a promising foundation for AI-driven material discovery pipelines, particularly suited to handling missing modalities and modeling the multi-level correlations inherent in complex material systems.
Methods
Materials
Nylon-6 (PA 6) was purchased from Macklin Inc. The silicone-based polyurethane (Si content: 20 wt%, Shore A hardness: 80) was sourced from Dongguan Fulin Plastic Materials Co., Ltd. (China). N,N-Dimethylformamide (DMF), formic acid (FA, ≥98 wt%), and acetic acid (AcOH, ≥98 wt%) were acquired from Sigma-Aldrich. All reagents were of analytical grade and were used as received without further purification.
Preparation of electrospun nanofibers
First, PA 6 was added to a closed vessel containing a mixture of FA/AcOH (mass fraction ratio 2:3). The mixture was stirred at 300 rpm for 12 h at 35 °C to obtain a PA 6 spinning solution (named spinning solution I) at a concentration of x1. Subsequently, spinning solution I was loaded into a 5 mL syringe. Finally, electrospinning was carried out on a horizontal electrospinning device (Changsha Nayi Instrument Technology Co., Ltd., China, JDF-05). The applied voltage was adjusted to x2 kV, the working distance was 15–20 cm, the drum rotational speed was x3 rpm, the nozzle traversed horizontally over 20 cm, the solution flow rate was x4 mL/h, and the temperature x5 and humidity x6 were recorded. After 3 h, the electrospun PA 6 nanofiber membranes were removed from the drum sequentially and dried in a vacuum oven for 12 h to remove residual solvent for subsequent experiments. We selected a total of 235 sets of processing conditions \({\{{{\bf{x}}}_{i}^{{\rm{t}}}=({x}_{i,1},{x}_{i,2},\ldots ,{x}_{i,6})\}}_{i=1}^{N}\).
Preparation of nanofibers-reinforced composites
The composite membrane was fabricated by initially filling a 6 cm × 6 cm square quartz cell with 2.5 mL of silicone-based polyurethane solution (50 mg/mL). A 6 cm × 6 cm electrospun membrane was subsequently carefully placed onto the liquid surface to ensure uniform coverage, followed by the gradual addition of another 2.5 mL of polyurethane solution (50 mg/mL) to fully encapsulate the electrospun layer. The entire assembly was then transferred to a convection oven at 70 °C, where it underwent hot-air drying for 12 h to achieve complete solidification and structural integration.
Dataset construction
Processing conditions
Continuous processing conditions were standardized using a Standard Scaler, while the binary direction indicator was left unchanged.
Structure characterization
The structure of the electrospun nanofibers was examined using scanning electron microscopy (SEM, SU5000, Hitachi). For each sample, two arbitrary locations were selected, from which square pieces measuring 5 mm × 5 mm were carefully cut. SEM imaging was conducted at a magnification of 1000×, and images were captured at a resolution of 2560 × 1920 pixels. At least four images were acquired per sample to ensure representative structural features. Each SEM image was further divided into 15 non-overlapping patches of 512 × 512 pixels for subsequent analysis. To reduce bias caused by differences in brightness and contrast during SEM imaging, all cropped image patches were normalized via a histogram-matching algorithm.
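A minimal sketch of this preprocessing, assuming scikit-image and a 2560 × 1920 input (a 5 × 3 grid of 512 × 512 patches, i.e., 15 per image), might look like the following; the reference image used for histogram matching is a free choice.

```python
import numpy as np
from skimage import io, exposure

def sem_patches(path, ref_image, size=512):
    """Crop non-overlapping size x size patches from one SEM image and
    normalize each patch against a reference via histogram matching."""
    img = io.imread(path, as_gray=True)
    patches = []
    for r in range(img.shape[0] // size):        # 3 rows for 1920 px
        for c in range(img.shape[1] // size):    # 5 columns for 2560 px
            patch = img[r * size:(r + 1) * size, c * size:(c + 1) * size]
            patches.append(exposure.match_histograms(patch, ref_image))
    return np.stack(patches)                     # (15, 512, 512)
```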
Data annotation
The mechanical properties of the electrospun nanofibers were measured using a universal testing machine (UTM2102, Shenzhen Suns Technology Stock, Shenzhen, China). Samples were cut into rectangular strips with dimensions of 3 cm in length and 1 cm in width. Each sample was tested at a stretch rate of 20 mm/min. From the stress-strain curves, five mechanical properties were extracted. The fracture strength was defined as the maximum stress recorded during the tensile test. Elongation at break was calculated as the strain corresponding to this maximum stress. The elastic modulus was determined from the initial linear portion of the curve by calculating the slope at 10% strain. The yield strength was identified as the stress at the intersection between the stress-strain curve and a line offset by 1% strain from the initial elastic modulus. The tangent modulus was defined by performing a linear fit to the stress-strain curve in the region after yielding. For each film, tensile tests were conducted in both the longitudinal and transverse directions, with three replicates performed in each direction. The average of the three measurements was reported as the representative mechanical properties for each direction. The structure of the dataset is shown in Supplementary Table S2.
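These property definitions translate directly into a short NumPy routine. The sketch below treats the exact fitting windows and the offset-line intersection search as assumptions; strain is in % and stress in MPa.

```python
import numpy as np

def extract_properties(strain, stress):
    """Extract the five mechanical properties from one stress-strain curve."""
    i_max = int(np.argmax(stress))
    fracture_strength = stress[i_max]            # maximum recorded stress
    elongation = strain[i_max]                   # strain at maximum stress
    lin = strain <= 10.0                         # initial linear portion
    E = np.polyfit(strain[lin], stress[lin], 1)[0]     # elastic modulus
    offset = E * (strain - 1.0)                  # line offset by 1% strain
    i_yield = int(np.argmin(np.abs(stress - offset)))  # curve/offset crossing
    yield_strength = stress[i_yield]
    post = (strain >= strain[i_yield]) & (strain <= strain[i_max])
    tangent_modulus = np.polyfit(strain[post], stress[post], 1)[0]
    return fracture_strength, elongation, E, yield_strength, tangent_modulus
```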
Structure-guided pre-training (SGPT)
SGPT based on multimodal contrastive learning was implemented to align modality-specific and fused representations. Given a mini-batch of N samples, each sample includes processing parameters \({{\bf{x}}}^{{\rm{t}}}\) and a microstructure image \({{\bf{x}}}^{{\rm{v}}}\). These inputs are encoded as:
$${{\bf{h}}}^{{\rm{t}}}={f}_{{\rm{t}}}({{\bf{x}}}^{{\rm{t}}}),\quad {{\bf{h}}}^{{\rm{v}}}={f}_{{\rm{v}}}({{\bf{x}}}^{{\rm{v}}}),\quad {{\bf{h}}}^{{\rm{m}}}={f}_{{\rm{m}}}({{\bf{x}}}^{{\rm{t}}},{{\bf{x}}}^{{\rm{v}}})$$
where \({f}_{{\rm{t}}}(\cdot ),{f}_{{\rm{v}}}(\cdot ),{f}_{{\rm{m}}}(\cdot )\) denote the table encoder, vision encoder, and multimodal encoder, respectively. All encoded vectors are projected to a joint latent space via a shared projection head \(g(\cdot )\):
$${{\bf{z}}}^{{\rm{s}}}=g({{\bf{h}}}^{{\rm{s}}}),\quad {\rm{s}}\in \{{\rm{t}},{\rm{v}},{\rm{m}}\}$$
The fused representation \({{\bf{z}}}^{{\rm{m}}}\) serves as the anchor for contrastive learning. For each modality, positive pairs are formed with the anchor of the same sample, while embeddings from other samples in the batch are treated as negatives. The contrastive loss for a single positive pair \(({{\bf{z}}}_{i}^{{\rm{s}}},{{\bf{z}}}_{i}^{{\rm{m}}})\), where \({\rm{s}}\in \{{\rm{t}},{\rm{v}}\}\), is defined as34:
$$\ell ({{\bf{z}}}_{i}^{{\rm{s}}},{{\bf{z}}}_{i}^{{\rm{m}}})=-\log \frac{\exp ({\rm{sim}}({{\bf{z}}}_{i}^{{\rm{s}}},{{\bf{z}}}_{i}^{{\rm{m}}})/\tau )}{{\sum }_{j=1}^{N}\exp ({\rm{sim}}({{\bf{z}}}_{i}^{{\rm{s}}},{{\bf{z}}}_{j}^{{\rm{m}}})/\tau )}$$
where \({\rm{sim}}(\cdot ,\cdot )\) denotes cosine similarity and τ is the temperature hyperparameter. The total loss is computed by summing over all samples and modalities:
$${\mathcal{L}}=\mathop{\sum }\limits_{i=1}^{N}\sum _{{\rm{s}}\in \{{\rm{t}},{\rm{v}}\}}\ell ({{\bf{z}}}_{i}^{{\rm{s}}},{{\bf{z}}}_{i}^{{\rm{m}}})$$
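In PyTorch, this loss is a cross-entropy over batch-wise similarity logits. The sketch below assumes z_t, z_v, and z_m are the projected (N, d) embeddings defined above.

```python
import torch
import torch.nn.functional as F

def sgpt_contrastive_loss(z_t, z_v, z_m, tau=0.1):
    """Anchor the fused embeddings z_m; pull the matching condition (z_t)
    and structure (z_v) embeddings toward them, push the rest apart."""
    loss = 0.0
    anchor = F.normalize(z_m, dim=1)
    for z_s in (z_t, z_v):
        logits = F.normalize(z_s, dim=1) @ anchor.T / tau       # cosine / tau
        targets = torch.arange(z_s.size(0), device=z_s.device)  # diagonal positives
        loss = loss + F.cross_entropy(logits, targets)
    return loss
```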
Two architectures were implemented for encoding and fusion:
Convolutional style
A 5-layer multilayer perceptron (MLP) with ReLU activation was used to encode the tabular processing conditions, and a ResNet-50 model was used to extract microstructure features53. The multimodal fusion encoder concatenates the features extracted by the MLP and ResNet-50.
Transformer style
The tabular encoder is implemented using an FT-Transformer, and the image encoder is a Vision Transformer (ViT). The two modality-specific embeddings are fused via a multimodal Transformer with cross-attention layers.
All encoder modules output 128-dimensional feature vectors and apply a dropout rate of 0.1. The model was pre-trained using the Adam optimizer with a learning rate of 1 × 10−4 for 200 epochs. The implementation was based on PyTorch, and all experiments were conducted on an NVIDIA RTX 4090 GPU.
Mechanical property prediction
Following structure-guided pre-training, the tabular encoder was frozen, and a randomly initialized multi-task predictor was added for mechanical property prediction. Both the Convolutional style and Transformer style architectures were evaluated. The dataset was randomly divided into training, validation, and test sets using a 7:1.5:1.5 split. Model training was performed using the Adam optimizer with a learning rate of 5 × 10−4, weight decay of 1 × 10−4, and a batch size of 32. Early stopping with a patience of 20 epochs was applied. All hyperparameters were selected via grid search. The loss function was defined as the mean squared error (MSE) summed across all predicted properties.
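A minimal sketch of this stage is shown below; `table_encoder` and `projector` stand for the pre-trained SGPT modules (names hypothetical), and the five outputs correspond to the five mechanical properties.

```python
import torch
import torch.nn as nn

def build_predictor(table_encoder, projector, dim=128, n_tasks=5):
    for module in (table_encoder, projector):
        for p in module.parameters():
            p.requires_grad = False              # freeze pre-trained weights
    head = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, n_tasks))
    opt = torch.optim.Adam(head.parameters(), lr=5e-4, weight_decay=1e-4)

    def step(x_t, y):
        with torch.no_grad():                    # frozen feature extraction
            z = projector(table_encoder(x_t))
        # MSE averaged over the batch, summed across the five properties
        loss = nn.functional.mse_loss(head(z), y, reduction="none").mean(0).sum()
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    return head, step
```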
Cross-modal retrieval
Cross-modal retrieval was performed using the pre-trained table and vision encoders. Given a query from one modality (either processing conditions or microstructure image), both the query and all gallery samples were encoded into the joint latent space using the respective encoders. Cosine similarity was computed between the query embedding and all gallery embeddings. Samples in the gallery were then ranked based on similarity scores, and the top-k ranked results were returned as retrieval outputs. Retrieval was performed in both directions: from conditions to microstructure, and from microstructure to conditions.
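Concretely, the ranking step amounts to a single cosine-similarity computation and a top-k lookup; the helper below is a sketch with illustrative names.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, gallery_embs, k=5):
    """query_emb: (d,); gallery_embs: (M, d); returns top-k indices and scores."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), gallery_embs, dim=1)
    topk = torch.topk(sims, k)
    return topk.indices, topk.values

# Conditions -> structures (the reverse direction swaps the encoders):
# q = projector(table_encoder(x_t))            # encode query conditions
# gallery = projector(vision_encoder(images))  # encode gallery structures
# idx, scores = retrieve(q, gallery, k=5)
```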
Conditional structure generation
To generate microstructure images from processing conditions, we implemented a two-stage cross-modal generative pipeline based on latent diffusion modeling. The system consists of a diffusion prior module that maps condition embeddings to structure embeddings, followed by a decoder that translates the predicted latent embedding into a full-resolution microstructure image.
Diffusion prior
Given the condition inputs \({{\bf{x}}}_{{\rm{t}}}\), the latent embedding \({{\bf{z}}}_{{\rm{t}}}=g({f}_{{\rm{t}}}({{\bf{x}}}_{{\rm{t}}}))\) was extracted using the pre-trained encoder. The ground-truth structure embedding \({{\bf{z}}}_{{\rm{v}}}=g({f}_{{\rm{v}}}({{\bf{x}}}_{{\rm{v}}}))\) was obtained via the vision encoder. During training, the prior model was trained to reconstruct \({{\bf{z}}}_{{\rm{v}}}\) from its noisy version \({{\bf{z}}}_{{\rm{v}}}^{(t)}\), using the following loss function:
$${{\mathcal{L}}}_{{\rm{prior}}}={{\mathbb{E}}}_{t,\epsilon }\left[{\left\Vert {\epsilon }_{\theta }({{\bf{z}}}_{{\rm{v}}}^{(t)},t,{{\bf{z}}}_{{\rm{t}}})-{{\bf{z}}}_{{\rm{v}}}\right\Vert }_{2}^{2}\right]$$
where \({\epsilon }_{\theta }\) was a causal Transformer, \({{\bf{z}}}_{{\rm{v}}}^{(t)}=\sqrt{{\bar{\alpha }}_{t}}{{\bf{z}}}_{{\rm{v}}}+\sqrt{1-{\bar{\alpha }}_{t}}\epsilon\), with \(\epsilon \sim {\mathscr{N}}(0,I)\), and \({\bar{\alpha }}_{t}\) defined by a cosine noise schedule. Classifier-free guidance (CFG) was enabled by randomly dropping the condition embedding during training. At inference, sampling was applied using:
$${\tilde{\epsilon }}_{\theta }({{\bf{z}}}_{{\rm{v}}}^{(t)},t,{{\bf{z}}}_{{\rm{t}}})=(1+w)\,{\epsilon }_{\theta }({{\bf{z}}}_{{\rm{v}}}^{(t)},t,{{\bf{z}}}_{{\rm{t}}})-w\,{\epsilon }_{\theta }({{\bf{z}}}_{{\rm{v}}}^{(t)},t,\varnothing )$$
where w is the guidance-scale hyperparameter (w = 1). Sampling was conducted using DDPM with a cosine noise schedule, and all modules were trained using the Adam optimizer with a learning rate of 1 × 10−4, weight decay of 1 × 10−2, and gradient clipping (max norm 0.5). The resulting latent vector \({\hat{{\bf{z}}}}_{{\rm{v}}}\) was then used to guide the subsequent image-level structure generation via a separate diffusion decoder.
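A compact sketch of one prior training step, under the assumptions above (cosine schedule, latent-space reconstruction target, CFG dropout), is given below; `prior` is any network with signature prior(z_noisy, t, cond).

```python
import math
import torch
import torch.nn.functional as F

def alpha_bar(t, T, s=0.008):
    """Cosine noise schedule."""
    return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2

def prior_training_step(prior, z_t, z_v, T=1000, p_drop=0.1):
    t = torch.randint(1, T + 1, (z_v.size(0),))
    ab = torch.tensor([alpha_bar(int(i), T) for i in t]).unsqueeze(1)
    eps = torch.randn_like(z_v)
    z_noisy = ab.sqrt() * z_v + (1 - ab).sqrt() * eps   # forward diffusion
    if torch.rand(()) < p_drop:          # drop the condition for CFG training
        z_t = torch.zeros_like(z_t)
    return F.mse_loss(prior(z_noisy, t, z_t), z_v)      # reconstruct z_v
```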
Diffusion decoder
The second stage was a conditional diffusion decoder that generated the microstructure \({{\bf{x}}}_{{\rm{v}}}\) from structure embeddings \({{\bf{z}}}_{{\rm{v}}}\). The decoder was also trained using the DDPM framework. For a given real structure \({{\bf{x}}}_{{\rm{v}}}\), the noisy input is:
$${{\bf{x}}}_{{\rm{v}}}^{(t)}=\sqrt{{\bar{\alpha }}_{t}}\,{{\bf{x}}}_{{\rm{v}}}+\sqrt{1-{\bar{\alpha }}_{t}}\,\epsilon ,\quad \epsilon \sim {\mathscr{N}}(0,I)$$
and the network is trained to minimize the conditional denoising loss:
$${{\mathcal{L}}}_{{\rm{dec}}}={{\mathbb{E}}}_{t,\epsilon }\left[{\left\Vert {\epsilon }_{\theta }({{\bf{x}}}_{{\rm{v}}}^{(t)},t,{{\bf{z}}}_{{\rm{v}}})-\epsilon \right\Vert }_{2}^{2}\right]$$
Inference is performed using the same CFG mechanism:
$${\tilde{\epsilon }}_{\theta }({{\bf{x}}}_{{\rm{v}}}^{(t)},t,{{\bf{z}}}_{{\rm{v}}})=(1+w)\,{\epsilon }_{\theta }({{\bf{x}}}_{{\rm{v}}}^{(t)},t,{{\bf{z}}}_{{\rm{v}}})-w\,{\epsilon }_{\theta }({{\bf{x}}}_{{\rm{v}}}^{(t)},t,\varnothing )$$
The decoder was trained for 50 epochs using the Adam optimizer with a learning rate of 1 × 10−4, weight decay of 1 × 10−2, and gradient clipping (max norm 0.5).
Finally, the full generation process can be formally expressed as a two-stage, autoregressive sampling procedure:
$$p({{\bf{x}}}_{{\rm{v}}}| {{\bf{x}}}_{{\rm{t}}})=p({{\bf{x}}}_{{\rm{v}}}| {{\bf{z}}}_{{\rm{v}}})\,p({{\bf{z}}}_{{\rm{v}}}| {{\bf{x}}}_{{\rm{t}}})$$
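End to end, inference is a short composition of the two samplers; the sketch below uses hypothetical sampler names standing in for the DDPM loops described above.

```python
import torch

@torch.no_grad()
def generate_structure(x_t, table_encoder, projector, sample_prior, sample_decoder):
    z_t = projector(table_encoder(x_t))   # 1) encode processing conditions
    z_v = sample_prior(cond=z_t)          # 2) DDPM sampling in latent space
    x_v = sample_decoder(cond=z_v)        # 3) DDPM sampling in image space
    return x_v
```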
Inverse design
To enable inverse design, a proxy model \({\mathcal{P}}({{\bf{x}}}_{{\rm{t}}})\) was trained to predict structural features from processing conditions \({{\bf{x}}}_{{\rm{t}}}\), enabling a fast and differentiable approximation of the condition-structure relationship. Given a target feature vector \({{\bf{y}}}_{{\rm{target}}}\in {{\mathbb{R}}}^{3}\), the condition \({{\bf{x}}}_{{\rm{t}}}\) was iteratively optimized to minimize the prediction error. At each optimization step, the gradient
$${\nabla }_{{{\bf{x}}}_{{\rm{t}}}}{\left\Vert {\mathcal{P}}({{\bf{x}}}_{{\rm{t}}})-{{\bf{y}}}_{{\rm{target}}}\right\Vert }_{2}^{2}$$
was computed via backpropagation, and the condition vector was updated for 120 steps using the Adam optimizer with a learning rate of 0.1.
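The optimization loop itself is a few lines of PyTorch; the sketch below assumes a six-dimensional condition vector and a differentiable `proxy` mapping conditions to the three structural features.

```python
import torch

def inverse_design(proxy, y_target, dim=6, steps=120, lr=0.1):
    x_t = torch.randn(1, dim, requires_grad=True)    # random initial condition
    opt = torch.optim.Adam([x_t], lr=lr)
    for _ in range(steps):
        loss = ((proxy(x_t) - y_target) ** 2).sum()  # squared prediction error
        opt.zero_grad(); loss.backward(); opt.step()
    return x_t.detach()                              # optimized conditions
```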
Data availability
All data are available in the main text or the Supplementary Information. The trained model and preprocessed datasets are available at https://figshare.com/s/0cad763a26f928b70840 for readers to reproduce and use.
Code availability
The code of the work is available in the GitHub repository at https://github.com/wuyuhui-zju/MatMCL.
References
Batra, R., Song, L. & Ramprasad, R. Emerging materials intelligence ecosystems propelled by machine learning. Nat. Rev. Mater. 6, 655–678 (2021).
Guo, K., Yang, Z., Yu, C.-H. & Buehler, M. J. Artificial intelligence and machine learning in design of mechanical materials. Mater. Horiz. 8, 1153–1172 (2021).
Maqsood, A., Chen, C. & Jacobsson, T. J. The future of material scientists in an age of artificial intelligence. Adv. Sci. 11, 2401401 (2024).
Merchant, A. et al. Scaling deep learning for materials discovery. Nature 624, 80–85 (2023).
Szymanski, N. J. et al. An autonomous laboratory for the accelerated synthesis of novel materials. Nature 624, 86–91 (2023).
Zeni, C. et al. A generative model for inorganic materials design. Nature 639, 624–632 (2025).
Hart, G. L. W., Mueller, T., Toher, C. & Curtarolo, S. Machine learning for alloys. Nat. Rev. Mater. 6, 730–755 (2021).
Rao, Z. et al. Machine learning-enabled high-entropy alloy discovery. Science 378, 78–85 (2022).
Jiang, L. et al. A rapid and effective method for alloy materials design via sample data transfer machine learning. npj Comput. Mater. 9, 26 (2023).
Reker, D. et al. Computationally guided high-throughput design of self-assembling drug nanoparticles. Nat. Nanotechnol. 16, 725–733 (2021).
Tao, H. et al. Nanoparticle synthesis assisted by machine learning. Nat. Rev. Mater. 6, 701–716 (2021).
Wei, Y. et al. Prediction and design of nanozymes using explainable machine learning. Adv. Mater. 34, 2201736 (2022).
Tamasi, M. J. et al. Machine learning on a robotic platform for the design of polymer–protein hybrids. Adv. Mater. 34, 2201809 (2022).
Gurnani, R. et al. AI-assisted discovery of high-temperature dielectrics for energy storage. Nat. Commun. 15, 6107 (2024).
Li, H. et al. Machine learning-accelerated discovery of heat-resistant polysulfates for electrostatic energy storage. Nat. Energy 10, 90–100 (2024).
Ge, W., De Silva, R., Fan, Y., Sisson, S. A. & Stenzel, M. H. Machine Learning in Polymer Research. Adv. Mater. 37, 2413695 (2025).
Huang, J. et al. Identification of potent antimicrobial peptides via a machine-learning pipeline that mines the entire space of peptide sequences. Nat. Biomed. Eng. 7, 797–810 (2023).
Hao, H. et al. A paradigm for high-throughput screening of cell-selective surfaces coupling orthogonal gradients and machine learning-based cell recognition. Bioact. Mater. 28, 1–11 (2023).
Choudhary, K. et al. Recent advances and applications of deep learning methods in materials science. npj Comput. Mater. 8, 59 (2022).
Xu, P., Ji, X., Li, M. & Lu, W. Small data machine learning in materials science. npj Comput. Mater. 9, 42 (2023).
Ngiam, J. et al. Multimodal deep learning. In Proc. 28th International Conference on Machine Learning 689–696 (2011).
Lee, N. et al. Density of states prediction of crystalline materials via prompt-guided multi-modal transformer. Adv. Neural Inform. Process. Syst. 36, 61678–61698 (2023).
Das, K., Goyal, P., Lee, S.-C., Bhattacharjee, S. & Ganguly, N. Crysmmnet: multimodal representation for crystal property prediction. In Proc. 39th Conference on Uncertainty in Artificial Intelligence 507–517 (2023).
Zhang, Z. et al. Multimodal deep-learning framework for accurate prediction of wettability evolution of laser-textured surfaces. ACS Appl. Mater. Interfaces 15, 10261–10272 (2023).
Zhu, L. et al. Prediction of ultimate tensile strength of Al-Si alloys based on multimodal fusion learning. MGE Adv. 2, e26 (2024).
Muroga, S., Miki, Y. & Hata, K. A comprehensive and versatile multimodal deep-learning approach for predicting diverse properties of advanced materials. Adv. Sci. 10, 2302508 (2023).
Wang, C. et al. Combinatorial discovery of antibacterials via a feature-fusion based machine learning workflow. Chem. Sci. 15, 6044–6052 (2024).
Wu, R., Wang, H., Chen, H.-T. & Carneiro, G. Deep multimodal learning with missing modality: a survey. arXiv preprint https://doi.org/10.48550/arXiv.2409.07825 (2024).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning 8748–8763 (2021).
Girdhar, R. et al. Imagebind: One embedding space to bind them all. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 15180–15190 (2023).
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint https://doi.org/10.48550/arXiv.2204.06125 (2022).
Xue, J., Wu, T., Dai, Y. & Xia, Y. Electrospinning and electrospun nanofibers: Methods, materials, and applications. Chem. Rev. 119, 5298–5415 (2019).
Ji, D. et al. Electrospinning of nanofibres. Nat. Rev. Method. Prim. 4, 1 (2024).
Poklukar, P. et al. Geometric multimodal contrastive representation learning. In Proc. 39th International Conference on Machine Learning 17782–17800 (2022).
Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In Proc. 37th International Conference on Machine Learning 1597–1607 (2020).
Gorishniy, Y., Rubachev, I., Khrulkov, V. & Babenko, A. Revisiting deep learning models for tabular data. Adv. Neural Inform. Process. Syst. 34, 18932–18943 (2021).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. 9th International Conference on Learning Representations (2021).
Tsai, Y.-H. H. et al. Multimodal transformer for unaligned multimodal language sequences. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 6558–6569 (2019).
Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J. & Aila, T. Improved precision and recall metric for assessing generative models. Adv. Neural Inform. Process. Syst. 32, 3927–3936 (2019).
Li, H. et al. A knowledge-guided pre-training framework for improving molecular representation learning. Nat. Commun. 14, 7568 (2023).
Sanchez-Fernandez, A., Rumetshofer, E., Hochreiter, S. & Klambauer, G. CLOOME: contrastive learning unlocks bioimaging databases for queries with chemical structures. Nat. Commun. 14, 7339 (2023).
Liu, S. et al. Multi-modal molecule structure–text model for text-based retrieval and editing. Nat. Mach. Intell. 5, 1447–1457 (2023).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inform. Process. Syst. 33, 6840–6851 (2020).
Ho, J. & Salimans, T. Classifier-free diffusion guidance. arXiv preprint https://doi.org/10.48550/arXiv.2207.12598 (2022).
Obukhov, A. & Krasnyanskiy, M. Quality assessment method for GAN based on modified metrics inception score and Fréchet inception distance. In Proc. 4th Computational Methods in Systems and Software 102–114 (2020).
Yarin, A. L., Koombhongse, S. & Reneker, D. H. Taylor cone and jetting from liquid droplets in electrospinning of nanofibers. J. Appl. Phys. 90, 4836–4846 (2001).
Zuo, W. et al. Experimental study on relationship between jet instability and formation of beaded fibers during electrospinning. Polym. Eng. Sci. 45, 704–709 (2005).
Shenoy, S. L., Bates, W. D., Frisch, H. L. & Wnek, G. E. Role of chain entanglements on fiber formation during electrospinning of polymer solutions: good solvent, non-specific polymer–polymer interaction limit. Polymer 46, 3372–3384 (2005).
Dosunmu, O., Chase, G. G., Kataphinan, W. & Reneker, D. Electrospinning of polymer nanofibres from multiple jets on a porous tubular surface. Nanotechnology 17, 1123 (2006).
Jiang, S. et al. Electrospun nanofiber reinforced composites: a review. Polym. Chem. 9, 2685–2720 (2018).
Liu, T., Feng, F. & Wang, X. Multi-stage pre-training over simplified multimodal pre-training models. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Vol. 1, 2556–2565 (Association for Computational Linguistics, 2021).
Agarwal, S., Wendorff, J. H. & Greiner, A. Progress in the field of electrospinning for tissue engineering applications. Adv. Mater. 21, 3343–3351 (2009).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).
Acknowledgements
This work was supported by the National Key Research and Development Program of China (2022YFB3807300), Zhejiang Provincial Natural Science Foundation of China (LR25E030001), the Key Research and Development Project of Zhejiang Province (2024C03073), the financial support from the State Key Laboratory of Transvascular Implantation Devices (012024019), and Transvascular Implantation Devices Research Institute China (TIDRIC) (KY012024007, KY012024009).
Author information
Authors and Affiliations
Contributions
J.J. and P.Z. conceptualized, supervised, and funded the project. Y.W. designed the overall framework and was responsible for the development, training, and analysis of the machine learning method, as well as the characterization of mechanical properties and microstructure. M.D. and Q.W. prepared the electrospun nanofibers. H.H. was responsible for the fabrication of nanofiber-reinforced composite materials. All authors reviewed and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wu, Y., Ding, M., He, H. et al. A versatile multimodal learning framework bridging multiscale knowledge for material design. npj Comput Mater 11, 276 (2025). https://doi.org/10.1038/s41524-025-01767-3