Main

Promising new research directions often arise from combining concepts that have not previously been investigated together1. While experienced scientists possess vast domain knowledge that enables them to thoroughly explore research topics within (and adjacent to) their areas of expertise, finding new connections between their own research topics and other, as-yet-unfamiliar topics to foster new ideas and findings is inherently challenging. Machine learning (ML) methods can help scientists look beyond their personal areas of expertise by identifying previously unthought-of combinations of research topics, thus enabling the exploration of a vast hypothesis space beyond human intuition2,3.

Scientific information is contained in a plethora of research publications in a rich but unstructured manner, and this lack of structured information poses challenges for automated analysis4,5. Focusing on the extensive domain of materials science, we first examine how to systematically extract the main concepts of scientific articles, namely keywords or key phrases. Recent breakthroughs in natural language processing now allow us to extract structured data from text and process it automatically6,7,8,9,10,11. Here we investigate whether large language models (LLMs) can offer improvements over traditional algorithmic methods in this extraction process.

After identifying and extracting the concepts and their connections (that is, the co-occurrence in the same article), we investigate how to use this information to predict new combinations of concepts. In a previous study, Krenn et al. proposed SemNet, a graph that tracks the evolution of scientific literature in the domain of quantum physics12. The nodes of the SemNet graph are concepts, that is, keywords extracted from text, using an algorithm called rapid automatic keyword extraction (RAKE) in conjunction with some predefined rules13. Apart from analysing emerging trends within SemNet, the authors use changes in the graph to make predictions. To this end, they derive topological properties, such as node degree, and use them as input for a neural network (NN) to predict future connections. In a Kaggle challenge, the participants predicted changes in a SemNet built from the artificial intelligence (AI) literature14. While the most successful models combined specific hand-selected network features with ML techniques, such as NNs or graph NNs (GNNs), other participants employed purely theoretical or end-to-end ML approaches. However, the participants’ models could only use the structure of SemNet because the real meaning behind the nodes was not revealed in the challenge.

In this study, the information on materials science concepts is similarly compressed into a concept graph. Given the advances in language encoder models15,16, we use the MatSciBERT model17 to provide additional information on the concepts in the form of semantic embeddings to enrich the topological information of the nodes. Then, we explore how ML methods can use the time evolution of this representation of the literature to perform link predictions.

Recent advances have shown that graph-based approaches can accelerate discovery in materials science: SciAgents employs multi-agent graph reasoning, Graph-PRefLexOR integrates symbolic graph abstractions with LLMs and generative knowledge extraction with graph representations further supports hypothesis generation18,19. Complementary efforts on AI-driven ideation include SciMuse and SciMON, which use enriched co-occurrence and temporal knowledge graphs for idea generation, ResearchAgent, which iteratively refines literature-grounded ideas with knowledge-augmented LLMs, and SCI-IDEA, which applies context-aware embeddings for systematic ideation20,21,22,23. In contrast to approaches that analyse understanding, intelligence and creativity in general and try to evoke these in machines24, we aim to foster human creativity by using AI to help materials scientists propose new research directions by combining previously uncombined concepts. To explore the real-world applicability of our model and its suggestions, we conducted interviews with materials sciences researchers to assess how well the concepts generated and suggested by our model align with concepts from their own research.

Results

Concept extraction and concept graph

Using an LLM-based approach (Methods), approximately 510,000 chemical formulae and 3,600,000 concepts were extracted from the 221,000 abstracts in our database, which corresponds to an average of 2.3 chemical formulae and 16.3 concepts per abstract. The extracted concepts were then condensed into approximately 52,000 unique formulae and 1,241,000 unique concepts by removing duplicates. In general, our method resulted in more precise concept extraction than rule-based approaches (see Supplementary Note 1 for details). Due to the extraction capabilities of LLMs, the amount of manual annotation work needed to generate the initially required data is negligible, especially as our iterative approach (Fig. 1; Methods) reduces the manual effort to a minimum. Notably, the fine-tuned LLMs were able to extract concepts that were not present verbatim in the text. Table 1 shows selected examples to demonstrate the capabilities of the fine-tuned LLMs for nominalization, the removal of fill words such as ‘of’, plural-to-singular conversion and formatting corrections.

Table 1 Selected examples of abstracts and concepts extracted by our fine-tuned Llama-2-13B model
Fig. 1: Generation of labelled data.

Manual labelling (concept extraction) of 100 abstracts, fine-tuning of an LLM base model on the annotated data, automatic concept extraction from 100 further abstracts with human correction, and repeated fine-tuning of the base LLM on the new, extended labelled dataset.

To construct a concept graph, we only include concepts that appear at least three times and consist of at least two words. The resulting graph comprises approximately 137,000 nodes and 13,000,000 edges, which makes it feasible to calculate topological features that require squaring the adjacency matrix. Supplementary Table 1 presents an overview of the 25 most frequently encountered concepts and formulae.
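The graph construction step described above can be sketched as follows, assuming each paper has already been reduced to its set of extracted concepts (function and variable names are illustrative, not from the study's code):

```python
from collections import Counter
from itertools import combinations

def build_concept_graph(papers, min_count=3, min_words=2):
    """Build an undirected concept co-occurrence graph.

    papers: list of sets of concept strings (one set per abstract).
    A concept becomes a node only if it occurs in at least
    `min_count` papers and consists of at least `min_words` words.
    """
    counts = Counter(c for paper in papers for c in paper)
    nodes = {c for c, n in counts.items()
             if n >= min_count and len(c.split()) >= min_words}
    edges = set()
    for paper in papers:
        kept = sorted(nodes & paper)
        # every pair of concepts co-occurring in one abstract forms an edge
        edges.update(combinations(kept, 2))
    return nodes, edges
```

Storing edges as unordered pairs of sorted concept names keeps the graph undirected and deduplicates repeated co-occurrences across papers.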

An analysis of the node degree distribution in the resulting graph shows that the majority of the nodes have a degree between 30 and 1,000 (Supplementary Fig. 8). While a few concept nodes act as hubs with a notably larger number of connections than the rest, most of the concepts in the graph are directly linked to only a few others, making the resulting graph sparse. The evolution of the concept graph over time shows that connectivity further increases as more papers using already existing concepts are published. We also observe an increase in concept centralization, that is, fewer and fewer nodes account for a larger share of the total connections (Supplementary Fig. 9).

We visualize all concepts by projecting their high-dimensional concept embeddings to two dimensions using the uniform manifold approximation and projection (UMAP)25 technique with default settings. Figure 2 displays the result, which we call the ‘Map of materials science’ (an interactive version can be explored at inspire.aimat.science). We then run nearest neighbour queries26 on the concept embeddings to explore whether these 768-dimensional vectors capture semantic meaning. The example queries listed in Supplementary Table 2 show the striking similarity between the queried concept and its nearest neighbours.
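A nearest-neighbour query on the 768-dimensional concept embeddings can be sketched with plain cosine similarity (a minimal sketch with illustrative names; the study's actual query implementation may differ):

```python
import numpy as np

def nearest_neighbours(query_vec, embeddings, names, k=3):
    """Return the k concepts whose embeddings are most
    cosine-similar to the query embedding.

    embeddings: (n, d) array of concept vectors, names: list of n labels.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = emb @ q                      # cosine similarity to each concept
    order = np.argsort(-sims)[:k]       # indices of the k largest similarities
    return [(names[i], float(sims[i])) for i in order]
```

Excluding an exact self-match (similarity 1.0 for the queried concept itself) is left to the caller.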

Fig. 2: Map of materials science.

Two-dimensional UMAP25 projection of all extracted concepts, with the highest-degree concept in each square tile of side length 2 highlighted and annotated (‘Highest degree in tile’). Yellow and purple background colours respectively indicate high and low concept densities calculated using kernel density estimation.

Link prediction

To statistically assess the performance of our different link prediction models (see Methods for a detailed description), we evaluate their performance on a held-out test set for edge formation in the period between 2020 and 2022. The test set consisted of 2,000,000 node pairs, including 307 (0.015%) positive pairs, that is, emerging edges. We complement this with a qualitative analysis of the real-world applicability of the models based on human expert knowledge.

Figure 3a shows the receiver operating characteristic (ROC) curves for predicting link formation during the test period, as they illustrate the capacity of the models to distinguish between classes across all possible classification thresholds. ROC curves are particularly useful for imbalanced datasets because they evaluate performance independently of class distribution27. More information about test set creation, along with detailed results (Precision/Recall@k), can be found in Supplementary Note 3. Although the ‘Baseline’ model (a modified version of Krenn et al.14) performs slightly better (area under the curve (AUC) 0.9109) than the Concept embeddings (MatSciBERT) model (AUC 0.8855), the performance of the latter already shows that our model architecture can use the semantic information contained in the concept embeddings. A GNN model based on the GraphSAGE architecture surpasses the Baseline model with an AUC of 0.9288, suggesting that, while both models access the same input features, the GNN effectively leverages additional structural signals to improve performance. The Pure text baseline (implemented via a fine-tuned MatSciBERT17) also exploits this semantic information and performs similarly overall, but at a 5× higher inference cost; moreover, it achieved worse results for emerging links between nodes that were previously connected through more than one intermediate node. Furthermore, the performance of the three hybrid models demonstrates that the link prediction task benefits from incorporating semantic knowledge on top of local graph features: while the Combination of features model already shows a slightly improved AUC of 0.9147, the Mixture of Baseline and Embeddings and Mixture of GNN and Embeddings models exhibit a substantial performance leap in the AUC metric, which reaches a maximum of 0.9433 when weighting the GNN and Concept embeddings model predictions equally (0.5 each).
We speculate that gradient descent optimization on a unified feature vector (concatenating the features of the Baseline and Concept embeddings model), as is done in the Combination of features model, might not be as effective as optimizing the models individually. The distinct nature of baseline features versus high-dimensional concept embeddings could lead to the gradient for each batch becoming a suboptimal compromise between the gradients suited for each feature type in isolation. While MatSciBERT may under-represent emerging or interdisciplinary concepts, it still benefits from the knowledge of the underlying BERT base model, and subword tokenization ensures meaningful embeddings (Fig. 4; Methods). In our experiments, MatSciBERT (AUC 0.8855) outperformed BERT (AUC 0.8547), indicating an advantage of domain-specific embeddings, although BERT still offers a reasonable baseline.
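The equal-weight score mixture and its AUC evaluation can be sketched as follows; this is a minimal sketch (illustrative names), with the AUC computed via the equivalent Mann–Whitney rank statistic rather than explicit curve integration:

```python
import numpy as np

def mixture_scores(scores_a, scores_b, w=0.5):
    """Blend two models' link scores; w=0.5 weights both equally,
    as in the best-performing mixture reported above."""
    return w * np.asarray(scores_a) + (1.0 - w) * np.asarray(scores_b)

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a
    random negative (Mann-Whitney U formulation, midranks for ties)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    order = scores.argsort()
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):        # average ranks over tied scores
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

The rank formulation is convenient for the heavily imbalanced test set because it never needs an explicit threshold sweep.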

Fig. 3: Performance metrics (ROC and the respective AUC) for our link prediction models on the test set (Ttest = [2020, 2022]).

Markers highlight the performances at a threshold of 0.5. a, ROC curves on all data points with a zoomed-in view of the low-false-positive-rate region in the inset. b,c, The respective performance metrics for dprev = 2 (b) and dprev = 3 (c). Best result is in bold.

Fig. 4: Example of calculating concept embeddings from an abstract.

Embeddings of verbatim concepts (‘mechanical stress’) are calculated by averaging all local MatSciBERT embeddings of the corresponding tokens (4,487 and 1,893). Embeddings of non-verbatim concepts (‘nitride film’ is present in the abstract only in its unnormalized form, ‘nitride films’) are calculated as the average of all token embeddings. x represents the embedding vectors of the tokens.
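The averaging scheme described in the caption can be sketched as follows (a minimal sketch; function and argument names are illustrative):

```python
import numpy as np

def concept_embedding(token_ids, token_embs, concept_token_ids=None):
    """Average token embeddings to obtain one concept vector.

    If the concept occurs verbatim, `concept_token_ids` selects the
    embeddings of exactly those tokens; for a non-verbatim concept
    (concept_token_ids=None) all token embeddings of the abstract
    are averaged instead.
    """
    token_embs = np.asarray(token_embs)
    if concept_token_ids is None:
        return token_embs.mean(axis=0)
    mask = np.isin(token_ids, concept_token_ids)
    return token_embs[mask].mean(axis=0)
```

A concept occurring several times in one abstract contributes all of its token occurrences to the average.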

We further investigated the predicted new links with regard to the previous node distance dprev, that is, the shortest-path distance between two nodes before they become directly connected in the time range of the test set Ttest = [2020, 2022]. An analysis of the graph shows its dense interconnectedness: certain prevalent concepts in materials science, such as ‘mechanical property’ and ‘X ray diffraction’ (Supplementary Table 1), have edges to the majority of nodes, which leads to short distances between many concept pairs. Although the graph consists of 137,000 nodes, nearly all of the unlinked concept pairs in the test set were already connected through one (dprev = 2, 43.3%) or two (dprev = 3, 56.5%) intermediate concepts. The distribution of dprev is even more biased towards short paths for the positive samples, that is, the connections that actually formed during the test period. Of the 307 emerging edges in the test set, 290 (94.5%) were found to have dprev = 2, while only 17 (5.5%) had a previous distance of 3, which shows that the proximity of two nodes in the concept graph increases the probability of a new edge forming between them. Samples at dprev = 4 were scarce (0.2% of the total samples, all negatives) and were therefore excluded from further analysis.
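The previous node distance can be computed with a plain breadth-first search on the pre-test-period graph (a minimal sketch with illustrative names; the study may use a different shortest-path routine):

```python
from collections import deque

def previous_distance(adj, u, v):
    """Shortest-path distance between concepts u and v via BFS.

    adj: adjacency mapping node -> iterable of neighbours.
    dprev = 2 means one shared neighbour, dprev = 3 means two
    intermediate concepts; unreachable pairs return infinity.
    """
    if u == v:
        return 0
    seen = {u}
    queue = deque([(u, 0)])
    while queue:
        node, d = queue.popleft()
        for nb in adj.get(node, ()):
            if nb == v:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return float("inf")
```

Because nearly all test pairs have dprev ≤ 3, the search can in practice be cut off at depth 4.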

While the Baseline model tends to correctly predict emerging edges primarily at a distance of 2 (212 of 213 true positives have dprev = 2) with a recall of 73.1%, it performs much worse for dprev = 3 (recall 5.9%). By contrast, the Concept embeddings model achieves a significantly better recall of 35.3% (P < 0.05, DeLong test28) for dprev = 3 while only slightly compromising on the recall at dprev = 2 (70.0%). Notably, the GNN model matches this performance at dprev = 3, demonstrating that improved structural processing can rival the benefits of semantic embeddings for distant connections. The results are summarized in Extended Data Table 1. The high number of false positives, especially at dprev = 3, is not a problem in itself because those combinations may remain scientifically plausible and will subsequently be evaluated by human scientists. Hence, we prioritize recall over precision in order not to miss valuable ideas.

Note that optimizing the classification metrics by changing the classification threshold of 0.5 for a positive prediction is outside the scope of this work, which mainly aims at rating non-existing links with respect to their future emergence rather than accurately predicting whether a new link will form or not.

In addition, we also separately calculated the ROC curves and the corresponding AUCs for both dprev = 2 and dprev = 3, and the results (Fig. 3b,c) emphasize the Baseline model’s failure to correctly categorize most positives with dprev = 3. This not only highlights the inherent challenge of predicting positives at greater distances but also indicates that integrating semantic information enhances the model’s ability to forecast connections between concept pairs that are further apart in the graph. However, these emerging connections with larger previous node distances are particularly interesting and hold great potential to broaden the scientific scope beyond the more obvious new research directions. Ultimately, the Mixture of GNN and Embeddings yields the highest AUC for these distant connections, outperforming the individual models by effectively combining structural and semantic signals.

We also evaluated our Baseline model on the Science4Cast benchmark, where it achieved an area under the ROC curve of 0.9088, ranking second among all approaches reported by Krenn et al.14. This finding demonstrates that a deep NN trained on a large set of semantically meaningful features can outperform most competing methods, including those based on common neighbours and node2vec embeddings combined with a Transformer architecture6,29,30. As Science4Cast does not contain any semantic or text information about the meaning of nodes, we could not apply and benchmark our embedding-based models on the Science4Cast challenge.

Analysis of human expert evaluation

As the second part of the model performance analysis, we conducted interviews with ten materials scientists (human experts). Each expert received an individualized report containing concept combinations recommended by the Mixture of Baseline and Embeddings model; this model is marginally less performant (<0.01 AUC) than the Mixture of GNN and Embeddings model but was used because the GNNs were only tested at a later stage of our study. The suggestions were subsequently discussed in the interviews to assess and clarify the proposed concept combinations. The small sample size of the interviewees and a potential selection bias limit the robustness of our conclusions and allow only a qualitative analysis and anecdotal findings. Nonetheless, the expert feedback sheds light on the usefulness of the suggestions provided by our model.

Report generation

An overview of the report generation is shown in Extended Data Fig. 1. First, a set of individualized concepts Cown is generated as the intersection of (1) all concepts extracted from the abstracts of the recent publications of the respective researcher and (2) all known concepts Cknown in the concept graph. Based on these two sets of concepts, we generated researcher-specific suggestions, that is, combinations of concepts: the first two report sections, Sown×own and Sown×other, contain the top 25 combinations of the researcher's own concepts with themselves and with all other concepts, respectively. We applied two heuristics (avoiding generic concepts and avoiding combinations that are too similar or too unrelated based on their semantic embeddings) to filter the suggestions in the second category, resulting in a section \({S}_{{\rm{own}}\times {\rm{other}}}^{{\rm{filtered}}}\). To take the researcher’s full profile into account, the next section S(many own)×other contains the top 20 concepts with highly scored connections to many of the researcher's own concepts. For the final section of the report (‘LLM curation’), LLMs were queried to select interesting combinations from the previous sets of combinations and to write a short paragraph with more information on how the concepts can be combined and why these specific combinations are promising new research directions. A technical definition of each section is given in Supplementary Note 4.
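The similarity-band filtering heuristic for the filtered report section can be sketched as follows; the thresholds and names here are illustrative assumptions, not the values used in the study:

```python
import numpy as np

def filter_suggestions(pairs, emb, lo=0.2, hi=0.8, generic=frozenset()):
    """Keep concept pairs whose embedding cosine similarity lies in a
    band: not too unrelated (>= lo) and not too similar (<= hi),
    and drop pairs involving overly generic concepts.

    pairs: iterable of (concept_a, concept_b)
    emb:   dict mapping concept -> embedding vector
    """
    kept = []
    for a, b in pairs:
        if a in generic or b in generic:
            continue
        va, vb = np.asarray(emb[a]), np.asarray(emb[b])
        sim = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
        if lo <= sim <= hi:
            kept.append((a, b))
    return kept
```

The band encodes the intuition stated above: near-duplicate concepts yield trivial suggestions, while nearly orthogonal ones tend to be nonsensical.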

Classification of the suggestions

Based on the individual reports described above, a 30-minute interview was conducted with each researcher, in which the suggested combinations of concepts were classified as already known (A), nonsensical or not understandable (B) and novel, interesting or inspiring (C). In section 4 (S(many own)×other), suggestions were generously already counted as overall interesting (category C) if one of the many concepts in Cown was inspiring in conjunction with the proposed other concept. To account for cases in which the participants were unsure whether to label a suggestion as B or C, an additional category D was introduced in the analysis. For further analysis, the first class was divided into already published combinations (A1), which were likely missed during dataset generation (for example, very recent publications or publications outside of the analysed literature corpus) and obvious, trivial or very general combinations (A2), which are not necessarily mentioned together in an abstract.

Of 292 categorized suggestions, 71 were classified by the interviewees as already known (class A1), 36 as trivial (class A2), 99 as nonsense (class B), 77 as interesting (class C) and 9 as uncertain (class D); an overview of the categorized suggestions is given in Table 2. Thus, overall, the interviewees considered 26% of all suggested concept combinations interesting. An excerpt of combinations per category can be found in Supplementary Tables 7 and 8. As mentioned above, the number of interview partners is insufficient for a reliable statistical analysis; for example, the total number of classifications per researcher ranged between 18 and 48, with per-participant variance ranging from 5.04 to 51.36. An overview of the classified combinations per researcher and the per-participant variance can be found in Supplementary Fig. 20 and Supplementary Table 6.

Table 2 Amount of suggestions categorized by researchers across all interviews, broken down by section of the report

To evaluate the usefulness of the ‘LLM curation’ approach, we analysed how many of the combinations suggested by an LLM were later labelled as interesting. Supplementary Table 5 presents the confusion matrix of the variables ‘Suggested by an LLM’ and ‘Is Interesting’ (rated as C by a human expert). We observed a rounded precision of 45% for the LLM selecting interesting concepts; that is, of the 53 concepts suggested by the LLM as interesting, 24 were also labelled as interesting by the scientists. This is a substantial improvement in precision compared with the base rate of 61 interesting combinations among the 266 (23%) analysed in total. In this context, the recall is not of primary interest, as the LLM was only allowed to select a limited number of combinations as described above. An analysis of the previous node distances of the suggested combinations across all reports showed that 5 of 9 concept pairs at dprev = 3 were rated as category C. This high ratio of interesting combinations underpins our previous assumption that including semantic information in the prediction model increases its capability to foster out-of-the-box thinking.

In retrospect, groundbreaking ideas sometimes seemed absurd at first. Interviewees repeatedly categorized a combination as B (‘nonsense’) only to change their minds after some reconsideration or when seeing that suggestion again in the ‘LLM Curation’ section with an elaboration on how the combination could be realized. The exemplary paragraph often helped researchers to judge the usefulness of concept combinations. We speculate that the task of generating one’s own hypotheses on how concepts could be connected is inherently more difficult than judging an existing proposal.

Furthermore, many combinations could not be classified, especially in the sections ‘Other Concept Combinations’ and ‘Filtered Other Concept Combinations’. This is attributable to the vast number of concepts in Cknown, many of which the interviewees had never heard of. To allow researchers to navigate unknown suggestions, additional context, such as the original abstract, might prove helpful.

Examples of human expert evaluation

To illustrate the value of the suggested combinations of concepts in more detail, we discuss five selected examples of category C suggestions by putting them into context and describing why these combinations are interesting. A more detailed discussion of all five concept combinations can be found in Supplementary Information.

Suggestion 1 (‘conventional ceramic’ + ‘graphene oxide’)

This suggestion associates ‘conventional ceramics’ with ‘graphene oxide’, two domains seldom combined. Conventional oxide ceramics provide chemical, thermal and structural stability. Graphene oxide offers a high-surface-area, electronically conductive carbon framework. Their union could yield composites marrying ceramic robustness with rapid charge and heat transport, relevant to batteries, catalysis and thermal barriers. Existing studies mix pre-synthesized oxides with graphene derivatives, giving limited interfacial contact. We show preliminary data of a ~200-nm iron oxide shell on multilayer graphene in Extended Data Fig. 2 (unpublished). The in situ process creates intimate oxide–graphene interfaces and a continuous conductive network. Electrochemical tests show a high reversible capacity and enhanced redox kinetics in Li-ion conversion cells. These findings demonstrate the model’s capacity to provide inspiration for overlooked but feasible synthesis strategies. Systematic exploration of AI-highlighted ceramic/graphene hybrids may accelerate multifunctional material discovery.

Suggestion 2 (‘tensile strain’ + ‘molecular architecture’)

Thin-film organic and perovskite solar cells comprise multi-layers with mismatched thermal expansion coefficients. Temperature excursions during coating, annealing and operation impose tensile strain at these organic–inorganic interfaces. Such strain drives delamination and point defect formation, accelerating performance loss. While strain engineering is routine in inorganic semiconductors, it is rarely applied in soft matter photovoltaics. Molecular architecture provides a complementary lever: greater torsional flexibility lowers the film modulus and dissipates stress. Thus, the AI-proposed link ‘tensile strain + molecular architecture’ highlights an under-exploited stability pathway. In 2024, Brabec and Friederich demonstrated that hole transport layers containing triphenylamine derivatives achieve enhanced power conversion efficiency by accommodating strain31. These studies corroborate strain-aware molecular design as a broadly applicable interface strategy. Systematic exploration could extend device lifetimes without compromising performance.

Suggestion 3 (‘multiphase structure’ + ‘selective laser melting’)

Microstructure denotes all internal structural features—from lattice arrangement to point, line, planar and volumetric defects—across relevant length scales. These features dictate the mechanical and functional response and thus allow a rational selection of materials. A central attribute is the spatial distribution of phases, each possessing uniform crystal structure and composition. Technical alloys and ceramics are typically multiphase, generated through controlled thermo-mechanical processing. Phase topology manipulation enables simultaneous optimization of strength, toughness, corrosion resistance and functional properties. Selective laser melting (SLM) fabricates components by layer-wise laser melting of metal powders directly from digital models. The extreme heating-cooling rates inherent in SLM impose strong non-equilibrium solidification conditions. The resulting parts frequently exhibit metastable, compositionally heterogeneous multiphase microstructures. These structures can elevate hardness and corrosion resistance, yet they may also induce residual stresses. Therefore, elucidating phase-formation pathways during SLM represents a critical avenue for advanced materials design.

Suggestion 4 (‘stress-induced phase transformation’ + ‘hexagonal boron nitride’)

Stress-induced phase transformation toughening, exemplified by the tetragonal to monoclinic switch in zirconia, suppresses crack advance through local volume expansion. An alternative route uses elastic anisotropy; in pearlitic wires, load-parallel micro-cracks raise both strength and toughness. Applying these principles to boron nitride asks whether hexagonal BN (h-BN) can function as a transformation- or anisotropy-assisted toughener. Cubic BN (c-BN) is a dense, superhard phase, and pressure-driven c-BN to h-BN transitions could release crack-tip stresses. h-BN displays strong in-plane versus out-of-plane stiffness contrast, enabling guided micro-crack arrays akin to the pearlite mechanism. A coupled h-BN anisotropy and c-BN/h-BN transformation would thus offer simultaneous crack deflection and compressive shielding. Recent c-BN/h-BN composites show higher hardness and fracture energy, indicating technological promise. However, the role of a reversible transformation in these gains remains experimentally unverified. Targeted high-pressure mechanical tests with in situ diffraction are required to resolve transformation kinetics and toughening contributions. Establishing these links could generalize anisotropy-assisted transformation toughening to lightweight nitride coatings.

Suggestion 5 (‘in-plane polarization’ + ‘organic solar cell’)

Ferroelectric in-plane polarization, recently demonstrated in MAPbI3 perovskites, spatially separates photocarriers and channels them towards electrodes. The model’s suggestion indicates that similar lateral dipole fields could be engineered in organic absorbers. Asymmetric polar moieties, oriented during self-assembly or within covalent organic frameworks, may supply the required non-centrosymmetry. The resulting internal field should enhance carrier separation and transport, while a higher dielectric constant lowers exciton binding and monomolecular recombination. Ferroelectricity has so far been verified only for halide perovskites32,33,34, and is absent in silicon or conventional organic cells35,36. Piezoelectric polymers such as polyvinylidene fluoride already exploit oriented dipoles in sensing devices, suggesting viable processing routes. Earlier attempts to raise organic permittivity show limited success, and deliberate in-plane polarization remains unexplored37. Hence, the predicted concept defines a tractable, new direction for photovoltaic materials research.

Discussion

In the first part of this study, we showed that the power of LLMs, especially Llama-2-13B, can be harnessed to extract scientific concepts, vaguely defined as key phrases, from scientific texts. We established a methodology for fine-tuning open-source LLMs on a small set of manually labelled abstracts, which guides the LLM to extract only relevant concepts. The initial training data can be iteratively extended with human-corrected LLM annotations to further improve the extraction process, and no human verification of the final 221,000 automatically labelled data points is required. Follow-up studies may investigate whether prioritizing quality over quantity38 in the annotated training examples, by using fewer but carefully selected data points, could yield more accurate and representative extracted concepts, and whether including synthetic data can help to accelerate the annotation process and further enhance model performance.

In addition, we created a concept graph, derived from the previously extracted materials science concepts and the dates of the corresponding publications. This graph was successfully used to predict emerging links between previously unconnected concepts, underscoring that a simple graph representation suffices for this task. Finally, we demonstrated that integrating semantic knowledge in the form of concept embeddings boosts the predictive performance of our model. Combining the GNN approach with semantic features is possible and will be explored in future work. We investigated the usefulness of our model in a real-world scenario through qualitative interviews with domain experts, who rated 77 out of 292 (26%) generated recommendations as interesting. While this rate may sound modest, each 30-minute session still yielded several promising ideas, making the outcome practical for guiding research.

In summary, we demonstrated that ML tools can be used to automatically process the vast amount of scientific literature and to predict future research directions that have not previously been explored, thereby fostering innovation and advancement. While this work focused on materials science as a use case, the developed approach can easily be extended to other research areas. By suggesting potential new research directions, we hope to drive innovation and collaboration in the field.

Methods

The key steps of our approach are depicted in Extended Data Fig. 3. After gathering the abstracts of a large number of research publications in the domain of materials science, we extracted the main concepts, that is, short key phrases consisting of a few words, from these abstracts and used them as nodes in a concept graph that mirrors the (time-dependent) connectivity of materials science concepts in the literature. In the final step of our workflow, we performed link prediction on this graph based on both network properties (for example, connectivity information) and semantic knowledge about the concepts, captured in aggregated word embeddings.

Dataset

We prepared a dataset of published papers related to materials science. Data were obtained from OpenAlex by querying all publications listed in materials science-related journals, conferences and other venues39. The retrieved papers were filtered based on language, length and whether they had an abstract. For each publication, the title and abstract were cleaned and concatenated. Chemical formulae were extracted, stored separately and later merged with the extracted concepts. The resulting dataset comprised approximately 221,000 articles published between 1955 and 2022, with the relevant attributes being ‘title’, ‘abstract’ and ‘publication date’. A more detailed description of the dataset generation is given in Supplementary Note 7.
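The filtering and cleaning pipeline can be sketched as follows. The field names follow the OpenAlex works schema, but the language check, length threshold and cleaning rules are illustrative assumptions rather than the exact criteria used in this work:

```python
# Sketch of the dataset filtering/cleaning step. Field names ("language",
# "title", "abstract", "publication_date") follow the OpenAlex works schema;
# the thresholds below are illustrative assumptions, not the paper's values.
import re

def clean_text(text: str) -> str:
    """Drop residual markup and collapse whitespace in a text field."""
    text = re.sub(r"<[^>]+>", " ", text)   # strip stray HTML tags
    return re.sub(r"\s+", " ", text).strip()

def prepare_records(works, min_abstract_words=50):
    """Keep English works with a sufficiently long abstract and return
    cleaned 'title + abstract' strings with their publication dates."""
    dataset = []
    for w in works:
        abstract = w.get("abstract")
        if w.get("language") != "en" or not abstract:
            continue
        if len(abstract.split()) < min_abstract_words:
            continue
        text = clean_text(w["title"]) + " " + clean_text(abstract)
        dataset.append({"text": text, "date": w["publication_date"]})
    return dataset
```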

Concept extraction

Previous work used RAKE for concept extraction in conjunction with manual filtering to remove errors. These errors included phrases that did not represent semantic information and that were introduced by imperfections in RAKE’s statistical analysis12,13,14. Instead, we opted to extract concepts using fine-tuned LLMs (Fig. 1). To create a dataset for fine-tuning, 100 randomly chosen abstracts were first manually annotated by extracting and partially adjusting or even paraphrasing relevant and meaningful concepts. Manual annotation is particularly sensitive to the labeller because there is no unique way of extracting and defining concepts. Subsequently, we fine-tuned the Llama-2-13B base model40,41 on our manually annotated abstracts for 4 epochs, using a learning rate of 5 × 10−4 and a batch size of 1 (Supplementary Note 8). Llama-2 models were state of the art when this work was performed in 2023; future iterations of this work will use newer models. The model size is a trade-off between accuracy and cost: it is the largest model that can process 20 abstracts at once on an A100 GPU with 80 GB of video random access memory. To accelerate training and especially inference, we incorporated 8-bit quantization42 and low-rank adaptation techniques43,44 using Hugging Face’s parameter-efficient fine-tuning module45.
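The exact prompt layout used for fine-tuning is not reproduced here; a hypothetical instruction-style training example (the field names, wording and concept list are purely illustrative) could look like:

```json
{
  "prompt": "Extract the materials science concepts from the following abstract:\n<abstract text>",
  "completion": "[\"thin film deposition\", \"X-ray diffraction\", \"perovskite solar cell\"]"
}
```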

Similar to Dunn et al.’s assisted annotation process4, the fine-tuned model’s outputs were compared in the third step to the concepts extracted using GPT-3.57 to efficiently identify and correct common mistakes made by the fine-tuned model. To do so, we labelled 100 additional abstracts, and the base model was again fine-tuned on the resulting larger dataset of 200 abstracts. This process of iteratively adding more automatically extracted and manually corrected concepts could, in principle, be repeated further, but 200 labelled abstracts were sufficient for our use case. Finally, the resulting model was employed to extract concepts from the approximately 221,000 abstracts in our dataset, requiring approximately 160 GPU hours. Future updates of our concept graph would be substantially less demanding, as only incremental (delta) extraction is required. Future developments in LLM research might enable a complete re-evaluation with higher quality and reliability. After extraction, we conducted minor post-processing of the extracted concepts by removing the remaining plural forms. We note that the selection of 200 abstracts (containing 3,102 distinct concepts) introduced some bias, as they were all from materials science; for example, our method did not extract core-biology concepts.

Concept graph

The concepts extracted from the materials science literature in our dataset are represented in a multi-graph G = (V, E), where V and E are the respective sets of nodes and edges. In this concept graph, each node v ∈ V represents a distinct concept and each edge e ∈ E represents the co-occurrence of two concepts in a single abstract. Each edge is labelled with a timestamp t, indicating the publication date of the abstract containing both concepts, and Gt denotes the subgraph of G that includes only the edges with timestamps ≤ t. Therefore, every abstract generates a fully connected clique of its concepts in the concept graph. Multiple edges can exist between a pair of nodes if the concepts co-occurred in more than one abstract.
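As a minimal illustration of this construction, the following plain-Python sketch (not the implementation used in this work) builds the timestamped edge list clique by clique and filters it to obtain Gt:

```python
# Minimal sketch of the timestamped concept multigraph: each abstract
# contributes a clique over its concepts, and G_t is obtained by
# filtering the edge list on the publication timestamp.
from itertools import combinations

def add_abstract(edges, concepts, timestamp):
    """Append the clique of an abstract's concepts as timestamped edges.
    Parallel edges accumulate naturally when pairs co-occur repeatedly."""
    for u, v in combinations(sorted(set(concepts)), 2):
        edges.append((u, v, timestamp))

def subgraph_edges(edges, t):
    """Edges of G_t: all co-occurrences published up to and including t."""
    return [(u, v, ts) for u, v, ts in edges if ts <= t]
```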

To enrich the nodes of our concept graph with semantic information, we calculated concept embeddings and used them as node features. Figure 4 summarizes the procedure of calculating concept embeddings using MatSciBERT17. First, both the entire abstract and the previously extracted concepts are tokenized, and the embeddings are then calculated for the tokenized abstract. The next step consists of locating all instances of a concept in the abstract and averaging the embeddings of the tokens corresponding to the concept. For example, the concept ‘mechanical stress’ is tokenized as [4487, 1893], and its embedding is calculated as the average of the corresponding representations in the embedded abstract at the positions of the sequence [4487, 1893] in the tokenized abstract. To derive a singular representation for each concept per abstract, we average the embeddings of all its occurrences. In cases where a concept does not appear verbatim in an abstract—for example, owing to the normalization processes during the initial concept extraction—we take the mean embedding of all tokens in the abstract (excluding the start and end tokens) as its representation. As the final step, we calculated the average embedding for identical concepts across different abstracts to obtain a single embedding for each concept and thus for each node. To prevent information leakage, embeddings used for training and testing were computed only from text available up to the corresponding cutoff year.
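The per-abstract pooling step can be sketched as follows, assuming the special start and end tokens have already been stripped from the inputs; token ids and dimensions are illustrative:

```python
# Sketch of concept-embedding pooling: average the token embeddings at
# every position where the concept's token sequence occurs in the
# tokenized abstract; fall back to the abstract mean if it never occurs.
import numpy as np

def concept_embedding(abstract_tokens, token_embeddings, concept_tokens):
    """abstract_tokens: list of token ids; token_embeddings: (n, d) array
    aligned with abstract_tokens; concept_tokens: token ids of the concept."""
    n, k = len(abstract_tokens), len(concept_tokens)
    spans = [
        range(i, i + k)
        for i in range(n - k + 1)
        if abstract_tokens[i : i + k] == concept_tokens
    ]
    if not spans:  # concept absent verbatim: mean over all abstract tokens
        return token_embeddings.mean(axis=0)
    occurrence_means = [token_embeddings[list(s)].mean(axis=0) for s in spans]
    return np.mean(occurrence_means, axis=0)
```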

Link prediction

The previous method used by Krenn et al. to predict new links in their concept graph relied exclusively on abstract local graph properties, either through a purely graph-theoretical approach using hand-crafted features in conjunction with ML or through end-to-end ML methods12,14. Here we investigate whether integrating semantic knowledge about the concepts can improve link prediction. In particular, we use concept embeddings, that is, high-dimensional vectors that capture semantic information, to make this knowledge integrable into the link prediction task.

Given the concept graph G, we treat link prediction as a binary classification task. Thus, the objective of the ML model is to predict whether a new edge is formed between an arbitrary pair of previously unconnected vertices (u, v) in the time range T = [Tstart, Tend]. We chose Tstart,train = 2017 and Tend,train = 2019 for training, which means that our model had access to the entire data up to and including 2016 while its predictions were made for the years 2017, 2018 and 2019. We illustrate link prediction on a rudimentary concept graph in Extended Data Fig. 4.
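Label generation for this classification task can be sketched as follows (a simplified illustration, not the exact implementation): a candidate pair unconnected before the cutoff receives label 1 if an edge appears within the prediction window, and 0 otherwise.

```python
# Sketch of label derivation for link prediction: pairs already connected
# before t_start are excluded; remaining pairs are labelled by whether an
# edge forms in the prediction window [t_start, t_end].
def make_labels(edges, candidate_pairs, t_start, t_end):
    """edges: iterable of (u, v, year); candidate_pairs: unordered 2-tuples."""
    before = {frozenset((u, v)) for u, v, t in edges if t < t_start}
    during = {frozenset((u, v)) for u, v, t in edges if t_start <= t <= t_end}
    labels = {}
    for u, v in candidate_pairs:
        pair = frozenset((u, v))
        if pair in before:
            continue  # already connected: not a valid candidate
        labels[(u, v)] = 1 if pair in during else 0
    return labels
```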

The prediction task has a strong inherent label imbalance, as the likelihood of a randomly selected vertex pair (u, v) forming a link within 3 years is extremely low. For example, of the 18.7 billion possible new edges that could form between 2017 and 2019, only 1.3 million new edges (0.007%) were observed during this period. To address this imbalance, we oversampled positive labels by using a fixed percentage (30%) of positive examples per batch during training. This oversampling shifts the trade-off between precision and recall in imbalanced tasks towards higher recall at the cost of precision, thus favouring larger sets of suggestions that may contain inspiring concept combinations over smaller sets that might omit valuable ideas.
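A minimal sketch of such a batch sampler follows; the batch size is an illustrative assumption, while the 30% positive fraction matches the value stated above:

```python
# Sketch of the fixed-fraction oversampling: each training batch draws a
# set share of positive pairs regardless of their true (rare) prevalence.
import random

def sample_batch(positives, negatives, batch_size=64, pos_fraction=0.3):
    """Draw a batch with a fixed fraction of positive examples (with
    replacement, since positives are far scarcer than negatives)."""
    n_pos = int(batch_size * pos_fraction)
    batch = random.choices(positives, k=n_pos) \
          + random.choices(negatives, k=batch_size - n_pos)
    random.shuffle(batch)
    return batch
```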

A modified version of Krenn et al.’s densely connected NN14, which relies purely on graph properties at different points in time, was used as the Baseline model. Specifically, the degree of a node u (\({\sum }_{i=1}^{n}{A}_{u,i}\)) and the sum of all 2-length paths from u (\({\sum }_{i=1}^{n}{A}_{u,i}^{2}\)) were calculated for different years in the range t = [Tstart,train − 5, Tstart,train − 1], where At denotes the binary adjacency matrix of Gt. These features were then concatenated for a given pair of nodes (u, v) to result in a 20-dimensional baseline feature vector. In the second Concept embeddings (MatSciBERT) model, the concatenated concept embeddings of u and v were used instead as the (1,536-dimensional) feature vector for the NN classifier, to test their information content and relevance for the link prediction task. We repeat the embedding generation process with BERT, yielding a modified Concept embeddings (BERT) model. To explore another way of utilizing semantic information, we fine-tuned the MatSciBERT model directly to predict the likelihood of two concepts becoming connected in our concept graph, thus yielding the Pure Text Baseline model.
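The baseline feature construction can be sketched with NumPy as follows (a simplified illustration; with five snapshot years and two features per node, the pair vector has 2 × 2 × 5 = 20 entries):

```python
# Sketch of the 20-dimensional baseline feature vector: node degree and
# 2-length-path counts from the binary adjacency matrices of five
# snapshot years, concatenated for a pair of nodes (u, v).
import numpy as np

def node_features(adjacency_by_year, u):
    """For each yearly binary adjacency matrix A, compute the degree of u
    (row sum of A) and the number of 2-length paths starting at u
    (row sum of A @ A)."""
    feats = []
    for A in adjacency_by_year:
        feats.append(A[u].sum())        # degree
        feats.append((A @ A)[u].sum())  # 2-length paths
    return feats

def pair_features(adjacency_by_year, u, v):
    return np.array(node_features(adjacency_by_year, u)
                    + node_features(adjacency_by_year, v))
```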

The Baseline model was then combined with the Concept embeddings model in two hybrid models, the first of which (Combination of features) used a concatenation of the feature vectors from the two previous models as the input. The hyperparameters of all NNs were optimized using a comprehensive grid search varying the number of neurons per layer, the percentage of positive samples in each batch, the learning rate and the dropout probability (see Supplementary Table 9 for a list of optimized hyperparameters).

The second hybrid model (Mixture of Baseline and Embeddings) uses a weighted output of the optimized Baseline and Concept embeddings (MatSciBERT) models, where an optimal weighting of 3:2 was determined using hyperparameter optimization. We acknowledge that many other hybrid models are possible beyond concatenating the two input vectors or averaging the output probabilities; for example, the two parts of the input could be passed through a first set of layers separately before the two outputs are concatenated and passed through a second set of layers. However, optimizing the architecture of the NN was outside the scope of this study, as our goal was mainly to show that including the concept embeddings improves link prediction.
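The output-level mixture reduces to a weighted average of the two models' predicted probabilities; a minimal sketch with the 3:2 weighting stated above:

```python
# Sketch of the output-level ensemble: a weighted average of the two
# models' predicted probabilities (3:2 weighting from the hyperparameter
# optimization described in the text).
def mixture_probability(p_baseline, p_embeddings, w_baseline=3, w_embeddings=2):
    """Weighted average of two predicted link probabilities."""
    return (w_baseline * p_baseline + w_embeddings * p_embeddings) \
           / (w_baseline + w_embeddings)
```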

To explicitly capture local neighbourhood structures through message passing, we implemented a GNN model. Given the large-scale and hub-heavy nature of the network, we employed neighbour sampling to enable efficient training, initializing the node representations with the topological vectors from the Baseline model. This architecture utilizes a 2-layer GraphSAGE encoder46 with neighbour sampling to compute node embeddings based on the graph topology at Tstart,train. A multilayer perceptron decoder was employed to classify the concatenated node embeddings.
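A single GraphSAGE layer with mean aggregation can be sketched in plain NumPy as follows; this is a conceptual illustration with placeholder weights and pre-sampled neighbour lists, not the two-layer implementation used in this work:

```python
# Conceptual sketch of one GraphSAGE layer with mean aggregation:
# h_v' = ReLU(h_v @ W_self + mean(h_neighbours) @ W_neigh).
# Weights are placeholders; neighbours[v] lists sampled neighbour indices.
import numpy as np

def sage_layer(H, neighbours, W_self, W_neigh):
    """H: (n, d) node features; returns updated (n, d') node embeddings."""
    out = np.zeros((H.shape[0], W_self.shape[1]))
    for v in range(H.shape[0]):
        nbrs = neighbours[v]
        agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(H.shape[1])
        out[v] = H[v] @ W_self + agg @ W_neigh
    return np.maximum(out, 0.0)  # ReLU
```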

Finally, we constructed a Mixture of GNN and Embeddings model as a third hybrid model, analogous to the Mixture of Baseline and Embeddings approach described above. This ensemble calculates a weighted average (1:1) of the output probabilities from the GNN and the Concept embeddings (MatSciBERT) models.

To avoid overfitting, we monitored the area under the receiver operating characteristic curve (AUC) on a potentially out-of-distribution validation set with Tstart,validation = 2020 and Tend,validation = 2022.
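The AUC can be computed directly from the ranking of validation scores via the Mann–Whitney statistic; a minimal sketch (an illustration of the metric, not the evaluation code used):

```python
# Sketch of AUC as the Mann-Whitney statistic: the probability that a
# randomly chosen positive example scores higher than a randomly chosen
# negative one, counting ties as one half.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```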

Human domain experts

The human domain experts who participated in the interviews were affiliated with different institutes at different institutions and were recruited to cover a wide range of topics within materials science. Of the 13 researchers who were invited, 10 agreed to participate in the study and the interviews. All participants were professors or independent group leaders. No interviews were excluded from the study.