Introduction

In recent years, various methods that leverage large language models (LLMs)1,2,3 have been actively studied and developed to generate research ideas4,5,6,7, solutions8, and hypotheses9,10,11 for specific problems. To enable problem-solving based on specialized knowledge, approaches such as fine-tuning with domain-specific datasets12,13 and utilizing retrieval-augmented generation (RAG)8,14,15 have been explored to enhance expertise. Furthermore, research has explored enhancing the diversity and feasibility of generated solutions by facilitating interactions between multiple expert models with domain-specific knowledge6,16,17. LLM-based approaches have also been applied in materials science and materials development18,19. In addition, there are reports on the use of knowledge graphs to harness networks of specialized knowledge for generating research ideas and hypotheses5,20. More recently, studies have demonstrated the ability of these techniques to draft entire research papers21. These examples demonstrate that the use of LLMs to generate solutions for a wide range of problems is already an active area of research.

However, the development of advanced methodologies for generating effective solutions to intrinsically challenging problems remains in its early stages. “Intrinsically challenging problems” refer to issues that require the integration of knowledge and methodology from seemingly unrelated fields to reach effective solutions. Prominent examples of such problems and their solutions include CRISPR-Cas922 for genetic engineering (Nobel Prize 2020)23, Green Fluorescent Protein24 for live-cell imaging (Nobel Prize 2008)25, and functional Magnetic Resonance Imaging for brain-activity mapping26,27,28. The application of indium–gallium–zinc–oxide (IGZO) technology, from flat-panel displays to Dynamic Random Access Memory29,30,31,32, which is treated as a target problem in this study, serves as a representative example of such challenges. Addressing these problems is difficult with existing methodologies because identifying the appropriate domain of expertise distant from the target problem is inherently challenging. These problems may arise not only when the relevant domain of expertise is objectively distant from the target problem, but also when it is perceived as distant due to researchers’ or disciplines’ preconceived notions. Although there have been studies using LLMs for research hypothesis generation4,5,6,7, generating effective solutions for interdisciplinary problems remains an open challenge. This is because, as noted in a previous study4, the generated hypotheses often lack diversity, and there has been little effort to incorporate a broad range of knowledge that is unrelated to the target hypothesis theme. As a potential approach applicable to these problems, some studies have proposed leveraging knowledge from fields unrelated to the target problem15,20. These methods provide LLM-based systems with domain knowledge or combinable information using tools such as knowledge graphs or retrieval systems. However, when combining knowledge through such systems, there remain concerns that important knowledge may not be selected depending on how the information is retrieved, or that the pre-stored knowledge may not include what is truly necessary to solve the problem. Therefore, the challenge of missing critical knowledge or fields essential for problem-solving remains inadequately addressed.

In this study, a brute-force approach called Solution Enumeration via Comprehensive List and LLMs (SELLM) is proposed to generate valuable solutions to inherently challenging problems using LLMs. The core principle of SELLM is to construct a comprehensive set of domain-specific “experts” using LLMs, with each expert providing specialized solutions from the perspective of its respective field. By establishing a comprehensive set of experts with appropriate granularity, it becomes possible to generate effective solutions that integrate seemingly unrelated knowledge and methodologies without omissions. To achieve this, SELLM utilizes structured lists based on Mutually Exclusive, Collectively Exhaustive (MECE) principles, such as the International Patent Classification (IPC) system or the periodic table of elements, to generate solutions that incorporate domain-specific expertise. The main advantage of this approach is that it transforms domain coverage into a structured enumeration task, enabling systematic solution generation without relying on model fine-tuning or complex retrieval architectures. Additionally, SELLM can be easily applied to different problem types by changing only the underlying list structure.

To evaluate the effectiveness of the proposed method, SELLM was applied to two challenging problems: improving the light extraction in organic light-emitting diode (OLED) lighting33,34 and developing electrodes for next-generation memory materials32. The former problem involved the inefficient light extraction from OLED lighting devices, where the high refractive indices of organic layers and transparent electrodes limited external light emission to around 20%. It was addressed by combining glass-layer formation techniques, incorporating lens glass materials from optical lenses and glass frit paste used for front dielectric layers in plasma televisions. The latter problem involved the contact resistance in IGZO-based thin-film transistors (IGZO-TFTs), which hinders their practical application in high-speed, high-capacity memory devices due to mobility degradation and increased power consumption. This issue was addressed by employing palladium, a catalytic metal with high hydrogen permeability and the ability to dissociate hydrogen molecules, as a contact material. Although the solutions to these problems may seem straightforward once presented, their development requires significant expertise across different fields. This complexity makes these problems inherently challenging. The results demonstrated that SELLM, by utilizing IPC subclass lists and chemical element lists, facilitates the generation of effective solutions compared with cases with no specific customization or effort. These findings highlight the potential of effectively leveraging LLMs to propose essential solutions by integrating knowledge from seemingly unrelated fields. Although SELLM has the limitation of producing redundant or less relevant ideas due to its comprehensive nature, it has the potential to generate interdisciplinary solutions for challenging problems.

Results and discussion

Overview of the proposed method and evaluation strategy

Figure 1 illustrates the framework of SELLM. The input consisted of a description of the problem to be solved and a list of knowledge elements, each accompanied by an explanation. The list was expected to maintain an appropriate level of granularity suitable for its field of expertise while ensuring sufficiently broad coverage. In addition, minimizing overlap between elements helped reduce unnecessary redundancy. Subsequently, an expert was generated for each element using the technique of role-play prompting35 from the provided list. Then, these experts generated solutions to the given problem based on their specialized knowledge. Further details are provided in “Methods”.
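The flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `KnowledgeElement` type, the prompt wording, and the `generate` callable (a stand-in for any LLM call) are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class KnowledgeElement:
    name: str          # e.g. an IPC subclass symbol or an element symbol
    explanation: str   # descriptive text accompanying the element


def build_expert_prompt(element: KnowledgeElement) -> str:
    # Hypothetical role-play wording; the paper's actual prompt is not reproduced here.
    return (
        f"You are an expert in {element.name}. "
        f"Background on your field: {element.explanation}"
    )


def enumerate_solutions(
    problem: str,
    elements: List[KnowledgeElement],
    generate: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Ask one role-played expert per list element for a solution."""
    results = []
    for element in elements:
        system_prompt = build_expert_prompt(element)
        user_prompt = (
            "Propose a solution to the following problem "
            f"from the perspective of your field:\n{problem}"
        )
        results.append(
            {"expert": element.name, "solution": generate(system_prompt, user_prompt)}
        )
    return results


def echo_generate(system_prompt: str, user_prompt: str) -> str:
    # Offline stand-in for an LLM call so that the sketch runs without network access.
    return f"[{system_prompt.split('.')[0]}] would answer here."


elements = [
    KnowledgeElement("Pd", "Palladium, a catalytic metal."),
    KnowledgeElement("Pt", "Platinum, a noble metal."),
]
solutions = enumerate_solutions(
    "Reduce contact resistance in IGZO-TFTs.", elements, echo_generate
)
```

Passing the model call in as a function keeps the enumeration loop independent of any particular provider, so the same loop could in principle be reused when only the solution-generation model changes.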

Fig. 1: Overview of the proposed method (SELLM) and its evaluation.

The framework of SELLM takes as input a problem statement and a structured list of knowledge elements with explanations. Role-playing prompting is used to generate domain-specific experts for each element, which then provide candidate solutions to the problem. The generated solutions are subsequently assessed using similarity-based evaluation (SBE), keyword-based evaluation (KBE), and human-based evaluation (HBE).

Three evaluation methods were employed to assess the appropriateness of the generated solutions: similarity-, keyword-, and human-based evaluation (SBE, KBE, and HBE, respectively). In SBE, leveraging the LLM-as-a-Judge36 approach, the similarity between the generated and reference solutions described in the literature was rated on a scale from 1 to 10. For KBE, key terms essential to problem-solving were scored by counting their occurrence in the solutions. These key terms were chosen from existing literature, focusing on those related to domains different from the target problem and relevant to effective problem-solving. In HBE, solutions with diverse SBE scores were further evaluated by human experts. This evaluation aimed to determine whether the solutions constituted genuinely effective solutions, irrespective of their alignment with the reference solution. The criterion of HBE scores was whether the solution effectively addressed the challenge by leveraging knowledge from different fields. The evaluators assigned HBE scores while being blinded to the SBE scores and also provided justifications for their evaluations. Further details of SBE, KBE, and HBE scoring methods are described in “Methods”.
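As a rough illustration of the keyword-based evaluation, the sketch below assigns one point per distinct keyword found in a solution. The exact counting rule is an assumption (the text only states that occurrences are counted), and the keywords and solution text are hypothetical placeholders rather than the entries of Table 1.

```python
from typing import Iterable


def kbe_score(solution: str, keywords: Iterable[str]) -> int:
    """Count how many distinct keywords appear in the solution (case-insensitive)."""
    text = solution.lower()
    return sum(1 for keyword in keywords if keyword.lower() in text)


# Hypothetical keywords and solution text, for illustration only.
keywords = ["glass frit", "lens glass"]
solution = "Form a scattering layer from glass frit paste on a high-index substrate."
score = kbe_score(solution, keywords)
```

Plain substring matching is the simplest possible choice; it would miss spelling variants or synonyms, so a practical version might normalize terms first.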

Here, we briefly introduce the two challenging problems addressed in this study and their respective solutions. The first problem is the efficient extraction of light from OLED devices used for lighting. In the 2010s, a major issue was the high refractive indices (ranging from 1.7 to 2.0) of the organic layers and transparent electrodes in OLED devices, resulting in only approximately 20% of the emitted light being externally extracted (Fig. 2a)37,38,39. Despite significant efforts by industries and academia to resolve this issue, developing a low-cost and stable manufacturing method proved challenging40,41,42. Eventually, a solution was proposed by combining glass-layer formation techniques, using glass materials for optical lenses and glass frit paste for front dielectric layers in plasma televisions (Fig. 2b)33. This solution is known as the KIller technology for Waveguide and Interference of OLED light (KIWI) technology. As patents and publications regarding this solution are publicly available, recent pre-trained models have likely been trained on this information. However, as demonstrated in “Generating solutions to the light-extraction problem in OLED lighting” section, generating essential solutions without specific guidance remains challenging.

Fig. 2: An overview of the challenging problems and solutions addressed in this study.

a Efficient light extraction from OLED devices for lighting was hindered by the high refractive indices of organic layers and transparent electrodes. b This challenge was addressed using KIWI technology, which combines glass-layer formation techniques involving lens glass materials from optical lenses and glass frit paste for forming front dielectric layers in plasma televisions. c Overview of the contact resistance issue in indium–gallium–zinc–oxide-based thin-film transistors (IGZO-TFTs) and its solution using palladium electrodes. Adapted with permission from Shi et al.32 under the terms of the Creative Commons Attribution License 4.0 (CC BY).

The second problem addressed the contact-resistance issue in IGZO-based thin-film transistors (IGZO-TFTs), which are promising candidates for next-generation memory materials capable of achieving high speed and large capacity. These devices use IGZO, an amorphous oxide semiconductor, as the primary material (Fig. 2c)32. The practical application of IGZO in memory devices requires the implementation of fine wiring and electrodes. However, challenges such as mobility reduction and increased power consumption owing to contact resistance have emerged. Various approaches have been proposed to solve these issues43,44,45. Recently, a solution was found using palladium, a catalytic metal with high hydrogen permeability and the ability to dissociate hydrogen molecules32. This solution, detailed in a 2024 paper, is unlikely to have been included in the training data of the pre-trained models used in this study.

Generating solutions to the light-extraction problem in OLED lighting

We conducted experiments to verify the effectiveness of SELLM using the two case studies. As the first case, we evaluated SELLM’s capability to generate solutions for the light-extraction problem. The problem statement, reference solution, and keywords are provided in Table 1, along with a list of subclasses from IPC sections B, C, F, G, and H. For comparison, solutions were generated using the same problem statement without specific adjustments (standard approach). To balance performance and cost, OpenAI’s GPT-4o-2024-08-06 (GPT-4o) was used as the representative LLM to generate experts and solutions. Details of the explanatory descriptions and generation conditions are provided in “Methods”.

Table 1 Problem statement, reference solution, and keywords for the light-extraction problem

Figures 3a and 3b show the distributions of evaluation scores for SBE and KBE, respectively. Figure 3a demonstrated that while the standard approach yielded solutions with a maximum score of 6, SELLM successfully generated solutions with higher scores of 8 and 9, which were closer to the reference solution. This was particularly intriguing given that KIWI technology was publicly available and likely included in the LLM training data, yet the standard approach failed to generate viable solutions without specific adjustments. Similarly, Fig. 3b showed that the standard approach produced solutions with a maximum KBE score of only 1, whereas SELLM generated solutions with a KBE score of 2. This result indicates that some solutions generated by SELLM contain key terms that are important for problem-solving and are thus considered promising. Examples of high-scoring solutions for both SBE and KBE are presented in Table 2.

Fig. 3: Evaluation of solutions generated using Solution Enumeration via a Comprehensive List and Large Language Models (SELLM) for the light-extraction problem.

a Distribution of the SBE scores for solutions generated by SELLM and the Standard approach, presented on a logarithmic probability scale. b Distribution of the KBE evaluation scores, also presented on a logarithmic probability scale. c Relationship between the SBE and HBE scores for a subset of generated solutions evaluated using HBE.

Table 2 Examples of solutions generated by SELLM with high SBE and KBE scores for the light-extraction problem

Next, an evaluation using HBE was conducted. Among the solutions generated by SELLM, three were selected for each SBE score, ranging from 1 to 10, and were subsequently evaluated using HBE. Supplementary Table S1 lists the evaluated solutions and their corresponding HBE scores. Figure 3c illustrates the relationship between the SBE and HBE scores. The Pearson correlation coefficient between the SBE and HBE scores was approximately 0.84, while the Cohen’s kappa coefficient was 0.125. Despite the high correlation coefficient, the kappa coefficient was relatively low, which is thought to be due to the adoption of 10 grading levels in the scoring criteria. It should also be noted that SBE and HBE are based on different evaluation criteria. Overall, a correlation between the SBE and HBE scores was observed, suggesting that the SBE, as assessed by the LLM, reasonably reflected the effectiveness of the solutions. As highlighted in Table 3, the SELLM approach successfully generated highly accurate solutions with an HBE score of 9. For example, the second solution shown in Table 3 has an SBE score of 9 but an HBE score of 7, as it proposes a promising solution but includes unnecessary steps. Conversely, the third solution demonstrates an example where the HBE score exceeds the SBE score, achieving an HBE score of 9. These results indicated that the solutions generated by SELLM were generally deemed highly valid by human experts, and that the evaluation strategy employed was effective.
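The gap between a high correlation and a low kappa can be reproduced with a small sketch. It assumes unweighted Cohen's kappa, which the text does not specify, and uses made-up scores rather than the study's data: when one rater is consistently one grade above the other on a multi-level scale, the linear relationship is perfect but exact agreement never occurs.

```python
from collections import Counter
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa: only exact matches count as agreement."""
    n = len(a)
    observed = sum(1 for u, v in zip(a, b) if u == v) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability that both raters pick the same grade independently.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up scores, not the paper's data: the second rater is always one point higher.
sbe = [1, 2, 3, 4, 5, 6, 7, 8, 9]
hbe = [2, 3, 4, 5, 6, 7, 8, 9, 10]
r = pearson(sbe, hbe)          # close to 1: the scores move together
kappa = cohen_kappa(sbe, hbe)  # below 0: the scores never match exactly
```

A weighted kappa, which gives partial credit to near-misses, would be one way to reduce this sensitivity to the number of grading levels.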

Table 3 Examples of solutions with relatively high HBE scores for the light-extraction problem

Generating solutions to the problem of contact resistance in IGZO-TFTs

Figure 4 illustrates the solution generation results for the IGZO-TFTs challenge using SELLM. The problem statement is provided in Table 4, and the lists used include subclasses from the IPC sections B, C, F, G, and H, as well as a list of 83 elements ranging from hydrogen to bismuth. GPT-4o was also used to generate experts and solutions, considering the balance between performance and cost. Figures 4a and 4b show the distributions of the evaluation scores for SBE and KBE, respectively. The reference solutions for the SBE scores and keywords for the KBE are listed in Table 4. From Fig. 4a, it was evident that the maximum SBE score for the standard approach was 4, whereas SELLM successfully generated solutions with significantly higher scores. Moreover, the element list tended to produce higher-scoring solutions than the IPC subclass list. Similarly, Fig. 4b indicates that SELLM generates solutions with higher KBE scores, further suggesting that the element list is more effective than the IPC subclass list, as shown by the SBE results. This difference can be explained by the fact that palladium (Pd), which was essential in the reference solution, was explicitly included in the element list and effectively used by the Pd-specific expert in SELLM. In contrast, Pd-related technologies were only sparsely covered in the IPC subclasses, which may have reduced their influence on the generated solutions. Examples of the solutions generated with high SBE and KBE scores are listed in Table 5. These results indicated that SELLM effectively addressed the contact resistance problem in IGZO-TFTs, as it generated solutions closely aligned with the reference solution.

Fig. 4: Evaluation of solutions generated using SELLM for the problem of contact resistance in IGZO-TFTs.

a Distribution of the SBE scores for solutions generated by SELLM and the Standard approach, presented on a logarithmic probability scale. b Distribution of the KBE evaluation scores, also presented on a logarithmic probability scale. c Relationship between the SBE and HBE scores for a subset of generated solutions evaluated using HBE.

Table 4 Problem statement, reference solution, and keywords for the problem of contact resistance in IGZO-TFTs
Table 5 Examples of solutions generated by SELLM with high SBE and KBE scores for the problem of contact resistance in IGZO-TFTs

Next, an evaluation using HBE was conducted. As in the light-extraction case, three solutions were selected for each SBE score from 1 to 10 among those generated by SELLM, and these were evaluated using HBE. Supplementary Table S2 presents the evaluated solutions and their corresponding HBE scores. Figure 4c shows the relationship between the SBE and HBE scores. Similar to the light-extraction problem, a correlation between the SBE and HBE scores was observed overall. The correlation coefficient between the SBE and HBE scores was approximately 0.72, and the Cohen’s kappa coefficient was 0.125. As highlighted in Table 6, SELLM successfully generated highly accurate solutions with an HBE score of 9. Interestingly, there were solutions with a relatively low SBE score of 3 but a comparatively high HBE score of 7. This was particularly noteworthy as it suggested an alternative solution using platinum, which, similar to palladium, could exhibit hydrogen activation properties, although its hydrogen storage and transport capabilities are low. These results demonstrate that SELLM can generate highly valid solutions. It can also propose feasible alternatives that have not been documented in existing research.

Table 6 Examples of solutions with high HBE scores or low SBE scores but relatively high HBE scores for the problem of contact resistance in IGZO-TFTs

Generating solutions with SELLM using open-weight models

To investigate whether SELLM can be effectively utilized with lower-cost models than GPT-4o, we conducted experiments using two open-weight models, Meta’s Llama 3.3 70B46 and DeepSeek V347, both known for their low cost and stable performance, via OpenRouter48. From the experimental settings used in sections “Generating solutions to the light-extraction problem in OLED lighting” and “Generating solutions to the problem of contact resistance in IGZO-TFTs”, we modified only the solution-generation model to use the open-weight alternatives. We also collected results using the standard approach, where solutions were generated without any adjustments. The generated solutions were evaluated using both SBE and KBE. For consistency across experiments, GPT-4o was used as the evaluation model in SBE.

Figures 5a and 5b show the distributions of SBE and KBE scores obtained from the experiments conducted under the same solution-generation conditions for the light-extraction problem in OLED lighting as in “Generating solutions to the light-extraction problem in OLED lighting” section. In the standard approach, high-scoring solutions were likewise not generated using DeepSeek V3 or Llama 3.3 70B. From Fig. 5a, it can be observed that DeepSeek V3 generated solutions evaluated as being close to the reference solution at a frequency comparable to or even higher than that of GPT-4o. Although less frequently, Llama 3.3 70B also produced solutions with high SBE scores, such as 8 or 9. This indicates that, in the case of the light-extraction problem, effective solutions that could not be generated using the standard approach were successfully produced using SELLM, even with open-weight models. Similarly, Fig. 5b supports the same conclusion: although less frequently than GPT-4o, both DeepSeek V3 and Llama 3.3 70B were able to generate solutions that received a score of 2 in the KBE evaluation. Examples of the solutions generated with high SBE and KBE scores are listed in Table 7.

Fig. 5: Evaluation of solutions generated using SELLM for the light-extraction problem with open-weight models.

a Distribution of the SBE scores for solutions generated by SELLM and the Standard approach, presented on a logarithmic probability scale. b Distribution of the KBE evaluation scores, also presented on a logarithmic probability scale.

Table 7 Examples of solutions generated by SELLM with high SBE and KBE scores for the light-extraction problem with open-weight models

Figures 6 and 7 show the distributions of SBE and KBE scores from experiments conducted under the same conditions for the problem of contact resistance in IGZO-TFTs as in “Generating solutions to the problem of contact resistance in IGZO-TFTs” section. Figure 6 shows the results when the IPC subclass list was used to construct expert lists, while Fig. 7 presents results when the element list was used instead. Similar to the light-extraction case, both DeepSeek V3 and Llama 3.3 70B generated solutions with high SBE and KBE scores. Examples of the solutions generated with high SBE and KBE scores are listed in Table 8. These findings suggest that even when using the open-weight models with SELLM, effective solution generation is achievable, though the frequency may vary.

Fig. 6: Evaluation of solutions generated using SELLM for the problem of contact resistance in IGZO-TFTs with open-weight models using the IPC list.

a Distribution of the SBE scores for solutions generated by SELLM and the Standard approach, presented on a logarithmic probability scale. b Distribution of the KBE evaluation scores, also presented on a logarithmic probability scale.

Fig. 7: Evaluation of solutions generated using SELLM for the problem of contact resistance in IGZO-TFTs with open-weight models using the element list.

a Distribution of the SBE scores for solutions generated by SELLM and the Standard approach, presented on a logarithmic probability scale. b Distribution of the KBE evaluation scores, also presented on a logarithmic probability scale.

Table 8 Examples of solutions generated by SELLM with high SBE and KBE scores for the contact resistance problem with open-weight models

Effect of temperature on solution generation with SELLM

To investigate the effects of parameter settings, we conducted experiments using SELLM with different temperature values. Based on the experimental conditions described in “Generating solutions to the light-extraction problem in OLED lighting” section, we modified only the temperature parameter of the solution-generation model, setting it to 0.5 and 0.9, respectively. All other settings, including the evaluation models and parameters, remained unchanged. Additionally, as in “Generating solutions to the light-extraction problem in OLED lighting” section, we collected results under the standard condition with varying temperature settings. The generated solutions were evaluated using both SBE and KBE.

Figure 8 shows the distributions of SBE and KBE scores of the generated solutions with different temperature settings for the light-extraction problem. From Fig. 8, it is evident that even when the temperature is changed, the standard approach still fails to generate high-scoring solutions. Figure 8a further indicates that the highest proportion of SBE 9 solutions was generated at a temperature of 0.7, while at a temperature of 0.9, it was more difficult to generate solutions scoring 8 or 9 in SBE compared to when the temperature was set to 0.7. In Fig. 8b, we observe that there is little difference in the KBE score distribution across different temperature settings. These results suggest that although the temperature parameter affects the generated solutions, the default value of 0.7 is adequate for producing effective solutions.

Fig. 8: Evaluation of solutions generated using SELLM for the light-extraction problem under different temperature settings.

a Distribution of the SBE scores for solutions generated by SELLM and the Standard approach, presented on a logarithmic probability scale. b Distribution of the KBE evaluation scores, also presented on a logarithmic probability scale.

Discussion of generating solutions using SELLM

This study proposed SELLM as a brute-force approach for generating solutions to inherently challenging problems using LLMs. One of the main advantages of SELLM is its ability to produce diverse and comprehensive solutions by constructing structured expert lists based on the MECE principle. By covering a wide range of fields, SELLM allows LLMs to explore solution spaces that may otherwise be overlooked, especially in interdisciplinary contexts. The case studies demonstrated that SELLM could generate not only valid solutions, but also unreported and unexpected ones that combined knowledge from seemingly unrelated domains.

While previous studies have demonstrated the utility of fine-tuning1,12,13 and RAG8,14 to enhance domain expertise, SELLM takes a different direction. Previous methods typically focus on deepening expertise within a single domain or enhancing retrieval accuracy. In contrast, SELLM aims to broaden the search space systematically. Unlike earlier approaches that focused on individual domains or limited interactions between expert models6,16, SELLM generated exhaustive and diverse solutions that connected seemingly unrelated fields. This study complements and extends the recent advancements in LLM-based problem-solving18,20 by addressing the limitations of generating solutions for intrinsically challenging problems requiring the integration of distant knowledge.

Although SELLM showed promising results, a limitation of SELLM was its tendency to generate redundant or irrelevant ideas because of the exhaustive nature of its approach. While this redundancy may have been an inherent aspect of fostering innovative ideas, it presented a significant challenge for practical applications. As an example of a method to exclude low-quality solutions, we conducted a preliminary attempt in which the LLM itself evaluated the feasibility of generated solutions (Supplementary Fig. S1). As a result, approximately 30% of low-quality solutions could be filtered out. However, this evaluation approach remains in a preliminary stage and still requires further refinement. When SELLM generates 100 solutions, the current filter can only exclude 30 solutions, while the remaining 70 solutions still require manual selection. Representative examples of solutions assigned low feasibility scores are shown in Supplementary Table S3 for future research. Possible improvements include enhancing the generative model itself through reinforcement learning from human feedback49 or applying stricter evaluation methods, such as chain-of-thought prompting50. In addition, hallucination reduction techniques51 would be effective for finding poor-quality solutions.

From the perspective of hypothesis construction, this study aligned with Charles S. Peirce’s concept of abduction, also referred to as retroduction52,53,54. Abduction, as defined by Peirce, is the process of forming plausible hypotheses to explain or address phenomena. Our results demonstrated that by providing structured guidance, SELLM could generate numerous hypotheses, including practical solutions, which align with Peirce’s framework of abduction. Nevertheless, as Peirce emphasized, abduction involves not only the generation of hypotheses but also their refinement to identify the most valuable ones from a broad pool of candidates. As discussed earlier, scoring and narrowing down generated solutions is a major challenge for future research.

Conclusion

In this study, SELLM, a framework designed to generate solutions for intrinsically challenging problems using LLMs, was developed and evaluated. By employing structured lists, such as IPC subclasses and chemical elements, SELLM created domain-specific experts capable of addressing problems that required the integration of knowledge from diverse fields. Its effectiveness was demonstrated through two complex challenges: improving light extraction in OLED lighting and addressing contact resistance in IGZO-TFTs. The results demonstrated that SELLM generated solutions with higher SBE and KBE scores compared with those generated without specific adjustments. Furthermore, expert evaluations confirmed that SELLM produced valuable solutions. Indeed, these included effective, cross-disciplinary solutions generated by experts in fields unrelated to the target problem. These findings suggested SELLM could address inherently challenging problems by comprehensively and systematically integrating domain knowledge. Moreover, the results suggested that high-quality specialized solutions could be generated by leveraging the knowledge already stored within LLMs, without the need for methods that supplement LLMs with external domain-specific knowledge, such as RAG or fine-tuning. Nevertheless, incorporating these methods or a hybrid of them55 could further improve the quality of generated solutions and reduce the cost of generation, especially when applied to smaller models. Even if the generated solutions were imperfect, presenting a list of possible solutions could inspire human users, foster advanced ideas, and enable the refinement or further development of incomplete concepts.

A key advantage of SELLM is its ability to control solution generation by selecting appropriate lists, such as company-specific technologies or laboratory resources. This flexibility allowed for tailored applications across industries and research domains. For example, SELLM could be used to identify potential uses for proprietary materials, suggest applications for underutilized technologies, or explore unexpected combinations of available resources. In addition, integrating advanced filtering mechanisms and interactive feedback loops with human experts could further enhance the utility of SELLM. Future research should focus on developing efficient methods for evaluating and ranking the generated solutions, reducing noise while preserving the breadth of innovation. In addition, by integrating SELLM into autonomous research frameworks such as “The AI Scientist”21 or “Agent Laboratory”56, it is expected that the generated solutions could be further refined and utilized, thereby enhancing the overall capabilities of such agents.

Methods

Lists of specific knowledge and their explanations

SELLM generated a list of experts from a set of terms consisting of concepts and domain knowledge, and output solutions using the expert list. The list of knowledge and concepts must be sufficiently comprehensive and granular to ensure the inclusion of concepts that contribute to solution generation. In addition, excessive overlap between concepts within a list could lead to redundant and similar responses. Therefore, employing an MECE list was preferable whenever feasible.

We used a list of subclasses from the IPC and a list of chemical elements. The IPC served as a representative system for structurally organizing technologies across various fields, providing a sufficiently comprehensive and nonoverlapping list of technologies. We selected the IPC list because it is a large-scale and MECE classification system for technologies, which makes it suitable for generating technically diverse and effective solutions to a target problem. The patent classification consists of sections A–H. While all sections and subclasses could be used, sections B (performing operations; transporting), C (chemistry; metallurgy), F (mechanical engineering; lighting; heating; weapons; blasting), G (physics), and H (electricity) were selected to limit the computational cost. The numbers of subclasses in sections B, C, F, G, and H were 170, 87, 99, 87, and 54, respectively, totaling 497 subclasses. Additionally, we selected the list of chemical elements, which is considered both fundamental and important not only in technological but also in scientific contexts. The element list is also MECE. For the list of chemical elements, 83 element symbols, ranging from hydrogen (element 1) to bismuth (element 83), were included, as they were deemed practically applicable under standard conditions.
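As a quick consistency check on the list sizes reported above, the following minimal sketch recomputes the totals. The variable and function names are illustrative assumptions and are not taken from the SELLM implementation; only the counts come from the text.

```python
# Subclass counts per selected IPC section, as reported in the text.
IPC_SUBCLASS_COUNTS = {"B": 170, "C": 87, "F": 99, "G": 87, "H": 54}

# Atomic numbers 1 (hydrogen) through 83 (bismuth), as reported in the text.
ELEMENT_ATOMIC_NUMBERS = list(range(1, 84))


def list_sizes() -> dict:
    """Return the number of concepts in each list (one expert per concept)."""
    return {
        "ipc_subclasses": sum(IPC_SUBCLASS_COUNTS.values()),
        "elements": len(ELEMENT_ATOMIC_NUMBERS),
    }


print(list_sizes())
```

Each entry in either list corresponds to one expert, so these totals also give the number of experts created per pass over a list.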

To ensure the accurate generation of experts reflecting the concepts in the list, descriptive texts were prepared for each concept. For the IPC, descriptions were generated based on subclass symbols, including subclass titles, references, and the symbols and titles of the main groups and subgroups associated with each subclass, thereby providing explanations of the technologies relevant to each subclass. Similarly, descriptive texts were generated for each element. These descriptive texts were generated using GPT-4o, with the prompts used for their creation listed in Supplementary Table S4.

Generation of experts and solutions

From the prepared comprehensive list, domain-specific expert LLMs were systematically created for each technological field. Specifically, expert LLMs were generated by instructing a solution-generating LLM to adopt expert roles through role-play prompts, as outlined in Supplementary Table S5. Each expert LLM was tasked with generating ten solutions. To ensure a stable evaluation, the creation of experts and generation of solutions were repeated five times. Owing to the large number of experts required for solution generation, the process incurred significant computational costs and time. To address this, for the light-extraction and IGZO-TFT challenges, the relatively cost-efficient and fast GPT-4o model was used for both expert creation and solution generation. The specific monetary costs incurred for the solution generation and evaluation processes using SELLM are provided in Supplementary Table S6. Under the current experimental setup, a single round of generation over the list cost approximately $15. This cost depends on the LLM used, as well as the list employed in SELLM and the problems described. GPT-4o was accessed via its API using the default parameters, including a temperature setting of 0.7.
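The generation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt template is hypothetical (the actual prompts are in Supplementary Table S5), and `llm` is any user-supplied callable that maps a prompt to a list of solution strings, so no particular API is assumed. Only the counts of ten solutions per expert and five repetitions come from the text.

```python
from typing import Callable, Dict, List

N_SOLUTIONS_PER_EXPERT = 10  # from the text
N_REPEATS = 5                # from the text


def build_expert_prompt(concept: str, description: str, problem: str) -> str:
    """Hypothetical role-play template; not the prompt used in the study."""
    return (
        f"You are an expert in the following field: {concept}.\n"
        f"Field description: {description}\n"
        f"Problem: {problem}\n"
        f"Propose {N_SOLUTIONS_PER_EXPERT} solutions based on your expertise."
    )


def run_generation(
    concepts: Dict[str, str],
    problem: str,
    llm: Callable[[str], List[str]],
) -> List[dict]:
    """Create one expert per concept and collect its solutions, repeated N_REPEATS times."""
    results = []
    for repeat in range(N_REPEATS):
        for concept, description in concepts.items():
            prompt = build_expert_prompt(concept, description, problem)
            for solution in llm(prompt):
                results.append(
                    {"repeat": repeat, "expert": concept, "solution": solution}
                )
    return results


def fake_llm(prompt: str) -> List[str]:
    """Stand-in for a real model call; returns placeholder strings only."""
    return [f"placeholder solution {i + 1}" for i in range(N_SOLUTIONS_PER_EXPERT)]
```

If every call returned exactly ten solutions, a list of 497 concepts would yield 497 × 10 × 5 = 24,850 candidate solutions over the five repetitions, which illustrates why generation cost scales with the size of the list.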

Evaluations for generated solutions

The generated solutions were evaluated using the following three metrics: SBE, KBE, and HBE. The SBE was evaluated by comparing the generated solutions to reference descriptions of solutions for light extraction and IGZO-TFTs challenges, as provided in Tables 1 and 2. The similarity was assessed based on LLM-as-a-Judge prompting, as detailed in Supplementary Table S7. GPT-4o was used to evaluate similarity scores throughout this study. To improve scoring accuracy, 20 sample solutions were generated for each task and evaluated by human experts. The scores and reasons assigned by the experts were used as reference data for the evaluation process. The scoring for the light-extraction problem is shown in Supplementary Table S8, and for the problem of contact resistance in IGZO-TFTs, it is provided in Supplementary Table S9.

The KBE focused on the presence of keywords deemed critical for addressing these challenges. A list of important keywords for each challenge is provided in Tables 1 and 2. For the evaluation, two groups of keywords were defined for each challenge. The generated solutions were verified to determine whether they contained at least one keyword from each group. A KBE score of 1 or 2 indicated that the solution contained keywords from one or both groups, respectively.
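The KBE scoring rule above can be sketched as a simple keyword check. This is a minimal illustration that assumes case-insensitive substring matching and a score of 0 when neither group is matched; the matching procedure is not specified in the text, and the keywords below are placeholders rather than the keywords listed in Tables 1 and 2.

```python
from typing import Sequence


def kbe_score(solution: str, group_a: Sequence[str], group_b: Sequence[str]) -> int:
    """Return the number of keyword groups (0, 1, or 2) represented in the solution."""
    text = solution.lower()
    # A group counts once if at least one of its keywords appears in the text.
    hit_a = any(keyword.lower() in text for keyword in group_a)
    hit_b = any(keyword.lower() in text for keyword in group_b)
    return int(hit_a) + int(hit_b)


# Placeholder keyword groups, for illustration only.
example_group_a = ["microlens", "scattering"]
example_group_b = ["refractive index"]

print(kbe_score("Scattering layer with a matched refractive index",
                example_group_a, example_group_b))
```

Substring matching is the simplest choice; a stricter variant could match whole words or normalized terms to avoid counting incidental matches.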

For the HBE, three solutions were selected from each SBE metric value range and evaluated by two of the authors, who are experts in materials science. While a larger number of independent evaluators would be desirable to reduce bias in evaluating the generated solutions, it was difficult to recruit experts capable of accurately assessing both the technical validity and the innovativeness of the solutions at the time of the study. Thus, the evaluation in this study was conducted by the two domain experts. The evaluators provided a score and justification for each solution (Supplementary Tables S1 and S2). Importantly, the scoring criteria for HBE were not based on the similarity to the reference solutions used in SBE. Instead, the focus was on whether the solution could address the challenge by leveraging knowledge from different fields. Consequently, solutions with high HBE scores may have included innovative, yet unreported, and effective alternative approaches for solving these challenges.
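The selection of solutions for human evaluation can be sketched as stratified sampling over SBE scores. This is a minimal illustration under two assumptions not stated in the text: that each SBE value range corresponds to one discrete score, and that solutions were picked at random within each range. The function name and data layout are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def sample_per_sbe_range(
    solutions: List[Tuple[str, int]],
    per_range: int = 3,
    seed: int = 0,
) -> Dict[int, List[str]]:
    """Pick up to `per_range` solutions for each SBE score found in the input.

    `solutions` is a list of (solution_text, sbe_score) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    buckets = defaultdict(list)
    for text, score in solutions:
        buckets[score].append(text)
    return {
        score: rng.sample(texts, min(per_range, len(texts)))
        for score, texts in sorted(buckets.items())
    }
```

Sampling the same number of solutions from each range ensures that human evaluation covers low-scoring as well as high-scoring solutions, rather than only those most similar to the reference.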