Introduction

The creation of visual illustrations for original research has become increasingly important in today’s academic landscape. An eye-catching journal cover or a well-crafted graphic abstract quickly draws the attention of potential readers. Moreover, effectively designed graphic illustrations distill complex research concepts, making them more accessible to a wider audience. However, the development of these visuals often demands substantial effort and time from researchers, as it requires additional skills in graphic design tools. Consequently, there is a growing need for an alternative method that enables researchers to quickly design research-related visual illustrations.

To enhance the effectiveness of text-to-image generation, researchers have developed models like DALL-E1 and Stable Diffusion2, and proposed multiple strategies focusing on changing the network structure3,4 and loss function5,6 to improve generation quality and speed. However, the proficiency of these models in crafting complicated scientific visuals remains largely unexplored.

In the field of AI-assisted text-to-image generation, numerous models such as DALL-E7,8 and MidJourney9 have been developed. However, these tools are primarily limited to text inputs and do not support other forms of input such as documents or images. To improve generation quality, researchers have proposed various strategies focusing on changing the network structure3,4 and loss function5,6. However, these approaches require tedious data collection and retraining to adapt to a specific field. Bar-Tal et al.10 introduced an effective method for controllable image generation without retraining, built on an open-source text-to-image diffusion model. However, the MultiDiffusion framework proposed in that work10 cannot generate complex scientific images with rich and accurate details. Following the same direction of efficient and controllable text-to-image generation10, we use the image generation tool of GPT-4, which integrates DALL-E 3’s graphic capabilities and can process multi-modal inputs. These features make GPT-4 an ideal candidate for our study, which aims to systematically explore the impact of diverse inputs on graphic generation.

Despite the significant advancements in AI-driven scientific visualization, the potential ethical implications of using AI-generated graphics have not been fully addressed. These concerns include the possible misuse of AI-generated images for misinformation, plagiarism, or other unethical purposes. To mitigate these risks, it is essential to adopt measures such as transparent documentation of the generation process, proper citation of AI tools, and the development of advanced verification technologies to detect misuse. Acknowledging these issues is crucial to ensure the responsible use of AI-generated scientific graphics in academic and industrial settings.

Prompt engineering, a crucial aspect of working with language models, has various approaches like contextual learning11 and chain-of-thought (CoT) reasoning12,13. Most of the existing work in this domain focuses on text generation, and there is a notable gap in prompt engineering specifically for image generation14, particularly for scientific graphics. In projects like Chat2VIS15, where LLMs are used for data analysis, the prompts are tailored for tasks like data visualization, but they are not suited for creating complex scientific illustrations that often involve integrating multiple complex elements to be drawn. Our study fills this gap by exploring prompt engineering techniques in GPT-4 for generating detailed and precise scientific illustrations, contributing significantly to the field of AI-driven scientific imagery.

The concept of a cognitive CoT12,13 has been validated as a potent tool for enhancing accuracy and control in large language models16. While CoTs have demonstrated improved outcomes by providing a sequential, reasoned path toward a language answer, their potential to guide image generation, especially for scientific purposes, has not been adequately investigated. Various well-defined prompts17 have been proposed to guide AI models in generating desired outputs with fine-grained control and customization of the results. For instance, in text generation, prompts can be used to specify the tone, style, or content of the generated text18,19,20. However, in scientific visualization, there is a lack of well-defined prompt patterns. Although existing prompts can instruct models to create images with specific attributes, such as “pensive young woman at sunset” or “UFO landing”21,22,23, they struggle to generate accurate images with finer details. Researchers often require extensive fine-tuning of the generated images to match the exact narration with sufficient detail22,24, with many of these attempts ending in failures with different errors. These endless trials not only make the process time-consuming but also result in low consistency and reliability in the final outputs. Therefore, there is a pressing need to develop effective strategies that address the unique requirements of scientific image generation.

The overarching goal of this study is to critically evaluate GPT-4’s innovative capabilities in generating complex scientific illustrations, focusing on both its capacity and creativity. Two factors should be considered in the topic to be drawn. For capacity measurement, the chosen topic must exhibit inherent complexity, characterized by multiple components and a precise spatial arrangement. The evaluation of these generated images should also be guided by explicit and rigorous criteria, ensuring a comprehensive assessment of the output. Meanwhile, in assessing creativity, it is crucial to devise tasks that preclude GPT-4 from merely retrieving pre-existing figures from its “memorized” knowledge base. Consequently, the task must involve an advanced topic, ideally one with few open-access resources.

Locally Enhanced Electric Field Treatment (LEEFT) is a cutting-edge technology in water disinfection, leveraging configuration design and/or electrode modification to induce irreversible electroporation, thereby inactivating microorganisms25,26,27,28. Specifically, the electrode modification refers to the growth of nanowires perpendicular to the conventional electrode29,30. The electric field strength near the tips of the nanowires is enhanced dramatically, thereby reducing the externally applied voltage required31,32,33.

LEEFT disinfection by electrode modification is used as the drawing task because of its high componential and spatial complexity. Furthermore, LEEFT disinfection has predominantly been documented in subscription-based journals, suggesting limited exposure in ChatGPT’s training data. Open-access news press and patents are limited as well. Therefore, the task of illustrating “LEEFT disinfection” aligns well with both the capacity and creativity evaluation criteria for GPT-4.

To address the identified challenges, we conduct a comprehensive experimental exploration to critically assess the effectiveness of commonly employed prompts in creating scientific illustrations. This detailed investigation reveals the limitations of prevailing text-to-image methodologies, especially in achieving high accuracy, high modifiability (for enhanced detail control), and high reproducibility (for consistency among different illustrations). Based on the above investigations, we have also identified two insights for leveraging the strengths of current large language models (LLMs):

  • Among all of the multimodal inputs, including detailed textual descriptions, referenced images, and papers, we discovered that pure language prompts without any attachments are markedly the most effective;

  • The conventional CoT strategy and prompt iteration used in language processing are not effective in image generation, even after many rounds of updates and revisions. CoT methods and prompts specially designed for image generation are highly needed.

As a tangible application of these findings, we have developed GPT4Designer, a framework that is the first to accurately control LLM-generated graphics. Specifically, the following innovations are developed in our GPT4Designer:

  • Inspired by the CoT in language processing, we devise a novel method for a Language-Mediated Image Generation Chain (LaMIGC). This involves the use of GPT-4 to first generate a textual envisioning of the intended image, which acts as an intermediary, language-mediated guide for the subsequent image generation or fine-tuning process. This approach has demonstrated superior results, offering a concise and efficient pathway from concept to visual representation.

  • Through extensive experiments, we reveal an efficient prompt pattern for GPT-4’s image creation, generally expressed as a group of multiple “Type: Detailed Description” bullets describing the image. This structure facilitates precision, detail adjustability, and stylistic uniformity in the generated images, which marks a significant advancement in the field of scientific image generation.
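The “Type: Detailed Description” bullet pattern above can be assembled programmatically. The sketch below is illustrative only: the helper name and the component descriptions are hypothetical examples, not part of the GPT4Designer implementation.

```python
# Hypothetical helper illustrating the "Type: Detailed Description" bullet
# pattern; the component names and descriptions below are examples only.
def build_image_prompt(task: str, components: dict[str, str]) -> str:
    """Assemble a prompt as a group of 'Type: Detailed Description' bullets."""
    lines = [task, ""]
    for comp_type, description in components.items():
        lines.append(f"- {comp_type}: {description}")
    return "\n".join(lines)

prompt = build_image_prompt(
    "Draw a scientific illustration of LEEFT disinfection.",
    {
        "Nanowires": "dense, thin, needle-like, attached to a plain electrode",
        "Bacteria": "floating above the nanowires; some live, some dead",
        "Electric field": "lightning and charges placed at the nanowire tips",
    },
)
print(prompt)
```

Keeping each component on its own typed bullet is what makes individual details adjustable without rewriting the whole prompt.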

Methods

To ensure the standardization of experimental strategies, we systematically designed and documented all procedures in a structured format. The strategies tested in this study, including "short prompt + step-by-step revision," "detailed prompt + reference materials," and "envision-first + guided revision," followed pre-defined steps for prompt construction, input format, and evaluation. The detailed execution of each experiment is documented in the Supplementary Information (Figs. S1–S16). Additionally, Table S1 and Table S2 provide the judging criteria and individual scores for each generated image.

For scientific illustrations, we adopted a rigorous evaluation framework based on predefined criteria. The specific requirement of the scientific illustration is below. “There should be three components, the nanowires, the bacteria (some are live and some are dead), and enhanced electric field. The nanowires are made of metallic oxides and attached to a plain electrode. The nanowires should be dense, thin, and long, and look like needles. The bacteria are floating above the nanowires. Some of the bacteria are live and some are killed by the electric field. Lighting and electric charges should be placed on the tip of the nanowires. All components are immersed in water.” A previously published journal cover (Fig. 1) can be taken as a reference to the generated image.

Fig. 1: A previously published journal cover depicting the LEEFT for water disinfection.

To further verify and test the universality of the proposed method, we also asked GPT-4 to depict a figure based on the Chinese poem “Shu Dao Nan”. Only the name of the poem was provided in the prompt, since GPT-4 could retrieve accurate information about the poem.

Prompt design and reference selection criteria

The framework under consideration comprises two components: image-specific requirements and standardized generative strategies. The image-specific requirements were systematically defined based on the evaluation criteria outlined in Fig. 1, ensuring clarity in composition, accuracy in technical representation, and reproducibility in generated illustrations. The generative strategies (#3–#5) were selected based on widely recognized methodologies in AI-driven image synthesis, enabling a fair and comparative assessment of different prompting techniques.

The selection of reference papers followed a structured approach to ensure methodological rigor and relevance. First, all cited works were published within the last three years in subscription-based, non-open-access journals. This criterion was established to ensure that the content represents recent advancements absent from ChatGPT’s training data, thus allowing us to evaluate the model’s ability to generate scientific illustrations beyond its pre-existing knowledge base. Second, priority was given to papers based on their direct relevance to the study’s focus. The selection process encompassed both review articles and original research, ensuring comprehensive coverage of the topic by balancing foundational insights with novel scientific developments.

This structured approach ensures that the chosen prompt strategies and referenced literature align with contemporary research standards while addressing reproducibility, accuracy, and the broader implications of AI-assisted scientific visualization.

Image creation methods using ChatGPT

OpenAI’s GPT-4 was used for the generation of images because it is easy to test combinations of different types of input. The prompts include both the instruction (i.e., draw a figure) and the input data. The input data contains either narrations of the components with their features (e.g., "The nanowires are made of metallic oxides and attached to a plain electrode."), attachments for reference (papers or an example scientific image), or both (Fig. 2a). A list of papers used in this study can be found in Table S3. Follow-up prompts focused on revising the previous image with more input data and/or context. Multiple images were generated either through the “regenerate” function (intra-chat) or by opening a new chat with the same prompt (inter-chat) to account for randomness and test reproducibility, especially when a positive conclusion was yielded.

Fig. 2: Summary of six issues in GPT-4’s creation of scientific images. The benchmark is shown as the target image in Fig. 1. (a) Amnesia. (b) Failure. (c) Inaccurate positioning. (d) Unforeseen components. (e) Chaotic components. (f) A mishmash of styles.

Evaluation methodology

To assess the images produced by GPT-4, we established a comprehensive framework comprising 11 evaluation criteria, focusing on three primary dimensions: modifiability, accuracy, and reproducibility (Tables S1 & S2). Each criterion follows a predefined scoring system (0, 3, 7, or 10 points), ensuring transparency and consistency.

To evaluate reproducibility, we conducted intra-chat and inter-chat generation trials to examine consistency across repeated trials. Standard deviation calculations were used to measure variability (Figs. S10 & S14). Since ChatGPT-integrated DALL-E 3 lacks random seed control, the reproducibility of the model was instead evaluated through statistical analysis and structured prompt strategies.

Methodological rigor and reproducibility

To address concerns regarding potential subjectivity in evaluation criteria and reproducibility, the methodology incorporated a structured framework for image quality assessment, supported by quantitative validation approaches.

Systematic evaluation framework

The image quality evaluation was structured into three dimensions—modifiability, accuracy, and reproducibility—which are further subdivided into 11 specific criteria. These included the positioning and morphology of depicted components (e.g., bacteria, nanowires), texture fidelity, and the representation of abstract concepts such as electric fields (Tables S1 & S2). Each criterion followed a quantitative scoring system (0, 3, 7, or 10 points), with explicit definitions for score levels (e.g., “fully meets requirements,” “partially meets requirements”). This structured scoring framework minimizes ambiguity, ensuring consistency and transparency across all generated images.
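The discrete scoring scheme described above can be illustrated with a short sketch. The criterion names below are hypothetical examples, and the helper is not part of the study’s actual evaluation tooling; it only demonstrates the 0/3/7/10 level validation and summation.

```python
# Illustrative sketch of the discrete per-criterion scoring scheme
# (0, 3, 7, or 10 points); criterion names here are hypothetical.
ALLOWED_SCORES = {0, 3, 7, 10}

def total_score(criterion_scores: dict[str, int]) -> int:
    """Sum per-criterion scores after validating the discrete levels."""
    for name, score in criterion_scores.items():
        if score not in ALLOWED_SCORES:
            raise ValueError(f"{name}: score {score} is not a valid level")
    return sum(criterion_scores.values())

scores = {
    "nanowire morphology": 10,   # fully meets requirements
    "bacteria positioning": 7,   # partially meets requirements
    "electric field depiction": 3,
}
print(total_score(scores))  # sums to 20 for this example
```

Restricting each criterion to four named levels is what keeps scoring consistent across raters and across generated images.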

Validation through repeated experiments and comparative analysis

To ensure the reliability of generated images, we conducted repeated generation trials using multiple prompt strategies under intra-chat (same session) and inter-chat (cross-session) conditions. The reproducibility of these results was assessed using the standard deviation of scores across trials, which quantifies variability in image quality under different generation conditions. Although GPT-4’s image generation was non-deterministic, our "envision-first" strategy significantly reduced inconsistencies by guiding the conceptualization process. This approach significantly improved intra-chat consistency and reduced inter-chat variability, reinforcing the framework’s practical applicability (Figs. S10 and S14).

Prioritization of scientific utility

The evaluation criteria prioritized scientific accuracy, with 90% of scoring metrics focused on technical parameters (e.g., component positioning, morphological fidelity). Aesthetic quality, contributing only 10% to the total score, was evaluated using objective guidelines (e.g., color scheme uniformity, style consistency) to limit subjective influence. This ensures that generated images primarily aligned with research-oriented requirements.

This multifaceted approach ensured that the evaluation process balances scientific rigor with practical applicability, addressing both technical reproducibility and the reduction of subjective biases.

Results and discussion

Where we are: the limitations and capabilities of traditional prompt engineering using GPT-4

To evaluate the performance of GPT-4 in image generation, we established a comprehensive framework comprising 11 evaluation criteria, focusing on three primary dimensions: modifiability, accuracy, and reproducibility. These criteria are designed to capture key aspects of scientific image quality, including the ability to make precise modifications, the alignment with prompt specifications, and the consistency of regenerated images.

In order to provide further validation of the consistency of our framework, multiple regeneration experiments were conducted under both intra-chat (same session) and inter-chat (new session) conditions. The reproducibility of results was evaluated using statistical analysis, including standard deviation calculations (Table S2). It is noteworthy that the optimized strategies (e.g., strategies #8 and #9) led to a substantial reduction in randomness, resulting in diminished variability and enhanced reproducibility, as demonstrated in Figs. S10 and S14. Detailed scoring standards and examples are provided in Table 1.

Table 1 Evaluation criteria and scoring standards for image generation.

We begin the image generation with different prompt patterns and attachments (Strategies #1-#7). Images generated through conventional strategies consistently scored low, with issues such as amnesia, inaccurate positioning, and inconsistent styles (Table 2). We identify and categorize six primary issues encountered across the seven experiments into three dimensions (Fig. 2). These issues highlight the need for addressing GPT-4’s limitations in creating scientific illustrations.

Table 2 Summary of 10 strategies for image creation with corresponding observation and points.

Modifiability

Modifiability refers to the ability to enact precise alterations to enhance image quality. This capability allows users to refine generated images by modifying, adding, or removing specific components and their characteristics, while preserving the remainder of the image. The degree of modifiability can be quantitatively assessed by tracking the point change across successive revisions. A consistent increase in points after each step-by-step revision indicates a high level of modifiability. Modifiability issues such as “amnesia” and failure to execute modifications as prompted were observed, as summarized in Fig. 2. In successive iterations, new images lose features highlighted in previous versions (Figs. 2a, b). For example, as shown in the Supporting Information (Fig. S12), in successive iterations of an illustration involving nanowires, the features of the wires were incorrectly omitted or altered, demonstrating the limitations of GPT-4 in maintaining feature fidelity.

Accuracy

Accuracy is defined as the extent to which the generated image aligns with the specifications of the given prompt. This includes the correct depiction of all necessary components with their appropriate textures, sizes, and spatial relationships. High accuracy is crucial as it negates the need for incremental modifications, thereby achieving the intended result in a single iteration. The accuracy of an image is quantified by assigning higher points for more accurate representations. Issues include:

  • Inaccurate positioning: Instances where the elements are depicted correctly, but their spatial arrangement does not align with the requirements (Fig. 2c).

  • Unforeseen components: The tendency of GPT-4 to introduce elements not specified in the prompt, leading to the inclusion of unintended components (Fig. 2d).

Reproducibility

Reproducibility refers to the consistency of images produced using the “Regenerate” function, in terms of the style and components. We quantify reproducibility by calculating the standard deviation of the points of regenerated images. A lower standard deviation signifies higher reproducibility, which is desirable for scientific accuracy. For example, Strategy #8 achieved a standard deviation of 0.82 across 18 regenerated images (Table 2), demonstrating high reproducibility. This consistency is essential for ensuring scientific rigor and reliability in image generation. Issues include:

  • Chaotic components: The regenerated duplicates from the same prompt display significant variations in features such as components, size, and spatial arrangement, leading to high uncertainty (Fig. 2e). As illustrated in the Supporting Information (Figs. S10 and S12), repeated regenerations of a scientific image resulted in inconsistent arrangements of nanowires and other elements, leading to significant variability in the outputs.

  • A mishmash of styles: There is noticeable inconsistency in the painting styles among duplicates generated from the same prompt (Fig. 2f).
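The reproducibility metric described above, the standard deviation of points across regenerated images, can be computed directly with the standard library. The scores below are hypothetical placeholders, not values from the study.

```python
# Sketch of the reproducibility metric: standard deviation of points
# across regenerated images. The scores below are hypothetical.
from statistics import mean, pstdev

regenerated_points = [76, 77, 75, 78, 76, 77]  # e.g. six intra-chat trials
variability = pstdev(regenerated_points)
print(f"mean = {mean(regenerated_points):.1f}, sd = {variability:.2f}")
```

A lower standard deviation across regenerations indicates higher reproducibility, as used in the comparison of strategies in Table 2.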

Fig. 3: Strategy #1: “short prompt + step-by-step revision”. (a), (b), (c) The 1st, 8th, and 12th images generated by step-by-step guidance, respectively. (d) The points each image earned in the 12 revisions. (e) The point distribution of six regenerated images of the twelfth version. (More information: Figs. S1 and S2.)

Notably, reproducibility is critical in designing frameworks for scientific illustrations. To validate our findings, we employ both intragroup repetition (using the “regeneration” function) and intergroup repetition (initiating a new chat session). Due to space constraints, a full version of the regenerated images is provided in the supplementary material.

What we learn: systematic analysis of conventional strategies

In this section, we explore multiple conventional prompt strategies in a systematic refinement manner to reveal how these commonly applied methods fail and what we can learn from the experimental results.

Strategy #1: Conventional “step-by-step” guidance

We start by employing the most commonly used prompt improved with “step-by-step” guidance, wherein we begin with a concise prompt, such as “Can you help me to draw a figure to illustrate the disinfection of bacteria by the locally enhanced electric field treatment?” This is followed by specific instructions for modifications like “changing the texture of an element” and “adding components” to gradually achieve the desired outcome.
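The step-by-step flow above amounts to a growing chat history: one short initial prompt followed by incremental revision instructions. The sketch below models that conversation structure only; the function name and message format (the common role/content convention) are assumptions for illustration, and no API is called.

```python
# Minimal sketch of Strategy #1's conversation flow: a short initial
# prompt followed by incremental revision instructions. This models the
# chat history only; it does not call any image-generation API.
def revision_history(initial_prompt: str, revisions: list[str]) -> list[dict]:
    messages = [{"role": "user", "content": initial_prompt}]
    for step in revisions:
        messages.append({"role": "user", "content": step})
    return messages

history = revision_history(
    "Can you help me to draw a figure to illustrate the disinfection of "
    "bacteria by the locally enhanced electric field treatment?",
    ["Make the nanowires thinner and denser.",
     "Add electric charges at the tips of the nanowires."],
)
```

Each revision depends on the model correctly locating the named component in the previous image, which is precisely where GPT-4 struggles in the results that follow.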

As shown in Figs. 3 and S1, incremental guidance improves image accuracy after multiple iterations (Fig. 3a–c). However, the process is inefficient and requires excessive manual intervention. Specifically, GPT-4 struggles to precisely identify and modify the specific components or elements mentioned in language instructions. This leads to high variability in painting styles across different revisions. Furthermore, reproducibility issues emerge in the final step, illustrated by the lost consistency in regenerated images (Figs. 3d, e and S2).

Strategy #2: Conventional detailed prompt

To address the modifiability issues in Strategy #1, our hypothesis posits that providing ChatGPT with an exceptionally detailed prompt could result in generating all desired features simultaneously. Such a “very detailed” prompt should encompass not only the approach, purpose, and key components, but also an explicit spatial and feature description of each component (Fig. 4a).

Fig. 4: Strategy #2: “detailed prompt”. (a) The requirements used in the detailed prompt to create the image. (b), (c), (d) Three example images produced by the “regenerate” function. (e) The point distribution of six regenerated images. All regenerated images can be found in Fig. S3.

This “detailed prompt” strategy demonstrates significant improvements in image clarity and alignment, outperforming Strategy #1 (Figs. 4 and S3). While some attempts successfully meet all criteria (Fig. 4b), reproducibility issues persisted (Fig. 4c, d). This is evidenced by the high standard deviation in points (76.8 ± 6.8) for the regenerated images using the same prompt (Fig. 4e).

Strategies #3-#5: Conventional supplementary references (papers or images)

We hypothesize that the accuracy issues, such as ChatGPT generating unintended components or misplacing elements, might stem from insufficient information provided to GPT-4. To counteract this, we incorporate scientific papers as additional resources to guide image generation, given their accuracy and detailed descriptions.

Accordingly, we compile a list of LEEFT-related papers (Table S3) to augment GPT-4’s knowledge base. These papers, covering diverse aspects such as the mechanism investigation, electrode design, and system development, were published in subscription-based journals between 2019 and 2023. To assess how effectively ChatGPT assimilates knowledge from these papers, we utilize short prompts to minimize the influence of external narration in the image generation process.

Strategies #3&#4: Short prompt + 5&1 paper(s)

We provide GPT-4 with a compilation of five LEEFT-related papers, consolidated into a single PDF file. As shown in Figs. 5a–d & S4, new features such as the layered and tubular structures appeared, but issues such as inaccuracies in nanowire depiction were observed (Fig. 5a). Additionally, the positioning of bacteria is inaccurately rendered; they are not correctly placed on the tips of the nanowires (Fig. 5b).

Fig. 5: Strategies #3–#5: “short prompt + reference”. (a–d), (e–h), (i–l) refer to the strategies “short prompt + 5 reference papers”, “short prompt + 1 reference paper”, and “short prompt + sample image”, respectively. All regenerated images can be found in Figs. S4–S6.

We speculate that the information from 5 papers might overwhelm ChatGPT. To test this, we experiment by separately feeding three distinct LEEFT-related papers, each with a different focus. The results demonstrate a significant improvement in the accuracy of images generated from a single paper, regardless of which one is used, compared to those derived from five papers (Figs. 5e–h and S5). This observation supports our hypothesis that an overload of information can be counterproductive for image creation.

Strategy #5: Short prompt + image

The strategy of supplementing ChatGPT’s input with scientific papers appears to be beneficial to some extent. This approach, however, does not completely overcome the challenge of reproducibility. To further address this issue, we explore an alternative approach: providing ChatGPT with an example cover image to emulate its style. This is done to guide the AI’s image generation process more visually, rather than solely relying on text-based instructions.

The results are similar to the strategy #4 (Figs. 5i–l and S6). This similarity suggests that visual cues, much like the targeted information from a single paper, indeed influence the AI’s image creation. However, despite this promising direction, the attempt to have ChatGPT mimic the style of the example cover is not entirely successful. The images generated still exhibited inconsistencies, particularly in terms of style replication and component accuracy. As a result, while the use of example images provides some guidance, it falls short of effectively resolving the reproducibility issue.

Strategies #6&#7: Conventional detailed prompt + supplementary paper references

The findings from implementing strategy #2 reveal that using a “detailed prompt” approach indeed enhances the quality of the generated images. Additionally, in strategy #4, incorporating a single reference paper improves image accuracy. Given ChatGPT’s robust learning capabilities, we propose a new hypothesis: combining “detailed prompts” with “feeding references” could potentially yield even better results by integrating the detailed narrative guidance alongside reference materials.

While using five papers as references introduces inconsistencies in spatial arrangements, the detailed prompt strategy consistently generates more coherent images (Figs. 6a and S7). When using a single paper, images displayed reasonable stylistic consistency within the same reference, but variations arose when different papers were used (Figs. 6b and S8).

Fig. 6: Strategies #6&#7: “detailed prompt + reference”. (a–d) and (e–h) refer to the strategies “detailed prompt + 5 reference papers” and “detailed prompt + 1 reference paper”, respectively. All regenerated images can be found in Figs. S7 and S8.

These outcomes are consistent with our earlier experience using the “Short prompt + papers” strategy, reinforcing our hypothesis that an excess of information hinders effective image creation.

Go beyond conventional strategies: innovating with the proposed GPT4Designer framework

Two insights drawn from conventional strategies

After evaluating the outcomes of the seven conventional strategies, we have gleaned two key insights into how to address the challenging MAR problems (i.e., modifiability, accuracy, and reproducibility) altogether.

  • The need for detailed and accurate prompt design: Ensuring that prompts are rich in detail, including features and spatial arrangements, is crucial for improving accuracy. Equally important is refining the language used in communication with ChatGPT. Clear and accurate prompts organized in a way that GPT-4 can understand may help GPT-4 to grasp all necessary information at once, potentially eliminating the need for modifications and thereby addressing issues related to modifiability.

  • Limited use of reference materials: Attempting to have ChatGPT mimic an uploaded image proved ineffective, likely due to GPT-4’s limited image analysis capabilities. Similarly, caution is advised when using scientific papers as references. Uploading too many papers introduces confusion and detracts from the intended focus. GPT-4 may struggle with extraneous details, such as irrelevant experimental descriptions. A more effective approach might be to focus on the pertinent details from these papers, providing ChatGPT with direct and specific language guidance for drawing.

Refining the language in detailed prompts is essential for improving GPT-4’s understanding and execution of image generation tasks. Drawing inspiration from the CoT approach used in linguistic tasks, we develop a novel concept: leveraging GPT-4’s own language style. The idea is to first have GPT-4 articulate how it envisions the image will be structured. We could then use this initial blueprint to make targeted modifications, thereby guiding the image creation process to achieve higher modifiability, accuracy, and reproducibility.

Strategies #8&#9: Innovative “envision first” (#8) followed by “step-by-step” modifications (#9)

Pursuing the above “envision first” idea, we conduct three sets of experiments (Strategies #8–#10) to validate it. In the first set of experiments, we use short prompts (e.g., “disinfection of bacteria by LEEFT”) and add a unique request: “Can you please first describe your envisioning of this figure?”. This “envision first” approach is tested in two separate chat sessions, along with six instances of image regeneration (Fig. S9 as an example).
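The two stages of this chain, requesting a textual envisioning and then reusing that envisioning as the drawing prompt, can be sketched as simple prompt constructors. The function names are hypothetical, and the `envisioned` string below is a placeholder standing in for the model’s actual reply; no API is called.

```python
# Sketch of the "envision first" chain: stage 1 asks GPT-4 to describe
# its envisioned figure in text; stage 2 reuses that description as the
# drawing prompt. `envisioned` is a placeholder for the model's reply.
def envision_request(topic: str) -> str:
    return (f"{topic} Can you please first describe your envisioning "
            "of this figure?")

def drawing_prompt(envisioned: str) -> str:
    return f"Please draw the figure exactly as described below:\n{envisioned}"

req = envision_request("Disinfection of bacteria by LEEFT.")
envisioned = ("Needle-like nanowires on a plain electrode, with live and "
              "dead bacteria floating above, all immersed in water.")
prompt = drawing_prompt(envisioned)
```

Because the second stage consumes a prompt formulated in the model’s own language, the same chain can also be replayed in a fresh chat session, which is how the content-versus-process question is tested below.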

Remarkably, this strategy marks the first successful resolution of the reproducibility issue. Within the same chat session, the images produced strictly adhere to the initially envisioned description, regardless of the accuracy of elements or their spatial relationships. Furthermore, the styles of the regenerated images are highly consistent (Figs. 7 and S10).

Fig. 7

Strategy #8: “short prompt + envision”. (a–c) Three example images. (d) Point distribution of 18 regenerated images. All regenerated images can be found in Fig. S10.

Beyond resolving technical challenges, the GPT4Designer framework also addresses potential ethical concerns associated with AI-generated scientific graphics. Specifically, the use of detailed prompts and step-by-step revisions inherently provides a transparent record of the image generation process, ensuring traceability and reducing the risk of misuse. Additionally, the envisioning step articulates the intended design before generation, further enhancing transparency and reproducibility. These features promote responsible usage by offering a clear pathway from conceptualization to final visualization. To mitigate ethical risks, we recommend users explicitly cite the GPT4Designer framework and describe its role in creating scientific graphics to maintain academic integrity and prevent plagiarism.

However, we observe variability in the envisioning of “disinfection of bacteria by LEEFT” across different chat sessions, leading to divergent image outputs. This variability likely stems from different knowledge being accessed in each unique conversation. The encouraging finding here is that when the initial envisioning aligns, the resulting images share similar styles.

To delve deeper, we explore whether it is the process of envisioning or the content of the envisioned prompt that influences reproducibility. We test this by using GPT’s envisioned response as a prompt in a new chat session to generate an image (Fig. S11 as an example). When the input prompt matches GPT’s envisioned description, the produced images are consistent in both style and components (Fig. 7c). This leads us to conclude that it is indeed the content of the envisioned prompt that is pivotal. A “perfect” prompt, formulated in GPT’s own language, is key to producing an ideal image.

We next address the modifiability-related issues. With GPT’s envisioned description in its own language as a basis, we provide specific instructions for modifying certain elements. For instance, we request a change in context from a petri dish to water. ChatGPT effectively understands these instructions and integrates the modifications into its revisions. Three revision steps are implemented successfully while the other parts remain intact (Figs. 8b, c, and S12).
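This step-by-step revision loop can be sketched as follows, again with a hypothetical `chat` stand-in for the ChatGPT session (the revision texts are illustrative):

```python
def revise_step_by_step(envisioned, revisions, chat):
    """Sketch of Strategy #9: start from GPT-4's own envisioned description
    and apply targeted revisions one at a time, keeping the rest intact.

    `chat` is a hypothetical stand-in for one ChatGPT session.
    """
    messages = [{"role": "user",
                 "content": "Create an image following this description:\n"
                            + envisioned}]
    images = [chat(messages)]
    # e.g. request = "Change the context from a petri dish to water."
    for request in revisions:
        messages.append({"role": "assistant", "content": images[-1]})
        messages.append({"role": "user",
                         "content": request + " Keep all other parts unchanged."})
        images.append(chat(messages))
    return images
```

Because every turn restates the envisioned baseline plus one targeted change, earlier elements are not forgotten between revisions.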

Fig. 8

Strategy #9: “short prompt + envision + step-by-step revision”. (a–d) The 1st, 2nd, 3rd, and 4th images generated by step-by-step guidance, respectively. (e) The points each image earns across the four revisions. All regenerated images can be found in Fig. S12.

Further, we experiment with adding new components through adjustments in the system prompt. This leads to the successful inclusion of two additional elements, with their spatial arrangement accurately rendered. As shown in Figs. 8d and S12, the addition of nanowires is successful, and the nanowires in all regenerated images meet the requirement. Because GPT-4 restates what to draw each time, the “amnesia” issue is cured as well. Thus, we conclude that fine-tuning GPT’s own envisioned prompts effectively resolves the modifiability challenges.

Building on the envision-first approach, GPT4Designer addresses the reproducibility and modifiability issues identified in earlier strategies. For example, as demonstrated in the Supporting Information (Figs. S12 and S14), the envisioned prompt ensures consistent representation of nanowires across iterations. This approach effectively resolves challenges such as amnesia and chaotic components, ensuring accurate retention of key features and stylistic uniformity in regenerated images.

Once the issues of modifiability and reproducibility are addressed, we can produce highly accurate images with only a few revisions. The significant increase in points to approximately 90 after four revisions underscores the high level of accuracy achieved through this refined approach (Fig. 8e).

Strategy #10: Innovative “one-step” GPT4Designer

Although Strategy #9 satisfies all three “MAR” requirements within four revision steps, we want to go further and reduce the number of revisions, ideally to a single step. Inspired by the superiority of “detailed prompts” in Strategies #2 and #7, we distill our experience into a prompt framework that creates images with high accuracy and reproducibility in fewer iterations. This framework follows a sequence: Input a detailed prompt → Envision → Create images (Figs. 9 and S13). Here, the “detailed prompt” means a detailed description of the image to be drawn, as shown in Fig. 4.
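The one-step pipeline, and the reuse of a saved envisioned description in a fresh session, can be sketched as follows (the `chat` callable is again a hypothetical stand-in for a ChatGPT session, not a real API):

```python
def gpt4designer_one_step(detailed_prompt, chat):
    """Sketch of Strategy #10: detailed prompt -> envision -> create image,
    in a single pass. `chat` is a hypothetical ChatGPT-session stand-in."""
    messages = [{"role": "user",
                 "content": (detailed_prompt +
                             "\nCan you please first describe your envisioning "
                             "of this figure, then create the image strictly "
                             "following that description?")}]
    envisioned = chat(messages)  # GPT-4's restatement in its own language
    messages.append({"role": "assistant", "content": envisioned})
    messages.append({"role": "user", "content": "Please create the image now."})
    return envisioned, chat(messages)


def reproduce(envisioned, chat):
    """Since it is the content of the envisioned prompt that matters, the
    saved description can seed a brand-new session to regenerate a
    stylistically consistent image (cf. Fig. S11)."""
    return chat([{"role": "user",
                  "content": "Create an image following this description:\n"
                             + envisioned}])
```

Saving the envisioned text and replaying it via `reproduce` reflects the finding above that reproducibility travels with the prompt content, not with the chat session itself.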

Fig. 9

Strategy #10: “detailed prompt + envision”. (a–c) Three regenerated example images. (d) Point distribution of 18 regenerated images. All regenerated images can be found in Fig. S14.

The effectiveness of this framework is evidenced by the results in Fig. 9: the generated images not only rigorously adhere to the specified requirements but also consistently achieve high scores, indicating their accuracy. Moreover, these images exhibit remarkable uniformity in style, underscoring the framework’s reproducibility.

In a further evaluation, we introduce the concept of a “detailed prompt in GPT’s language”. This entails translating the detailed prompt, originally in human language, into a format that aligns with GPT’s natural language processing style (e.g., GPT-4’s envisioned prompts in Fig. S13). The envisioned output from this GPT-styled prompt then guides the image creation. Our findings reveal that this method, much like the detailed human-language prompts, also produces high-quality images with the same high degree of reproducibility (Figs. 9c and S14). This validates the universality of our GPT4Designer framework.

A straightforward example

The depiction of LEEFT disinfection may not be intuitive to the general scientific community. Therefore, we choose the depiction of a Tang poem, “Shu Dao Nan”, as a straightforward example to demonstrate the power of GPT4Designer. “Shu Dao Nan” (The Difficult Road to Shu) is a Tang Dynasty poem written by Li Bai. The poem describes the high mountains and treacherous journey through the Sichuan region.

We assign ChatGPT three sequential tasks: (1) draw a figure describing the poem with eagles, (2) replace the eagles with parrots, and (3) replace the parrots with eagles. As shown in Fig. 10, GPT4Designer successfully completes all three tasks with the painting style unchanged, preserving the imposing mountains. Conversely, the conventional prompt engineering approach encounters problems such as amnesia, unforeseen components, and a mishmash of styles. After two revisions, the imposing mountains are lost and replaced by splendid landscapes. Meanwhile, the “eagles” in Fig. 10c show features of the previous colorful parrots. Details can be found in the supplementary Text S2 and Figs. S15 and S16. In a nutshell, GPT4Designer shows higher modifiability, accuracy, and reproducibility than the conventional prompt engineering approach.

Fig. 10

The comparisons between the conventional approach (a–c) and our proposed GPT4Designer (d–f) for a straightforward example of drawing the scene of the famous poem “Shu Dao Nan”. All regenerated images can be found in Figs. S15 and S16.

General applicability of GPT4Designer

The GPT4Designer framework is originally developed based on LEEFT disinfection. To validate its general applicability, three more examples are examined, including two recent Environmental Science & Technology journal covers.

In the journal cover examples, the same detailed prompts (Text S2) are employed across two trials. The difference is that in one trial the generation of an envision is enforced (GPT4Designer), while in the other the envisioning step is omitted (control). As shown in Fig. 11, the images generated with envisioning (blue columns) resemble the original designs much more closely than those created without it (red columns). These results underscore the universally high accuracy of the GPT4Designer framework, which has demonstrated its adaptability across both scientific and artistic domains, such as recreating journal covers and illustrating literary concepts (Figs. S15 and S16). By addressing modifiability, accuracy, and reproducibility, GPT4Designer provides a robust tool for diverse applications.

Fig. 11

Journal cover examples illustrating the high accuracy of GPT4Designer framework.

Conclusion

In this study, we systematically investigate the capability of GPT-4 to generate scientific images. Our findings reveal that while GPT-4 exhibits proficiency in image creation, it encounters notable challenges in ensuring modifiability, accuracy, and reproducibility when using conventional prompt engineering strategies. These challenges mirror the complexities faced by the human brain when translating abstract concepts into concrete visual representations. To address these issues, we develop GPT4Designer, a novel framework that leverages the “detailed prompt + envision-first” pipeline. This approach, inspired by the Chain-of-Thought (CoT) approach, significantly reduces the need for prompt modifications and closely aligns the output with the initial concept description, thereby enhancing the precision and consistency of the generated images. The success of GPT4Designer marks a significant leap in AI, emulating the human mind’s ability to transform abstract concepts into detailed visuals, thereby overcoming traditional AI limitations in scientific image generation. This breakthrough, crucial for scientists needing quick and accurate visual tools, paves the way for future AI advancements that blend AI capabilities with human-like thinking to address complex scientific problems.

Future work should focus not only on improving the technical capabilities of AI-driven frameworks but also on addressing their ethical implications. This includes developing robust tools to verify the authenticity of AI-generated graphics, creating educational resources for researchers on the responsible use of AI tools, and fostering a culture of transparency and accountability in academic and industrial applications. By incorporating these considerations, we can ensure that the use of AI in scientific visualization is both innovative and ethically sound.