Abstract
The integration of text-to-image generation capabilities within GPT-4 allows for the convenient creation of various graphics. However, the proficiency of GPT-4 in crafting challenging scientific visuals remains largely unexplored. In this study, we conduct systematic experiments employing multiple prompt engineering techniques with various supplementary materials to generate complex scientific illustrations for environmental studies. The locally enhanced electric field treatment for water disinfection is used as an example to probe GPT-4's general behavior in graphic creation. From the experiments, we find that existing prompt methods struggle with accuracy, modifiability, and reproducibility in scientific image generation. Based on the findings and insights drawn from the extensive experimental results, we develop GPT4Designer, a framework intended to generate scientific images without tedious prompt modifications. Specifically, a simple but surprisingly effective “envision-first” strategy, combining detailed prompting with guided envisioning, is developed in the GPT4Designer framework. This strategy yields images with consistent styles aligned with the initial envisioning, significantly improving modifiability. In addition, by refining the conceptualization phase, we achieve much better control over the output, resulting in both high accuracy and reproducibility. This advancement is not only crucial for environmental scientists seeking to quickly produce engaging and accurate visuals (e.g., in a single step), but also demonstrates the existence of a “chain-of-thought” in image generation, which can inspire further work on the creative application of text-to-image generation models and tools.
Introduction
The creation of visual illustrations for original research has become increasingly important in today’s academic landscape. An eye-catching journal cover or a well-crafted graphic abstract quickly draws the attention of potential readers. Besides, effectively designed graphic illustrations distill complex research concepts, making them more accessible to a wider audience. However, the development of these visuals often demands substantial effort and time from researchers, as it requires additional skills in graphic design tools. Consequently, there is a growing need for an alternative method that enables researchers to quickly design research-related visual illustrations.
To enhance the effectiveness of text-to-image generation, researchers have developed models like DALL-E1 and Stable Diffusion2, and proposed multiple strategies focusing on changing the network structure3,4 and loss function5,6 to improve generation quality and speed. However, the proficiency of these models in crafting complicated scientific visuals remains largely unexplored.
In the field of AI-assisted text-to-image generation, numerous models like DALL-E7,8 and MidJourney9 have been developed. However, these tools are primarily limited to text inputs and do not support other forms of input such as documents or images. To improve generation quality, researchers have proposed various strategies focusing on changing the network structure3,4 and loss function5,6. However, these works require tedious data collection and retraining to adapt to a specific field. Bar-Tal et al.10 introduce an effective method for controllable image generation without retraining by using an open-sourced text-to-image diffusion model. However, the MultiDiffusion framework proposed in their work10 cannot be used to generate complex scientific images with rich and accurate details. Following the same direction of efficient and controllable text-to-image generation10, we use the image generation tool integrated into GPT-4, which incorporates DALL-E 3’s graphic capabilities and can process multi-modal inputs. These features make GPT-4 an ideal candidate for our study, which aims to systematically explore the impact of diverse inputs on graphic generation.
Despite the significant advancements in AI-driven scientific visualization, the potential ethical implications of using AI-generated graphics have not been fully addressed. These concerns include the possible misuse of AI-generated images for misinformation, plagiarism, or other unethical purposes. To mitigate these risks, it is essential to adopt measures such as transparent documentation of the generation process, proper citation of AI tools, and the development of advanced verification technologies to detect misuse. Acknowledging these issues is crucial to ensure the responsible use of AI-generated scientific graphics in academic and industrial settings.
Prompt engineering, a crucial aspect of working with language models, has various approaches such as contextual learning11 and chain-of-thought (CoT) reasoning12,13. Most of the existing work in this domain focuses on text generation, and there is a notable gap in prompt engineering specifically for image generation14, particularly for scientific graphics. In projects like Chat2VIS15, where LLMs are used for data analysis, the prompts are tailored for tasks like data visualization, but they are not suited for creating complex scientific illustrations that often require integrating multiple complex elements. Our study fills this gap by exploring prompt engineering techniques in GPT-4 for generating detailed and precise scientific illustrations, contributing significantly to the field of AI-driven scientific imagery.
The concept of a cognitive CoT12,13 has been validated as a potent tool for enhancing accuracy and control in large language models16. While CoTs have demonstrated improved outcomes by providing a sequential, reasoned path toward a language answer, their potential to guide image generation, especially for scientific purposes, has not been adequately investigated. Various well-defined prompts17 have been proposed to guide AI models in generating desired outputs with fine-grained control and customization of the results. For instance, in text generation, prompts can be used to specify the tone, style, or content of the generated text18,19,20. However, in scientific visualization, there is a lack of well-defined prompt patterns. Although existing prompts can instruct models to create images with specific attributes, such as “pensive young woman at sunset” or “UFO landing”21,22,23, they struggle to generate accurate images with richer detail. Researchers often require extensive fine-tuning of the generated images to match the exact narration with sufficient detail22,24, with many of these attempts ending in failures of different kinds. These endless trials not only make the process time-consuming but also result in low consistency and reliability in the final outputs. Therefore, there is a pressing need to develop effective strategies that address the unique requirements of scientific image generation.
The overarching goal of this study is to critically evaluate GPT-4’s innovative capabilities in generating complex scientific illustrations, focusing on both its capacity and creativity. Two factors should be considered in the topic to be drawn. For capacity measurement, the chosen topic must exhibit inherent complexity, characterized by multiple components and a precise spatial arrangement. The evaluation of these generated images should also be guided by explicit and rigorous criteria, ensuring a comprehensive assessment of the output. Meanwhile, in assessing creativity, it is crucial to devise tasks that preclude GPT-4 from merely retrieving pre-existing figures from its “memorized” knowledge base. Consequently, the task must involve an advanced topic, ideally one with few open-access resources.
Locally Enhanced Electric Field Treatment (LEEFT) is a cutting-edge technology in water disinfection, leveraging configuration design and/or electrode modification to induce irreversible electroporation, thereby inactivating microorganisms25,26,27,28. Specifically, the electrode modification refers to the growth of nanowires perpendicular to the conventional electrode29,30. The electric field strength near the tips of the nanowires is enhanced dramatically, thereby reducing the externally applied voltage31,32,33.
LEEFT disinfection by electrode modification is used as the drawing task because of its high componential and spatial complexity. Furthermore, LEEFT disinfection has predominantly been documented in subscription-based journals, suggesting limited access by ChatGPT. Open-access news coverage and patents are limited as well. Therefore, the task of illustrating “LEEFT disinfection” aligns well with both the capacity and creativity evaluation criteria for GPT-4.
To address the identified challenges, we conduct a comprehensive experimental exploration to critically assess the effectiveness of commonly employed prompts in creating scientific illustrations. This detailed investigation reveals the limitations of prevailing text-to-image methodologies, especially in achieving high accuracy, high modifiability (for enhanced detail control), and high reproducibility (for consistency among different illustrations). Based on the above investigations, we have also identified two insights for leveraging the strengths of current large language models (LLMs):
-
Among all of the multimodal inputs, including detailed textual descriptions, referenced images, and papers, we discovered that pure language prompts without any attachments are markedly the most effective;
-
The conventional CoT strategy and prompt iteration used for language processing are not effective in image generation, even after many rounds of updates and revisions. CoT strategies and prompts specifically designed for image generation are highly needed.
As a tangible application of these findings, we have developed GPT4Designer, a framework that is the first to accurately control LLM-generated graphics. Specifically, the following innovations are developed in our GPT4Designer:
-
Inspired by the CoT in language processing, we devise a novel method for a Language-Mediated Image Generation Chain (LaMIGC). This involves the use of GPT-4 to first generate a textual envisioning of the intended image, which acts as an intermediary, language-mediated guide for the subsequent image generation or fine-tuning process. This approach has demonstrated superior results, offering a concise and efficient pathway from concept to visual representation.
-
Through extensive experiments, we reveal an efficient prompt pattern for GPT-4’s image creation, generally expressed as a group of multiple “Type: Detailed Description” bullet points describing the image. This structure facilitates precision, detail adjustability, and stylistic uniformity in the generated images, marking a significant advancement in the field of scientific image generation.
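To make the “Type: Detailed Description” pattern concrete, the following minimal sketch assembles such a prompt programmatically. The component types and descriptions are illustrative paraphrases of the study's requirements, not the exact wording used in the experiments.

```python
# Minimal sketch: assemble a "Type: Detailed Description" prompt.
# The bullet types and descriptions below are illustrative examples.

def build_prompt(task: str, bullets: dict[str, str]) -> str:
    """Combine typed description bullets into a single image prompt."""
    lines = [task]
    for bullet_type, description in bullets.items():
        lines.append(f"{bullet_type}: {description}")
    return "\n".join(lines)

prompt = build_prompt(
    "Draw a scientific illustration of LEEFT water disinfection.",
    {
        "Components": "nanowires on a plain electrode, live and dead bacteria, "
                      "enhanced electric field",
        "Spatial arrangement": "bacteria floating above dense, needle-like nanowires",
        "Details": "lightning and electric charges at the nanowire tips; "
                   "all components immersed in water",
        "Style": "scientific journal-cover style with a consistent color scheme",
    },
)
print(prompt)
```

Each bullet isolates one type of requirement (components, spatial arrangement, details, style), which is what allows targeted adjustment of a single line without perturbing the rest of the prompt.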
Methods
To ensure the standardization of experimental strategies, we systematically designed and documented all procedures in a structured format. The strategies tested in this study, including "short prompt + step-by-step revision," "detailed prompt + reference materials," and "envision-first + guided revision," followed pre-defined steps for prompt construction, input format, and evaluation. The detailed execution of each experiment is documented in the Supplementary Information (Figs. S1–S16). Additionally, Table S1 and Table S2 provide the judging criteria and individual scores for each generated image.
For scientific illustrations, we adopted a rigorous evaluation framework based on predefined criteria. The specific requirement of the scientific illustration is below. “There should be three components, the nanowires, the bacteria (some are live and some are dead), and enhanced electric field. The nanowires are made of metallic oxides and attached to a plain electrode. The nanowires should be dense, thin, and long, and look like needles. The bacteria are floating above the nanowires. Some of the bacteria are live and some are killed by the electric field. Lighting and electric charges should be placed on the tip of the nanowires. All components are immersed in water.” A previously published journal cover (Fig. 1) can be taken as a reference to the generated image.
A previously published journal cover depicting the LEEFT for water disinfection.
To further verify and test the universality of the proposed method, we also ask GPT-4 to depict a figure based on the Chinese poem “Shu Dao Nan”. Only the name of the poem was provided in the prompt, since GPT-4 could retrieve accurate information about the poem.
Prompt design and reference selection criteria
The framework under consideration comprises two components: image-specific requirements and standardized generative strategies. The image-specific requirements were systematically defined based on the evaluation criteria outlined in Fig. 1, ensuring clarity in composition, accuracy in technical representation, and reproducibility in generated illustrations. The generative strategies (#3-#5) were selected based on widely recognized methodologies in AI-driven image synthesis, enabling a fair and comparative assessment of different prompting techniques.
The selection of reference papers followed a structured approach to ensure methodological rigor and relevance. Initially, all cited works were published within the last three years in subscription-based, non-open-access journals. This criterion was established to ensure that the content represents recent advancements that are inaccessible to ChatGPT’s training data, thus allowing us to evaluate the model’s ability to generate scientific illustrations beyond its pre-existing knowledge base. Secondly, priority was given to papers based on their direct relevance to the study’s focus. The selection process encompassed both review articles and original research, ensuring comprehensive coverage of the topic by balancing foundational insights with novel scientific developments.
This structured approach ensures that the chosen prompt strategies and referenced literature align with contemporary research standards while addressing reproducibility, accuracy, and the broader implications of AI-assisted scientific visualization.
Image creation methods using ChatGPT
OpenAI’s GPT-4 was used for the generation of images because it makes it easy to test combinations of different types of input. The prompts include both the instruction (i.e., draw a figure) and the input data. The input data contains narrations of the components with their features (e.g., "The nanowires are made of metallic oxides and attached to a plain electrode."), attachments for reference (papers or an example scientific image), or both (Fig. 2a). A list of papers used in this study can be found in Table S3. Follow-up prompts focused on the revision of the previous image with more input data and/or context. Multiple images were generated either through the “regenerate” function (intra-chat) or by opening a new chat with the same prompt (inter-chat) to account for randomness and test reproducibility, especially when a positive conclusion was yielded.
Summary of six issues in GPT-4’s creation of scientific images. The benchmark is shown as the target image in Fig. 1. (a) Amnesia. (b) Failure. (c) Inaccurate positioning. (d) Unforeseen components. (e) Chaotic components. (f) A mishmash of styles.
Evaluation methodology
To assess the images produced by GPT-4, we established a comprehensive framework comprising 11 evaluation criteria, focusing on three primary dimensions: modifiability, accuracy, and reproducibility (Tables S1 & S2). Each criterion follows a predefined scoring system (0, 3, 7, or 10 points), ensuring transparency and consistency.
To evaluate reproducibility, we conducted intra-chat and inter-chat generation trials to examine consistency across repeated trials. Standard deviation calculations were used to measure variability (Figs. S10 & S14). Since ChatGPT-integrated DALL-E 3 lacks random seed control, the reproducibility of the model was instead evaluated through statistical analysis and structured prompt strategies.
Methodological rigor and reproducibility
To address concerns regarding potential subjectivity in evaluation criteria and reproducibility, the methodology incorporated a structured framework for image quality assessment, supported by quantitative validation approaches.
Systematic evaluation framework
The image quality evaluation was structured into three dimensions—modifiability, accuracy, and reproducibility—which are further subdivided into 11 specific criteria. These included the positioning and morphology of depicted components (e.g., bacteria, nanowires), texture fidelity, and the representation of abstract concepts such as electric fields (Tables S1 & S2). Each criterion followed a quantitative scoring system (0, 3, 7, or 10 points), with explicit definitions for score levels (e.g., “fully meets requirements,” “partially meets requirements”). This structured scoring framework minimizes ambiguity, ensuring consistency and transparency across all generated images.
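The discrete scoring scheme described above can be sketched as a weighted aggregation over per-criterion scores. The criterion names and weights below are illustrative stand-ins, not the study's actual 11-criterion Table S1; only the {0, 3, 7, 10} scale and the roughly 90/10 technical-to-aesthetic weighting are taken from the text.

```python
# Sketch of the quantitative scoring scheme: each criterion is scored on the
# discrete scale {0, 3, 7, 10}; a weighted sum yields the image's total score.
# Criterion names and weights are illustrative, not the study's exact tables.

ALLOWED_SCORES = {0, 3, 7, 10}

def total_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted total over all criteria; rejects scores off the discrete scale."""
    for criterion, score in scores.items():
        if score not in ALLOWED_SCORES:
            raise ValueError(f"{criterion}: {score} not in {sorted(ALLOWED_SCORES)}")
    return sum(scores[c] * weights[c] for c in scores)

# Example: technical criteria dominate (~90%), aesthetics contribute ~10%.
weights = {"positioning": 0.3, "morphology": 0.3, "electric_field": 0.3,
           "aesthetics": 0.1}
scores = {"positioning": 7, "morphology": 10, "electric_field": 7,
          "aesthetics": 3}
print(round(total_score(scores, weights), 2))  # 7.5 on a 0-10 scale
```

Restricting scores to four discrete levels with explicit definitions is what keeps independent graders consistent; the validation step simply enforces that restriction.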
Validation through repeated experiments and comparative analysis
To ensure the reliability of generated images, we conducted repeated generation trials using multiple prompt strategies under intra-chat (same session) and inter-chat (cross-session) conditions. The reproducibility of these results was assessed using the standard deviation of scores across trials, which quantifies variability in image quality under different generation conditions. Although GPT-4’s image generation was non-deterministic, our "envision-first" strategy significantly reduced inconsistencies by guiding the conceptualization process. This approach significantly improved intra-chat consistency and reduced inter-chat variability, reinforcing the framework’s practical applicability (Figs. S10 and S14).
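The reproducibility measure described above reduces to comparing score dispersion across the two trial conditions. The score values below are fabricated for illustration only; the study's actual per-image scores are in Table S2.

```python
# Sketch of the reproducibility measure: standard deviation of total scores
# across regenerated images, computed separately for intra-chat (same session)
# and inter-chat (new session) trials. All score values are illustrative.
from statistics import mean, stdev

intra_chat_scores = [76, 77, 78, 76, 75, 77]   # "regenerate" within one session
inter_chat_scores = [70, 82, 76, 68, 80, 74]   # same prompt, fresh sessions

for label, trial_scores in [("intra-chat", intra_chat_scores),
                            ("inter-chat", inter_chat_scores)]:
    print(f"{label}: mean={mean(trial_scores):.1f}, "
          f"stdev={stdev(trial_scores):.2f}")

# A lower standard deviation indicates higher reproducibility; the
# envision-first strategy aims to shrink both values.
```

Because the ChatGPT-integrated DALL-E 3 exposes no random seed, this dispersion statistic over repeated trials is the only practical handle on reproducibility.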
Prioritization of scientific utility
The evaluation criteria prioritized scientific accuracy, with 90% of scoring metrics focused on technical parameters (e.g., component positioning, morphological fidelity). Aesthetic quality, contributing only 10% to the total score, was evaluated using objective guidelines (e.g., color scheme uniformity, style consistency) to limit subjective influence. This ensures that generated images primarily aligned with research-oriented requirements.
This multifaceted approach ensured that the evaluation process balances scientific rigor with practical applicability, addressing both technical reproducibility and the reduction of subjective biases.
Results and discussion
Where we are: the limitations and capabilities of traditional prompt engineering using GPT-4
To evaluate the performance of GPT-4 in image generation, we established a comprehensive framework comprising 11 evaluation criteria, focusing on three primary dimensions: modifiability, accuracy, and reproducibility. These criteria are designed to capture key aspects of scientific image quality, including the ability to make precise modifications, the alignment with prompt specifications, and the consistency of regenerated images.
To further validate the consistency of our framework, we conducted multiple regeneration experiments under both intra-chat (same session) and inter-chat (new session) conditions. The reproducibility of results was evaluated using statistical analysis, including standard deviation calculations (Table S2). Notably, the optimized strategies (e.g., Strategies #8 and #9) led to a substantial reduction in randomness, resulting in diminished variability and enhanced reproducibility, as demonstrated in Figs. S10 and S14. Detailed scoring standards and examples are provided in Table 1.
We begin the image generation with different prompt patterns and attachments (Strategies #1-#7). Images generated through conventional strategies consistently scored low, with issues such as amnesia, inaccurate positioning, and inconsistent styles (Table 2). We identify and categorize six primary issues encountered across the seven experiments into three dimensions (Fig. 2). These issues highlight the need for addressing GPT-4’s limitations in creating scientific illustrations.
Modifiability
Modifiability refers to the ability to enact precise alterations to enhance image quality. This capability allows users to refine generated images by modifying, adding, or removing specific components and their characteristics, while preserving the remainder of the image. The degree of modifiability can be quantitatively assessed by tracking the point change across successive revisions. A consistent increase in points after each step-by-step revision indicates a high level of modifiability. Modifiability issues such as “amnesia” and failure to execute modifications as prompted were observed, as summarized in Fig. 2. In successive iterations, new images lose features highlighted in previous versions (Figs. 2a, b). For example, as shown in the Supporting Information (Fig. S12), in successive iterations of an illustration involving nanowires, the features of the wires were incorrectly omitted or altered, demonstrating the limitations of GPT-4 in maintaining feature fidelity.
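The point-change criterion for modifiability can be stated as a simple monotonicity check over the revision history. The revision scores below are illustrative, not data from the study.

```python
# Sketch: quantify modifiability as the score trajectory across revisions.
# A revision sequence shows high modifiability if each step-by-step revision
# preserves earlier gains, i.e., scores never regress.

def is_monotone_improvement(revision_scores: list[int]) -> bool:
    """True if no revision scores lower than the one before it."""
    return all(b >= a for a, b in zip(revision_scores, revision_scores[1:]))

good_run = [42, 55, 61, 61, 70]   # gains preserved at every step
bad_run = [42, 55, 48, 61, 57]    # "amnesia": later revisions lose features
print(is_monotone_improvement(good_run), is_monotone_improvement(bad_run))
```

A regression in this trajectory is exactly the “amnesia” failure mode: a revision that fixes one component while silently dropping a feature established earlier.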
Accuracy
Accuracy is defined as the extent to which the generated image aligns with the specifications of the given prompt. This includes the correct depiction of all necessary components with their appropriate textures, sizes, and spatial relationships. High accuracy is crucial as it negates the need for incremental modifications, thereby achieving the intended result in a single iteration. The accuracy of an image is quantified by assigning higher points for more accurate representations. Issues include:
-
Inaccurate positioning: Instances where the elements are depicted correctly, but their spatial arrangement does not align with the requirements (Fig. 2c).
-
Unforeseen components: The tendency of GPT-4 to introduce elements not specified in the prompt, leading to the inclusion of unintended components (Fig. 2d).
Reproducibility
Reproducibility refers to the consistency of images produced using the “Regenerate” function, in terms of the style and components. We quantify reproducibility by calculating the standard deviation of the points of regenerated images. A lower standard deviation signifies higher reproducibility, which is desirable for scientific accuracy. For example, Strategy #8 achieved a standard deviation of 0.82 across 18 regenerated images (Table 2), demonstrating high reproducibility. This consistency is essential for ensuring scientific rigor and reliability in image generation. Issues include:
-
Chaotic components: The regenerated duplicates from the same prompt display significant variations in features such as components, size, and spatial arrangement, leading to high uncertainty (Fig. 2e). As illustrated in the Supporting Information (Figs. S10 and S12), repeated regenerations of a scientific image resulted in inconsistent arrangements of nanowires and other elements, leading to significant variability in the outputs.
-
A mishmash of styles: There is a noticeable inconsistency in the painting styles among duplicates generated from the same prompt (Fig. 2f).
Strategy #1: “short prompt + step-by-step revision”. (a), (b), (c) The 1st, 8th, and 12th images generated by step-by-step guidance, respectively. (d) The points each image earned in the 12 revisions. (e) The point distribution of six regenerated images of the twelfth version. (More information: Figs. S1 and S2).
Notably, reproducibility is critical in designing frameworks for scientific illustrations. To validate our findings, we employ both intragroup repetition (using the “regeneration” function) and intergroup repetition (initiating a new chat session). Due to space constraints, a full version of the regenerated images is provided in the supplementary material.
What we learn: systematic analysis of conventional strategies
In this section, we explore multiple conventional prompt strategies in a systematic refinement manner to reveal how all these commonly applied methods fail and what we can learn from the experimental results.
Strategy #1: Conventional “step-by-step” guidance
We start by employing the most commonly used prompt improved with “step-by-step” guidance, wherein we begin with a concise prompt, such as “Can you help me to draw a figure to illustrate the disinfection of bacteria by the locally enhanced electric field treatment?” This is followed by specific instructions for modifications like “changing the texture of an element” and “adding components” to gradually achieve the desired outcome.
As shown in Figs. 3 and S1, incremental guidance improves image accuracy after multiple iterations (Fig. 3a–c). However, the process is inefficient and requires excessive manual intervention. Specifically, GPT-4 struggles to precisely identify and modify the specific components or elements mentioned in language instructions. This leads to high variability in painting styles across different revisions. Furthermore, reproducibility issues emerge in the final step, illustrated by the lost consistency in regenerated images (Figs. 3d, e and S2).
Strategy #2: Conventional detailed prompt
To address the modifiability issues in Strategy #1, our hypothesis posits that providing ChatGPT with an exceptionally detailed prompt could result in generating all desired features simultaneously. Such a “very detailed” prompt should encompass not only the approach, purpose, and key components, but also an explicit spatial and feature description of each component (Fig. 4a).
Strategy #2: “detailed prompt”. (a) The requirements used in the detailed prompt to create the image. (b), (c), (d) Three example images produced by the “regenerate” function. (e) The point distribution of six regenerated images. All regenerated images can be found in Fig. S3.
This “detailed prompt” strategy demonstrates significant improvements in image clarity and alignment, outperforming Strategy #1 (Figs. 4 and S3). While some attempts successfully meet all criteria (Fig. 4b), reproducibility issues persisted (Fig. 4c, d). This is evidenced by the high standard deviation in points (76.8 ± 6.8) for the regenerated images using the same prompt (Fig. 4e).
Strategies #3-#5: Conventional supplementary references (papers or images)
We hypothesize that the accuracy issues, such as ChatGPT generating unintended components or misplacing elements, might stem from insufficient information provided to GPT-4. To counteract this, we incorporate scientific papers as additional resources to guide image generation, given their accuracy and detailed descriptions.
Accordingly, we compile a list of LEEFT-related papers (Table S3) to augment GPT-4’s knowledge base. These papers, covering diverse aspects such as the mechanism investigation, electrode design, and system development, were published in subscription-based journals between 2019 and 2023. To assess how effectively ChatGPT assimilates knowledge from these papers, we utilize short prompts to minimize the influence of external narration in the image generation process.
Strategies #3 & #4: Short prompt + five or one paper(s)
We provide GPT-4 with a compilation of five LEEFT-related papers, consolidated into a single PDF file. As shown in Figs. 5a–d & S4, new features such as the layered and tubular structures appeared, but issues such as inaccuracies in nanowire depiction were observed (Fig. 5a). Additionally, the positioning of bacteria is inaccurately rendered; they are not correctly placed on the tips of the nanowires (Fig. 5b).
We speculate that the information from 5 papers might overwhelm ChatGPT. To test this, we experiment by separately feeding three distinct LEEFT-related papers, each with a different focus. The results demonstrate a significant improvement in the accuracy of images generated from a single paper, regardless of which one is used, compared to those derived from five papers (Figs. 5e–h and S5). This observation supports our hypothesis that an overload of information can be counterproductive for image creation.
Strategy #5: Short prompt + image
The strategy of supplementing ChatGPT’s input with scientific papers appears to be beneficial to some extent. This approach, however, does not completely overcome the challenge of reproducibility. To further address this issue, we explore an alternative approach: providing ChatGPT with an example cover image to emulate its style. This is done to guide the AI’s image generation process more visually, rather than solely relying on text-based instructions.
The results are similar to those of Strategy #4 (Figs. 5i–l and S6). This similarity suggests that visual cues, much like the targeted information from a single paper, indeed influence the AI’s image creation. However, despite this promising direction, the attempt to have ChatGPT mimic the style of the example cover is not entirely successful. The generated images still exhibit inconsistencies, particularly in terms of style replication and component accuracy. As a result, while the use of example images provides some guidance, it falls short of effectively resolving the reproducibility issue.
Strategy #6: Conventional detailed prompt + supplementary paper references
The findings from implementing strategy #2 reveal that using a “detailed prompt” approach indeed enhances the quality of the generated images. Additionally, in strategy #4, incorporating a single reference paper improves image accuracy. Given ChatGPT’s robust learning capabilities, we propose a new hypothesis: combining “detailed prompts” with “feeding references” could potentially yield even better results by integrating the detailed narrative guidance alongside reference materials.
While using five papers as references introduces inconsistencies in spatial arrangements, the detailed prompt strategy consistently generates more coherent images (Figs. 6a and S7). When using a single paper, images displayed reasonable stylistic consistency within the same reference, but variations arose when different papers were used (Figs. 6b and S8).
These outcomes are consistent with our earlier experience using the “Short prompt + papers” strategy, reinforcing our hypothesis that an excess of information hinders effective image creation.
Going beyond conventional strategies: innovating with the proposed GPT4Designer framework
Two insights drawn from conventional strategies
After evaluating the outcomes of the seven conventional strategies, we have gleaned two key insights into how to address the challenging MAR problems (i.e., modifiability, accuracy, and reproducibility) altogether.
-
The need for detailed and accurate prompt design: Ensuring that prompts are rich in detail, including features and spatial arrangements, is crucial for improving accuracy. Equally important is refining the language used in communication with ChatGPT. Clear and accurate prompts organized in a way that GPT-4 can understand may help GPT-4 to grasp all necessary information at once, potentially eliminating the need for modifications and thereby addressing issues related to modifiability.
-
Limited use of reference materials: Attempting to have ChatGPT mimic an uploaded image proved ineffective, likely due to GPT-4’s limited image analysis capabilities. Similarly, caution is advised when using scientific papers as references. Uploading too many papers introduces confusion and detracts from the intended focus. GPT-4 may struggle with extraneous details, such as irrelevant experimental descriptions. A more effective approach might be to focus on the pertinent details from these papers, providing ChatGPT with direct and specific language guidance for drawing.
Refining the language in detailed prompts is essential for improving GPT-4’s understanding and execution of image generation tasks. Drawing inspiration from the CoT approach used in linguistic tasks, we develop a novel concept: leveraging GPT-4’s own language style. The idea is to first have GPT-4 articulate how it envisions the image will be structured. We could then use this initial blueprint to make targeted modifications, thereby guiding the image creation process to achieve higher modifiability, accuracy, and reproducibility.
Strategies #8 and #9: innovative “envision first” (#8) followed by “step-by-step” modifications (#9)
Pursuing the above “envision first” idea, we conduct three sets of experiments (Strategies #8–#10) to test it. In the first set, we use short prompts (e.g., “disinfection of bacteria by LEEFT”) and add a unique request: “Can you please first describe your envisioning of this figure?”. This “envision first” approach is tested in two separate chat sessions, along with six instances of image regeneration (Fig. S9 as an example).
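The “envision first” exchange can be expressed as a short chat script. The following is a minimal Python sketch assuming a chat-style LLM API that accepts `role`/`content` message dictionaries; the helper name and the exact wording beyond the quoted envisioning request are illustrative, not prescribed by the framework.

```python
def envision_session(short_prompt: str) -> list[dict]:
    """Chat turns for the "envision first" strategy: ask GPT-4 to describe
    the figure before drawing, then request the image in the same session
    so the envisioned blueprint is retained."""
    return [
        {"role": "user",
         "content": (f"Please draw a scientific figure of: {short_prompt}. "
                     "Can you please first describe your envisioning of this figure?")},
        # <- GPT-4 replies here with its envisioned description ->
        {"role": "user",
         "content": "Now create the image exactly as you envisioned."},
    ]

turns = envision_session("disinfection of bacteria by LEEFT")
```

Keeping both turns in one session is the key design choice: the envisioned description stays in the conversation context, so every regeneration is anchored to the same blueprint.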
Remarkably, this strategy marks the first successful resolution of the reproducibility issue. Within the same chat session, the produced images strictly adhere to the initially envisioned description, regardless of the accuracy of individual elements or their spatial relationships. Furthermore, the styles of the regenerated images are highly consistent (Figs. 7 and S10).
Strategy #8: “short prompt + envision”. (a–c) Three example images. (d) Point distribution of 18 regenerated images. All regenerated images can be found in Fig. S10.
Beyond resolving technical challenges, the GPT4Designer framework also addresses potential ethical concerns associated with AI-generated scientific graphics. Specifically, the use of detailed prompts and step-by-step revisions inherently provides a transparent record of the image generation process, ensuring traceability and reducing the risk of misuse. Additionally, the envisioning step articulates the intended design before generation, further enhancing transparency and reproducibility. These features promote responsible usage by offering a clear pathway from conceptualization to final visualization. To mitigate ethical risks, we recommend users explicitly cite the GPT4Designer framework and describe its role in creating scientific graphics to maintain academic integrity and prevent plagiarism.
However, we observe variability in the envisioning of “disinfection of bacteria by LEEFT” across different chat sessions, leading to divergent image outputs. This variability likely stems from different knowledge being accessed in each unique conversation. The encouraging finding here is that when the initial envisioning aligns, the resulting images share similar styles.
To delve deeper, we explore whether it is the process of envisioning or the content of the envisioned prompt that influences reproducibility. We test this by using GPT’s envisioned response as a prompt in a new chat session to generate an image (Fig. S11 as an example). When the input prompt matches GPT’s envisioned description, the produced images are consistent in both style and components (Fig. 7c). This leads us to conclude that it is indeed the content of the envisioned prompt that is pivotal. A “perfect” prompt, formulated in GPT’s own language, is key to producing an ideal image.
We next address the modifiability-related issues. With GPT’s envisioned description, written in its own language, as a basis, we provide specific instructions for modifying certain elements. For instance, we request a change in context from a petri dish to water. ChatGPT effectively understands these instructions and integrates the modifications into its revisions. Three revision steps are implemented successfully while the other parts remain intact (Figs. 8b, c, and S12).
Strategy #9: “short prompt + envision + step-by-step revision”. (a), (b), (c), and (d) The 1st, 2nd, 3rd, and 4th images generated by step-by-step guidance, respectively. (e) The points each image earned in the 4 revisions. All regenerated images can be found in Fig. S12.
Further, we experiment with adding new components through adjustments in the system prompt. This leads to the successful inclusion of two additional elements, with their spatial arrangement accurately rendered. As shown in Figs. 8d and S12, the addition of nanowires is successful, and the nanowires in all regenerated images meet the requirement. Because GPT-4 restates what to draw at every step, the “amnesia” issue is cured as well. Thus, we conclude that fine-tuning GPT’s own envisioned prompts effectively resolves the modifiability challenges.
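The step-by-step revision loop amounts to appending one targeted instruction per turn, each phrased against the model’s own envisioned description. A minimal sketch under the same assumptions as before (the function name and phrasing are ours):

```python
def revision_turns(edits: list[str]) -> list[dict]:
    """One user turn per revision step: each instruction targets a single
    element and asks the model to restate its full envisioned description,
    which keeps unrelated parts intact and counters "amnesia"."""
    return [
        {"role": "user",
         "content": (f"Keep everything else in your envisioned description "
                     f"unchanged, but {edit}. Restate the full description, "
                     "then redraw the figure.")}
        for edit in edits
    ]

steps = revision_turns([
    "change the context from a petri dish to water",
    "add nanowires on the electrode surface",
])
```

Asking the model to restate the full description at each step is what makes every redraw self-contained, rather than dependent on an increasingly distant original prompt.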
Building on the envision-first approach, GPT4Designer addresses the reproducibility and modifiability issues identified in earlier strategies. For example, as demonstrated in the Supporting Information (Figs. S12 and S14), the envisioned prompt ensured consistent representation of nanowires across iterations. This approach effectively resolved challenges like amnesia and chaotic components, ensuring accurate retention of key features and stylistic uniformity in regenerated images.
Once the issues of modifiability and reproducibility are addressed, highly accurate images can be produced with only a few revisions. The increase in points to approximately 90 after four revisions underscores the high level of accuracy achieved through this refined approach (Fig. 8e).
Strategy #10: Innovative “one-step” GPT4Designer
Although Strategy #9 satisfies all three “MAR” requirements within four revision steps, we aim to go further and reduce the number of revisions, ideally to a single step. Inspired by the superiority of “detailed prompts” in Strategy #2, we further distill our experience into a prompt framework for creating images with high accuracy and reproducibility in fewer iterations. This framework follows the sequence: input a detailed prompt → envision → create images (Figs. 9 and S13). Here, the “detailed prompt” means a detailed description of the image to be drawn, as shown in Fig. 4.
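This one-step sequence can be sketched as a single session script: a system prompt that enforces envisioning, plus one detailed user prompt, with no revision loop. The system-prompt wording below is our own illustrative assumption, not text from the framework.

```python
def one_step_pipeline(detailed_prompt: str) -> list[dict]:
    """One-step sequence of detailed prompt -> envision -> create,
    issued in a single request with no revision loop."""
    return [
        {"role": "system",
         "content": ("You are a scientific illustrator. Before drawing any "
                     "figure, first state your envisioned description of it.")},
        {"role": "user",
         "content": (detailed_prompt + "\nFirst describe your envisioning of "
                     "this figure, then generate the image accordingly.")},
    ]

msgs = one_step_pipeline(
    "A nanowire-modified electrode disinfecting bacteria in water by LEEFT, "
    "with the electric field concentrated at the nanowire tips."
)
```

The detailed-prompt text here is a paraphrase for illustration; in practice it would be a full description of features and spatial arrangements, as in Fig. 4.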
Strategy #10: “detailed prompt + envision”. (a–c) Three regenerated example images. (d) Point distribution of 18 regenerated images. All regenerated images can be found in Fig. S14.
The effectiveness of this framework is evidenced by the results in Fig. 9, where the generated images not only rigorously adhere to the specified requirements but also consistently achieve high scores, indicating their accuracy. Moreover, these images exhibit remarkable uniformity in style, underscoring the framework’s reproducibility.
In a further evaluation, we introduce the concept of a “detailed prompt in GPT’s language”. This entails translating the detailed prompt, originally in human language, into a format that aligns with GPT’s natural language processing style (e.g., the envisioned prompts of GPT-4 in Fig. S13). The envisioned output from this GPT-styled prompt then guides the image creation. Our findings reveal that this method, much like detailed human-language prompts, also leads to high-quality images with the same high degree of reproducibility (Figs. 9c and S14). This validates the universality of our developed GPT4Designer framework.
A straightforward example
The depiction of LEEFT disinfection may not be intuitive to the general scientific community. Therefore, we choose the depiction of the Tang poem “Shu Dao Nan” as a straightforward example to demonstrate the power of GPT4Designer. “Shu Dao Nan” (The Difficult Road to the Country of Shu) is a Tang Dynasty poem written by Li Bai, describing the high mountains and the treacherous journey through the Sichuan region.
We assign ChatGPT three sequential tasks: (1) draw a figure depicting the poem with eagles, (2) replace the eagles with parrots, and (3) replace the parrots with eagles. As shown in Fig. 10, GPT4Designer successfully completes all three tasks with the painting style unchanged, preserving the imposing mountains. In contrast, the conventional prompt engineering approach encounters problems such as amnesia, unforeseen components, and a mishmash of styles. After two revisions, the imposing mountains are lost and replaced by splendid landscapes, while the “eagles” in Fig. 10c show features of the previous colorful parrots. Details can be found in the supplementary Text S2 and Figs. S15 and S16. In a nutshell, GPT4Designer shows higher modifiability, accuracy, and reproducibility than the conventional prompt engineering approach.
General applicability of GPT4Designer
The GPT4Designer framework was originally developed based on LEEFT disinfection. To validate its general applicability, three more examples are examined, including two recently selected Environmental Science & Technology journal covers.
In the journal cover examples, the same detailed prompts (Text S2) are employed across two trials. The difference is that in one trial, the generation of an envisioning is enforced (GPT4Designer), while in the other, the envisioning step is omitted (control). As shown in Fig. 11, the images generated with envisioning (blue columns) closely resemble the original designs, compared with those created without it (red columns). These results underscore the universally high accuracy of the GPT4Designer framework, which has demonstrated its adaptability across both scientific and artistic domains, such as recreating journal covers and illustrating literary concepts (Figs. S15 and S16). By addressing modifiability, accuracy, and reproducibility, GPT4Designer provides a robust tool for diverse applications.
Journal cover examples illustrating the high accuracy of GPT4Designer framework.
Conclusion
In this study, we systematically investigate the capability of GPT-4 to generate scientific images. Our findings reveal that while GPT-4 exhibits proficiency in image creation, it encounters notable challenges in ensuring modifiability, accuracy, and reproducibility when using conventional prompt engineering strategies. These challenges mirror the complexities faced by the human brain when translating abstract concepts into concrete visual representations. To address these issues, we develop GPT4Designer, a novel framework that leverages the “detailed prompt + envision-first” pipeline. This approach, inspired by chain-of-thought prompting, significantly reduces the need for prompt modifications and closely aligns the output with the initial concept description, thereby enhancing the precision and consistency of the generated images. The success of GPT4Designer marks a significant leap in AI, emulating the human mind’s ability to transform abstract concepts into detailed visuals, thereby overcoming traditional AI limitations in scientific image generation. This breakthrough, crucial for scientists needing quick and accurate visual tools, paves the way for future AI advancements that blend AI capabilities with human-like thinking to address complex scientific problems.
Future work should focus not only on improving the technical capabilities of AI-driven frameworks, but also on addressing their ethical implications. This includes developing robust tools to verify the authenticity of AI-generated graphics, creating educational resources for researchers on the responsible use of AI tools, and fostering a culture of transparency and accountability in academic and industrial applications. By incorporating these considerations, we can ensure that the use of AI in scientific visualization is both innovative and ethically sound.
Data availability
All data generated or analysed during this study are included in this published article and its supplementary information files.
References
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I. Zero-shot text-to-image generation. in International Conference on Machine Learning. pp 8821–8831 (2021).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. (2021).
Pykes, K. An introduction to using dall-E 3: Tips, examples, and features. https://www.datacamp.com/tutorial/an-introduction-to-dalle3 (2023).
Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.-H., Murphy, K., Freeman, W. T., Rubinstein, M., et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704 (2023).
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023).
Avrahami, O., Fried, O. & Lischinski, D. Blended latent diffusion. ACM Trans. Graphics (TOG) 42, 1–11 (2023).
David, E. OpenAI releases third version of DALL-E. https://www.theverge.com/2023/9/20/23881241/openai-dalle-third-version-generative-ai (2023).
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)
Ram, M. How to use MidJourney – text to image generation using AI. https://datafloq.com/read/how-use-midjourney-text-to-image (2023).
Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T. MultiDiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113 (2023).
Zhou, H., Wan, X., Proleev, L., Mincu, D., Chen, J., Heller, K., Roy, S. Batch calibration: Rethinking calibration for in-context learning and prompt engineering. arXiv preprint arXiv:2309.17249 (2023).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural. Inf. Process. Syst. 35, 24824–24837 (2022).
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).
Li, J., Tang, T., Zhao, W. X., Nie, J.-Y., Wen, J.-R. Pretrained language models for text generation: A survey. arXiv preprint arXiv:2201.05273 (2022)
Maddigan, P., Susnjak, T. Chat2vis: generating data visualisations via natural language using chatgpt, codex and gpt-3 large language models. IEEE Access (2023).
Zhang, Z., Yao, Y., Zhang, A., Tang, X., Ma, X., He, Z., Wang, Y., Gerstein, M., Wang, R., Liu, G. et al. Igniting Language Intelligence: The Hitchhiker’s Guide From Chain-of-Thought Reasoning to Language Agents. arXiv preprint arXiv:2311.11797 (2023).
White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., Schmidt, D. C. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. (2023).
Lin, S., Wang, W., Yang, Z., Liang, X., Xu, F. F., Xing, E., Hu, Z. Data-to-Text Generation with Style Imitation. in Findings of the Association for Computational Linguistics: EMNLP 2020. pp 1589–1598 (2020).
Luo, G., Han, Y. T., Mou, L., Firdaus, M. Prompt-Based Editing for Text Style Transfer. arXiv preprint arXiv:2301.11997 (2023).
Han, M., Zhang, C., Li, F. & Ho, S.-H. Data-driven analysis on immobilized microalgae system: New upgrading trends for microalgal wastewater treatment. Sci. Total Environ. 852, 158514 (2022).
Oppenlaender, J. Prompt engineering for text-based generative art. arXiv preprint arXiv:2204.13988 (2022).
Oppenlaender, J. The creativity of text-to-image generation. in Proceedings of the 25th International Academic Mindtrek Conference. pp 192–202 (2022).
Yang, D. et al. Cocrystal virtual screening based on the XGBoost machine learning model. Chin. Chem. Lett. 34, 107964 (2023).
Zhu, J. et al. Artificial intelligence-aided discovery of prolyl hydroxylase 2 inhibitors to stabilize hypoxia inducible factor-1α and promote angiogenesis. Chin. Chem. Lett. 34, 107514 (2023).
Zhou, J., Wang, T., Yu, C. & Xie, X. Locally enhanced electric field treatment (LEEFT) for water disinfection. Front. Environ. Sci. Eng. 14, 1–12 (2020).
Zhou, J., Hung, Y.-C. & Xie, X. Making waves: Pathogen inactivation by electric field treatment: From liquid food to drinking water. Water Res. 207, 117817 (2021).
Li, Y. et al. Detection of SARS-CoV-2 based on artificial intelligence-assisted smartphone: A review. Chinese Chem. Lett. 35(7), 109220 (2023).
Zhu, J.-J., Yang, M. & Ren, Z. J. Machine learning in environmental research: Common pitfalls and best practices. Environ. Sci. Technol. 57, 17671–17689 (2023).
Zhou, J., Yu, C., Wang, T. & Xie, X. Development of nanowire-modified electrodes applied in the locally enhanced electric field treatment (LEEFT) for water disinfection. J. Mater. Chem. A 8, 12262–12277 (2020).
Zhu, J.-J., Jiang, J., Yang, M. & Ren, Z. J. ChatGPT and environmental research. Environ. Sci. Technol. 57, 17667–17670 (2023).
Zhou, J., Wang, T. & Xie, X. Locally enhanced electric field treatment (LEEFT) promotes the performance of ozonation for bacteria inactivation by disrupting the cell membrane. Environ. Sci. Technol. 54, 14017–14025 (2020).
Mo, F. et al. Decoupling locally enhanced electric field treatment (LEEFT) intensity and copper release by applying asymmetric electric pulses for water disinfection. Water Res. X 21, 100206 (2023).
Zhong, S. et al. Machine learning: New ideas and tools in environmental science and engineering. Environ. Sci. Technol. 55, 12741–12751 (2021).
Acknowledgement
We acknowledge the financial support of the project “Real-time T&O compound detection based on machine learning approaches” from Shenzhen Polytechnic University (Project number: 6025310003K).
Author information
Contributions
All authors contributed equally to the work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Cite this article
Gao, J., Shi, Y., Wang, R. et al. A methodology for designing accurate, modifiable and reproducible scientific graphics in environmental studies using GPT4Designer. Sci Rep 15, 21643 (2025). https://doi.org/10.1038/s41598-025-00300-2