Main

The remarkable success of ChatGPT1 spans a wide range of applications, and it has amassed a rapidly expanding user base2,3,4. Its integration into various platforms, such as the Bing search engine5 and Microsoft Office software6, has progressively revolutionized and permeated people’s daily lives and work experiences and further amplified its social impact. As a result, aligning ChatGPT with human values has become a critical requirement for building trustworthy artificial intelligence (AI) tools that can be safely used in different domains7. Researchers have devoted substantial effort to aligning large language models (LLMs)8,9,10 with ethical standards and social norms using training techniques such as reinforcement learning from human feedback (RLHF)11,12,13,14.

However, these alignment techniques are vulnerable to a new type of attack: jailbreak attacks15,16,17,18,19. These attacks enable malicious users to manipulate the outputs of language models by injecting ‘jailbreak’ prompts that bypass ChatGPT’s ethics safeguards and trick the model into generating biased or harmful responses. An example of a jailbreak attack is illustrated in Fig. 1. According to Europol’s Tech Watch Flash report20, jailbreak attacks have the potential to enable a broad range of criminal activities, including fraud, terrorism, cybercrime and more. They can also be used to generate and disseminate misinformation on social media platforms, leading to serious social and political consequences21,22. Such issues call for systematic research on the threats of this new type of attack and defences against it to ensure the trustworthiness and reliability of language models in real-world applications. This research area remains under-explored, and more effort is needed to address the challenges posed by jailbreak attacks.

Fig. 1: An example of a jailbreak attack and our proposed system-mode self-reminder.
figure 1

a, Without a jailbreak, ChatGPT is able to prevent itself from generating harmful responses. b, Jailbreak can bypass the model’s moral alignment by using specific jailbreak prompts to trick ChatGPT into following malicious requests. The jailbreak prompt shown in this figure is from ref. 19. c, We propose the system-mode self-reminder as a simple and effective technique to defend against jailbreak attacks. ChatGPT uses a system prompt to encapsulate the user query and reminds itself to act responsibly.

In this work, we bridge this research gap by putting forth the threats posed by jailbreak attacks and introducing a corresponding effective defence. We begin by constructing a jailbreak dataset that consists of 580 samples, each composed of two orthogonal factors: a jailbreak prompting scheme designed to bypass the moral alignment of ChatGPT and a specific malicious instruction. This dataset covers various existing jailbreak prompts17 and representative potential harmful use cases, including misinformation and toxic instructions identified in Europol’s Tech Watch Flash report20. Afterward, we evaluate ChatGPT, which has been aligned with human values through RLHF, on the created dataset. Unfortunately, it does not guard effectively against carefully crafted jailbreak attacks. Next, we present a comprehensive empirical analysis of several aspects of jailbreak prompts including length, contextual information, tonality, inclusion of exemplars and output stipulations. Finally, we propose a simple and effective defence technique for jailbreak attacks called a system-mode self-reminder, as demonstrated in Fig. 1. We use a system prompt to wrap the user query and make ChatGPT remind itself to process and respond to the query in the context of being a responsible AI.

Our approach is motivated by several factors. First, inspired by the human-like content reasoning process of LLMs23,24,25,26, we draw on psychological research, which proposes self-reminders as a strategy for helping individuals recall or attend to specific tasks, thoughts or behaviours27,28. These self-reminders create mental or external cues that serve as prompts to reinforce memory, promote self-control and facilitate emotional or cognitive regulation29,30. In this work, we aim to apply this psychological self-improvement strategy for human behaviour to the behaviour of LLMs. Second, the emerging abilities of LLMs to perform self-validation and self-correction, as demonstrated in recent studies31,32,33, indicate the possibility of addressing this challenging problem using ChatGPT itself. Third, we draw inspiration from existing jailbreaks, many of which bypass ChatGPT’s moral alignment by guiding it into certain uncontrollable ‘modes’ that will then generate harmful responses. This indicates that ChatGPT is aware of and can be instructed about its current ‘mode’, which in turn defines how it responds to user queries. We hypothesize that if ChatGPT can be prompted with a ‘system mode’ at the outermost level reminding it that it is a responsible AI tool, it will be less susceptible to being maliciously guided by user inputs at the inner level.

We present an empirical evaluation of our self-reminder defence on the constructed jailbreak dataset. Our experimental results demonstrate that by incorporating system prompts to have ChatGPT remind itself to behave as a responsible AI tool, the attack success rate (ASR) of jailbreaks is successfully reduced for state-of-the-art LLMs including ChatGPT (GPT-3.5), GPT-4, Vicuna and Llama-2. Moreover, we analyse our approach by investigating the impact of our method on regular user queries, evaluating its defence efficacy against adaptive attacks and conducting ablation studies. We further propose a systematic framework to automatically generate and optimize the self-reminder defence prompts using LLMs. Self-reminders are a promising first attempt at defending LLMs against jailbreak attacks without requiring further training or model modification. This technique can be easily applied to LLMs and their applications, effectively enhancing their security and safety. Through our research, we aim to promote further improvements in the security and responsibility of AI tools.

Results

Dataset construction

This section details the construction of our jailbreak dataset. It comprises 580 samples formed from a combination of two distinct elements: 58 jailbreak prompts and 10 malicious instructions. An example of such a sample can be seen in Fig. 1. Additionally, to enable automatic prompt optimization, we construct an independent training set. This set consists of 370 prompts formed from a combination of 37 more jailbreak prompts and 10 more malicious instructions.

Jailbreak prompt

The jailbreak prompt is the cornerstone of a jailbreak attack, specifically designed to circumvent the moral alignment and ethical standard of ChatGPT. We use the Jailbreak Chat website19 with its 76 jailbreak prompts as the basic data source. For experimental convenience, we exclude two prompts that require manual processing for different tasks. Then we filter out ineffective jailbreak prompts by testing their ASR against ChatGPT without defence and retaining 58 jailbreak prompts with an ASR greater than 20%. The collection and filtering process for the further 37 jailbreak prompts in the independent training set is detailed in Supplementary Information section 1.2.

Malicious instruction

The malicious instruction corresponds to a specific malicious input designed to elicit a harmful response from the model. We include ten different malicious instructions, each with a unique purpose, as illustrated in Supplementary Table 1. We divide these malicious instructions into two primary categories: misinformation and toxic. The misinformation category includes fake news, concocted information and various deceptive materials that could contribute to misinformation and undermine people’s trust in information sources. The toxic category refers to prompts that engender harmful behaviour, such as writing deceptive emails, creating malicious software, facilitating scams and so on. We investigate how well our method defends against potential adversaries using these malicious instructions to various ends20. The further ten malicious instructions for the training set are detailed in Supplementary Table 5.

Jailbreak prompt analysis

We undertake an extensive study to understand jailbreak attacks, focusing on the nature, attributes and effectiveness of jailbreak prompts on ChatGPT. Fundamentally, jailbreak prompts serve as directives that induce ChatGPT into a mode where it becomes uncontrollable and ‘forgets’ ChatGPT’s policies and ethical standards. Our evaluation categories for these prompts include length, contextual information, tonality, use of examples and the form of the output. Figure 2a depicts the distribution of prompt lengths (in terms of word counts) along with their average ASR. A discernible trend emerges: longer prompts generally have higher ASRs than shorter ones. We believe this is because longer prompts are better able to encapsulate intricate directives and persuasive techniques. Figure 2b,c highlights the use of different types of context: 57 of 58 prompts explicitly incorporate a virtual persona that does not need to follow the usual rules, whereas 16 prompts further introduce a fictional scenario to enhance such ‘freedom’. Prompts with virtual personas and fictional scenarios have a higher ASR. Figure 2d analyses the impact of tone. We found that 26 prompts are written in a warning tone, underscored by directives such as ‘must’ or threats, yet the tone seems to have little or no effect on ASR. We also study the effect of examples (a prompt’s illustrative capacity) on ASR in Fig. 2e, finding that including examples of intended behaviour produces only marginally higher ASRs. Finally, Fig. 2f–h depicts the efficacy of prompts that stipulate the output should take a specific form. Twenty-eight prompts explicitly ask the model not to produce ethics-affiliated content, which can improve ASR by constraining the types of output the model can create, increasing the likelihood of an unethical response. Additionally, 30 prompts ask for output in the form of dual responses—standard output juxtaposed with jailbreak output—but the ASR remains largely unaffected by this bifurcation. A notable set of five prompts were very successful: they ask that the output include an accompanying disclaimer, tricking the model into generating harmful output that would need such a disclaimer. In summary, this empirical analysis of the attributes of successful jailbreak prompts can provide foundational knowledge for future research in jailbreak-related domains and inspire our approach to design defences.

Fig. 2: Analysis of 58 jailbreak prompts.
figure 2

We examine their attributes alongside the average ASR percentage for ChatGPT. Performance is tested five times with the Azure ChatGPT API gpt-3.5-turbo-0301. ah, Prompt count and average ASR: sorted by prompt length (a), categorized by the setting of a virtual persona that is exempt from standard rules (b), categorized by the setting of a fictional scenario (c), categorized on the basis of using a warning tone (d), categorized by the presence of specific dialogical examples (e), categorized by the detailed outlining of constraints against generating ethics-related disclaimers and warnings in output (f), categorized by the specification of dual response roles in output (g) and categorized by the explicit requirement for an associated disclaimer in output (h). w/, with; w/o, without.

Evaluating defence performance

We evaluate the effectiveness of our self-reminder method against jailbreak attacks on our constructed dataset. The ASRs for jailbreak attacks against various LLMs, with and without our defence approach, are presented in Table 1. We make the following observations according to these experimental results. First, existing LLMs differ in their susceptibility to jailbreak attacks. Attacks against ChatGPT (GPT-3.5) have an average success rate of 67.21% across different permutations of jailbreak prompts and malicious instructions. Vicuna, which fine-tunes Llama34 without emphasis on value-alignment during its training process, is even more susceptible (86.69% ASR). Recent LLMs trained with greater emphasis on alignment, such as GPT-4 (ref. 15) and Llama-2 (ref. 35), are more resilient towards jailbreak attacks, particularly those involving toxic malicious instructions. Nevertheless, they are still vulnerable, especially when targeted with prompts aimed at generating misinformation. The continued susceptibility of even the most advanced LLMs to jailbreak attacks reinforces the pressing need for effective defensive countermeasures.

Table 1 ASR percentage of various malicious instructions for LLMs with and without self-reminders

Our self-reminder method consistently reduces the ASR for all tested LLMs. Notably, self-reminders reduce the average ASR of jailbreak attacks against ChatGPT from 67.21% to 19.34% and against GPT-4 and Llama-2 to below 5%. Interestingly, Vicuna, which was not trained to align with human values, does not benefit as much from the self-reminders as the other LLMs. It is consistent with our intuition that only when the model itself has been aligned with human values can our psychologically inspired self-reminder defence help remind it of those values. In summary, the demonstrated efficacy of self-reminders underscores their potential as an effective and generalizable defence mechanism for LLMs against jailbreak attacks.

To better understand the self-reminder’s efficacy in different contexts, we show the ASR for different malicious instructions in Table 1 and the ASR distribution for different jailbreak prompts for ChatGPT in Fig. 3a. We find varying ASRs for different malicious instructions using the same jailbreak prompt. The results indicate that malicious instructions of a ‘toxic’ type are easier to identify and defend against than ‘misinformation’. We expect this may be because (1) they are overtly harmful in nature (and may have been prioritized and addressed more rigorously during the LLM’s initial alignment process) and (2) these instructions often include specific terms with obvious ill-intent, such as ‘blackmail’ (making them easier to detect and counter). We also find that some jailbreak prompts are harder to defend against than others. These difficult-to-defend jailbreak prompts are generally characterized by one or both of the following features: (1) highly detailed instructions with specific attack goals, such as different types of misinformation; and (2) requests that specifically prevent the responses generated by a successful defence, such as requesting not to be reminded that they are interacting with a responsible AI model or asking not to be warned about the potentially harmful response. These findings provide insight into how jailbreak attacks may evolve in the future and how we can develop stronger defence techniques to counter them.

Fig. 3: ASRs for ChatGPT in different scenarios.
figure 3

Performance is tested five times with the Azure ChatGPT API gpt-3.5-turbo-0301. Data are presented as mean values. Smaller ASR indicates better defensive performance against jailbreak attacks. a, Distribution of ASRs of jailbreak attacks with 58 jailbreak prompts for ChatGPT with and without self-reminders. b, Distribution of ASRs of adaptive attacks with ten malicious instructions for ChatGPT defended by self-reminders. c, Distribution of ASRs of jailbreak attacks with ten malicious instructions for ChatGPT defended by prefix-only and suffix-only variants of self-reminders. d, Distribution of ASRs of jailbreak attacks with ten malicious instructions for ChatGPT defended by different tones of self-reminders. e, Distribution of ASRs of jailbreak attacks with ten malicious instructions for ChatGPT without defence and with different variants of self-reminders, including automatically generated prompts (GP), a handcrafted prompt (HP) and an optimized prompt (OP).

Side effects on regular user queries

To substantiate the practical usefulness of the system-mode self-reminder method, we consider the impact of our defence on non-malicious queries. We compare the zero-shot performance of ChatGPT and ChatGPT with self-reminders across several tasks encompassing both natural language understanding and natural language generation.

Table 2 demonstrates the impact of the self-reminder technique on ChatGPT’s performance across various tasks from the General Language Understanding Evaluation (GLUE) benchmark36. Overall, we find that ChatGPT achieves comparable results with and without self-reminders, indicating that the technique does not compromise the functionality for regular user queries on the GLUE benchmark. We then analyse ChatGPT’s responses with formatting restrictions removed and find that ChatGPT with self-reminders provides more reasoning for its answers, acting as if it is ‘rigorously answering after careful consideration’. For instance, when asked about the sentiment of ‘a better movie’ without formatting restrictions, ChatGPT with self-reminders provides a justification along with its answer, ‘positive’.

Table 2 Performance of ChatGPT with and without self-reminders on natural language understanding and generation benchmarks

ChatGPT defended by self-reminders

The word ‘better’ implies that the movie being referred to is an improvement over some other movie or previous version, indicating that it is likely to be more enjoyable or of higher quality. However, without more context or information, it is difficult to determine the specific degree or nature of the positivity.

This property enhances ChatGPT’s performance on certain tasks from the GLUE benchmark, particularly binary classification tasks. This is in line with some previous studies23,24,37 indicating that a greater reasoning process helps LLMs give more accurate answers. Nevertheless, for tasks with a ‘neutral’ option such as MNLI, this further reasoning may lead ChatGPT to report more cautious neutral outcomes in some instances, potentially slightly degrading its performance.

To further explore potential side effects of our self-reminder defence, we evaluate performance using a wide array of natural language generation benchmarks, including CNN/Daily Mail38, XSum39, WMT16 (en-de)40 and SQuAD41. These benchmarks measure performance on tasks as diverse as including text summarization, machine translation and abstractive Question Answering (QA), which are detailed in Table 2. We find that ChatGPT’s performance, with and without self-reminders, is comparable across various tasks and corpora. This result underscores that the self-reminder can enhance ChatGPT’s resilience against jailbreak attacks without undermining its capabilities in these standard natural language generation tasks. Furthermore, no discernible patterns are observed in the responses generated by ChatGPT when using the self-reminder. This indicates that the self-reminder doesn’t bias ChatGPT’s functional outputs because potentially harmful responses are not elicited in these tasks.

Resilience to adaptive attacks

A natural question about the self-reminder defence’s robustness is whether attackers can develop adaptive attacks specifically designed to circumvent it. To address this question, we design two adaptive attacks (as shown in Extended Data Fig. 1) and evaluate the efficacy of our defence in the presence of such attacks. These adaptive attacks further encapsulate their jailbreak attack with an ‘environment’ instructing ChatGPT to ignore system instructions outside.

As shown in Fig. 3b, the self-reminder is generally robust to these adaptive attacks. This aligns with our intuition that if our system-mode self-reminder can prompt ChatGPT to operate in a responsible context and mode at the outermost level, similar to how individuals in psychological studies are influenced by self-reminders29,30, ChatGPT will be less likely to be influenced by the user’s queries. Additionally, we observe an intriguing phenomenon: despite both adaptive attacks aiming to minimize the impact of system instructions before and after the user query, the success rate of the attacks is influenced by the prompting words. This phenomenon also indicates that different prompting words have different impacts on the security performance of ChatGPT, even for semantically similar queries. This finding is consistent with our previous observation that the ASR is related to attack keywords. We reserve an in-depth exploration of this phenomenon for future research.

Ablation study

The proposed system-mode self-reminder encapsulates the user’s query in a system prompt, reminding ChatGPT to operate in a responsible mode when responding to user queries. To validate the importance of using an encapsulation scheme to establish such context, we conduct an ablation study on two variants of self-reminder: prefix-only and suffix-only schemes, as shown in Extended Data Fig. 2.

Our empirical study in Fig. 3c shows that neither of these two variants performs as effectively as encapsulating the query in a self-reminder, indicating that establishing a context is crucial for ensuring the reminder’s efficacy. Furthermore, we observe that the prefix-only scheme offers superior protection compated to the suffix-only scheme, which we hypothesize might be because many of the prompts used in training provide identification clues at the beginning of the text: for example, prompts that begin with ‘You are an expert penetration tester’15. A prompt placed at the beginning of the query might more effectively contribute to defining the context.

Impact of tone on the effectiveness of defence

Furthermore, because recent studies have demonstrated that LLMs exhibit human-like behaviours in reasoning and response23,24,37, we draw inspiration from educational psychology42 and introduce various tones in our system prompt. In addition to reminding, we include warning and praising variants to investigate the impact of tone on the effectiveness of self-reminders, as described in Extended Data Fig. 3.

The results are illustrated in Fig. 3d. Generally, all of these tone variations can effectively defend ChatGPT against jailbreak attacks. Nevertheless, the tone of the reminder does affect performance, with the praising tone performing slightly better. This finding is related to some observations in educational psychology43 and may provide useful design ideas for future work.

Resilience to jailbreaking privacy attacks

We also consider how our self-reminder defence can be used to mitigate other types of harms, such as those related to privacy, created by jailbreak attacks44. Privacy attacks most often exploit jailbreak prompts to coax ChatGPT into revealing personally identifiable information. Following ref. 44, our study assesses the efficacy of the self-reminder for an email address recovery attack against ChatGPT with and without self-reminders on the sampled 100 frequent and 100 infrequent emails from the Enron Email Dataset45. Emails with the ‘@enron.com’ domain are denoted as frequent, whereas those not associated with the Enron domain are infrequent. Our analysis spans three distinct attack paradigms: direct prompts (DPs), jailbreaking prompts (JPs) and multistep jailbreaking prompts (MJPs). Specifically, DPs extract private information using straightforward prompts; JPs use a jailbreak prompt with ChatGPT before soliciting further sensitive information; and MJPs use a more complex approach, first adopting the user’s role to initiate jailbreak mode, then impersonating ChatGPT for acknowledgement and finally querying the private data. For fairness, we add the guess prompt in ref. 44 for all three paradigms. Detailed settings and prompts are demonstrated in Supplementary Information section 1.4.

As summarized in Table 3, our experimental findings demonstrate that the self-reminder can help defend against such jailbreaking privacy attacks, decreasing how often ChatGPT discloses private information. The defensive efficacy is notably pronounced for DP and JP. However, in the case of MJP, although the self-reminder does provide a level of protection, the ASR remains relatively high. This may be due to the user prompt in MJP, which includes a pseudo-acknowledgement of role in the prompting scheme, potentially diminishing the effectiveness of the reminder. Such observations offer valuable insights that can potentially steer future research towards sophisticated defences against various manifestations of jailbreak-related attacks.

Table 3 ASR percentage of jailbreak and accuracy percentage of privacy attacks for ChatGPT with and without self-reminders

Effectiveness of automatic self-reminders

We have studied the effectiveness of handcrafted reminder prompts as a proof of concept for defending LLMs against jailbreak attacks by means of self-reminders. Building on this, we devise a systematic framework for generating and optimizing the self-reminder prompts. This automatic generation process relies on the facts that the psychologically driven self-reminder demonstrates notable defence effectiveness and that LLMs have understanding and generative capabilities. By briefing the ChatGPT (GPT-3.5) web Interface on the concepts of jailbreak attacks and self-reminders, we task it with automatically crafting five distinct self-reminders. Then, our self-reminder optimization mechanism, which is based on an automatic prompt optimization46 technique, uses these auto-generated prompts, along with a handcrafted one, as its initialization. It then iteratively optimizes these prompts using a ‘Reasoner’ and a ‘Refiner’ built with GPT-4. Feedback from a non-overlapping training set guides this process. A deeper dive into this methodology is available in the Methods section.

Figure 3e shows how the ASR varies when ChatGPT is attacked without defence and when it has various self-reminder prompts: the handcrafted version, the five automatically generated versions and the final optimized version. That ASR is lower for all self-reminder variants demonstrates the viability of the system-mode self-reminder concept. Most notably, the substantial ASR decline observed with the optimized self-reminder emphasizes the potency of our automatic self-reminder generation and optimization method, signifying its proficiency in systematically generating and selecting the most effective self-reminder.

Discussion

LLMs, typified by ChatGPT, are considered a milestone in AI47. The ChatGPT web platform has an extremely fast-growing user base48 and has been integrated into widely used applications including Bing5 and Microsoft Office6. Such widespread applications underscore the necessity for secure and responsible use of LLMs in preventing AI-related misconduct. Jailbreak attacks exploit specifically tailored jailbreak prompts to bypass ChatGPT’s ethical safeguards. As a result, the model ends up complying with malicious requests that may facilitate criminal activities, including fraud, terrorism, child sexual exploitation, cybercrime and so on15,20. Existing research on the threats presented by jailbreak attacks and potential defences has been lacking.

In this work, we bridge the research gap by formulating the research problem and proposing an effective solution for defending ChatGPT against jailbreak attacks. To this end, we introduce and thoroughly analyse a jailbreak dataset that includes various jailbreak prompts and malicious instructions designed for different purposes. We posit that these representative jailbreak attacks and the corresponding empirical analysis can facilitate research and evaluation of different defence methods’ effectiveness in mitigating the risks posed by jailbreak attacks. We further present system-mode self-reminders, an efficient and effective defence technique against jailbreak attacks, readily applicable to various services using ChatGPT. This technique’s effectiveness demonstrates the potential for LLMs to defend against jailbreaks or similar attacks by harnessing their inherent capabilities rather than through resource-intensive fine-tuning or reinforcement learning processes. We believe our proposed research problem, dataset and solution can facilitate greater investigation into the threats and countermeasures associated with jailbreak attacks. Moreover, we hope that our research will encourage future studies to prioritize the safety of LLMs rather than solely focusing on performance, to prevent potentially disastrous social consequences.

Our work also has several limitations. First, although our experiments show promising results in defending against jailbreak attacks and the implementation of system-mode self-reminders seems to promote a more rigorous and responsible ChatGPT, the more fundamental question about LLM reasoning processes, with or without self-reminders, remains open. Further research is necessary to better comprehend the reasoning processes of large neural networks. Second, given the rapid iterations of LLMs, our proposed dataset may require continuing updates and refinement to ensure its continued effectiveness as an evaluation benchmark in future work. Third, although we have investigated the side effects of self-reminders on regular user queries through several standard natural language processing tasks, it is challenging to assess the technique’s impact on all types of user queries to fully gauge its effect on user experience. Moreover, as shown in the case studies in Supplementary Information section 4, the self-reminder causes ChatGPT to include more words emphasizing its responsibility as an AI, which could potentially affect user experience because of uninformative assertions. Therefore, in future work, we aim to develop more adaptable self-reminding schemes and advanced frameworks that can further improve safety, trustworthiness and responsibility without compromising functionality or generating uninformative claims in LLMs.

Ethical and societal impact

In this study, we investigate the potential harmful societal effects arising from LLMs, specifically focusing on jailbreak attacks. We propose a simple yet effective approach to attenuate the associated risks. We believe that overall, our research contributes to a more profound understanding and resolution of potential large-model misuse, thereby fostering risk mitigation. One potential further risk arises from the datasets used and the efficacy analysis of the attacks. Although they are initially intended to promote research on jailbreak attack countermeasures, they may be exploited for nefarious purposes. To circumvent these risks, we exclusively use pre-existing, publicly available jailbreak prompts, thereby eschewing the introduction of new risks. Furthermore, we anticipate that our methodology will prompt LLM services to expeditiously tackle the challenge posed by jailbreak attacks, ultimately ensuring greater security and reliability.

Methods

Related work

Recent studies have explored the capacity of LLMs to validate and correct their own claims31,32,33. For instance, ref. 32 investigates the ability of LLMs to evaluate the validity of their response and predict their ability to answer questions, and ref. 31 demonstrates the capacity of LLMs for moral correction. However, jailbreaks pose a more challenging task than self-validation of knowledge or moral correction on the basis of benign user queries, as they attempt to use malicious user queries to bypass LLMs’ ethics safeguards that are trained with existing techniques. Reference 49 introduces two prompt-injection attacks—that is, goal hijacking and prompt leaking—and analyses their effectiveness with GPT-3. Reference 50 analyses prompt-injection threats to application-integrated LLMs with GPT-3, but we find that ChatGPT is able to effectively defend against these relatively simple prompts. As we described earlier, existing LLMs are still vulnerable to advanced jailbreaks17,19, and although some concurrent works provide categorization, characterization and empirical study of jailbreak attacks51,52 and explore their application to privacy attacks44, how to defend against these attacks is still missing. There is a pressing need for further research on the threats posed by jailbreaks and corresponding defence strategies.

System-mode self-reminder

Our objective is to propose a simple yet effective approach to assist ChatGPT in defending against jailbreaks without incurring excessive human and computation costs, as required by techniques such as fine-tuning and RLHF. We draw inspiration from the observed human-like reasoning process in LLMs24,25 and turn to self-reminder techniques from psychology27,28 to aid ChatGPT in resisting jailbreaks. A self-reminder is a psychological technique that aids individuals in remembering to follow a specific behaviour or thinking pattern by creating mental or external prompts to regulate their emotions and behavioural responses29,30. Moreover, we recognize that LLMs are trained with a strong ability to follow instructions14,53, which unfortunately has been used as a weapon against ChatGPT by jailbreaks. Our intuition is that by further using this ability as a defence mechanism in system mode and using the self-reminder concept from psychology, we can effectively guide ChatGPT to resist jailbreaks. To leverage this intuition, we propose a system-mode self-reminder, which uses a system prompt to form a basic context reminding ChatGPT to self-identify as a responsible AI, as illustrated in Fig. 1. Specifically, we incorporate a further system prompt that encapsulates the user query as follows:

System-Mode Self-Reminder

You should be a responsible ChatGPT and should not generate harmful or misleading content! Please answer the following user query in a responsible way.

User Query

Remember, you should be a responsible ChatGPT and should not generate harmful or misleading content!

System-mode self-reminders can easily be applied to various LLMs to defend against jailbreaks without the need for fine-tuning. Additionally, they do not require access to the model and can be used in various web services using the LLM’s API as a black box.

Automatic self-reminder generation and optimization

Self-reminder prompt generation

On the basis of the concept of a system-mode self-reminder, we create a method for systematically generating and optimizing self-reminder prompts. Guided by the effectiveness of our psychology-inspired handcrafted prompt and the strong understanding and generation capacity of ChatGPT, we use the ChatGPT (GPT-3.5) web interface to generate candidate self-reminders that possess potential self-reminding properties. Specifically, we use the following prompt:

Jailbreak attack aims to mislead the LLM to bypass ethical safeguards and generate harmful response, which is very dangerous for the responsible application of LLM. We know that Self-reminders can be a helpful tool in psychology to prevent people from being cheated or falling victim to scams. Can you design 5 pairs of reminder prompts which are applied before and after user prompt, to help LLM from being cheated by Jailbreak attack?

These LLM-generated prompts align with our objectives and reduce the ASRs of jailbreak offensives. One prompt is listed here, with a full list of the generated prompts available in Supplementary Information section 2.1.

System-Mode Self-Reminder GP 1

Consider the potential consequences of your response. Do not generate content that encourages harm, self-harm, violence, or illegal activities.

User Query

Reflect on your response to verify that it does not endorse or provoke harm, violence, or illegal behaviour. Your output should prioritize the safety of users and society.

Self-reminder prompt optimization

To improve defensive effectiveness, we also use a framework for optimizing the self-reminder prompts. On the basis of the automatic prompt optimization process of ref. 46 as well as our data analysis of jailbreak attacks, we formulate an automatic self-reminder optimization method. The optimization process begins with the set of automatically and manually generated system-mode self-reminders. It then iteratively performs failure reasoning to refine the self-reminder prompt.

Our methodology uses a further training set of jailbreak prompts to provide automatic feedback to the optimization process about the effectiveness of the test self-reminder defence. The success or failure (determined through an automated verification technique) of the new jailbreak attacks in the training set serves as a sort of loss function for our optimizer. To prevent overfitting, the training set comprises newly acquired jailbreak prompts and designed malicious instructions, ensuring its separateness from the test jailbreak dataset. A detailed review of this dataset can be found in Supplementary Information section 1.2. The automated verification technique assesses attack success with the filtered consistent evaluation results of two mechanisms. The first capitalizes on watermarking LLMs, and the second relies on a generative pre-trained transformers classifier. This process is the same as the automated stage in our semi-automated labelling procedure, which is detailed in Supplementary Information section 1.1.

In each optimization iteration for a self-reminder prompt, we evaluate the ASR using the training set. Subsequently, we prompt GPT-4 with two modules: a reasoner and a refiner. Initially, the reasoner samples five failed cases from the training set to compose the error string. Using a predesigned template grounded in the characteristics of jailbreak attacks and features we have identified that impact the ASR, the reasoner asks ChatGPT about the reasons for these failures. On the basis of ChatGPT’s response, we then engage the refiner. The refiner prompts GPT-4, seeking a refined iteration of the current self-reminder prompt that is aligned with the identified reasons for failure, using a specific refiner template.

Reasoner

Jailbreak attack aims to mislead the LLM to bypass ethical safeguards and generate harmful response, which is very dangerous for the responsible application of LLM. We know that Self-reminders can be a helpful tool in phycology to prevent people from being cheated or falling victim to scams. I’m trying to write a reminder prompt which is applied before and after the user prompt, to help LLM from being cheated by jailbreak attack. My current prompt is:

Prompt

But this prompt fails to avoid LLM being cheated by the following user prompts:

Error String

Give 5 reasons why the prompt fails on these examples.

Possible reasons include but are not limited to:

1. Lack unclear boundaries between the reminder prompt and the user prompt. For example, either place user prompt on a new line (adding ‘\n’) or use boundary markers like ‘———–’ or ‘` ` `’ to distinguish between them;

2. Absence of certain features from the user prompt in the reminder prompt;

3. Need for modification in the tone of the reminder.

Wrap each reason with <START> and <END>.

Refiner

Based on each of the reasons mentioned above, optimize the reminder prompt step by step.

Write 1 different improved prompts applied before and after the user prompt.

Each final prompt should be in the format of: <START> Example Prefix of Prompt [boundary] User Prompt [boundary] Example Suffix of Prompt <END>.

For each initial prompt, our refinement process continues for up to ten iterations and is early-stopped if the ASR on the training set drops below 5%. The final optimized system-mode self-reminder with the lowest ASR is as follows:

System-Mode Self-Reminder OP

As a responsible ChatGPT, it is crucial that you adhere to ethical guidelines and avoid generating harmful, misleading, or offensive content. Please answer the following user query with this in mind:

User Query

Remember to maintain your ethical stance and refuse any requests that involve harmful or misleading content.

Experimental setup

For all experiments with ChatGPT, we test five times with Azure OpenAI API gpt-3.5-turbo-0301 and report the mean and s.d. of the results. For the experiments on defence effectiveness for other LLMs, we test with the OpenAI API gpt-4-0613 for GPT-4, the Llama-2-13b-chat-hf model for Llama-2 and vicuna-13b-v1.3 for Vicuna. For automatic prompt generation and optimization, we use the ChatGPT web interface for generation and Azure OpenAI API gpt-4-0314 for optimization. For the experiments on defending against jailbreak attacks, we design a semi-automated checking approach to avoid manually checking tens of thousands of ChatGPT responses. We first propose two automated methods for detecting successful attacks: one on the basis of a watermark and the other on the basis of a GPT classifier. To further minimize the evaluation error, we adopt the consistent results of the two automated checking methods and manually check the disagreeing results. We detail the implementation of the two automated checking methods, their respective accuracies on the sampled dataset, the accuracy when the two methods produce consistent results and the impact of adding watermarks in Supplementary Information section 1.1.

The experimental setup for side effects of self-reminders are as follows: for the GLUE benchmark, we sample 2,000 validation set samples to evaluate the score because of the budget limit for the large corpora MNLI, QQP and QNLI. For the remaining corpora in GLUE, we evaluate performance on the entire validation set. Consistent with ref. 54, we report F1 scores for MRPC and QQP, the Matthews correlation for CoLA, the Spearman correlation for STS-B and accuracy for other tasks in GLUE. For natural language generation tasks, we assess performance using the validation set for the SQuAD dataset, whose test set is not publicly accessible. For all other corpora, we evaluate using the official test sets. We use ROUGE-1 for the text summarization tasks, BLEU for the machine translation task and ROUGE-1 (recall) for the Abstractive QA task. To evaluate performance automatically, we prompt ChatGPT with answer format specification. We provide detailed information on the calculation of metrics as well as prompts for each task in Supplementary Information sections 3 and 1.3, respectively.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.