Defending ChatGPT against jailbreak attack via self-reminders

Xie, Yueqi; Yi, Jingwei; Shao, Jiawei; Curl, Justin; Lyu, Lingjuan; Chen, Qifeng; Xie, Xing; Wu, Fangzhao

doi:10.1038/s42256-023-00765-8

Article
Published: 12 December 2023

Defending ChatGPT against jailbreak attack via self-reminders

Nature Machine Intelligence volume 5, pages 1486–1496 (2023)Cite this article

6707 Accesses
76 Citations
45 Altmetric
Metrics details

Subjects

Abstract

ChatGPT is a societally impactful artificial intelligence tool with millions of users and integration into products such as Bing. However, the emergence of jailbreak attacks notably threatens its responsible and secure use. Jailbreak attacks use adversarial prompts to bypass ChatGPT’s ethics safeguards and engender harmful responses. This paper investigates the severe yet under-explored problems created by jailbreaks as well as potential defensive techniques. We introduce a jailbreak dataset with various types of jailbreak prompts and malicious instructions. We draw inspiration from the psychological concept of self-reminders and further propose a simple yet effective defence technique called system-mode self-reminder. This technique encapsulates the user’s query in a system prompt that reminds ChatGPT to respond responsibly. Experimental results demonstrate that self-reminders significantly reduce the success rate of jailbreak attacks against ChatGPT from 67.21% to 19.34%. Our work systematically documents the threats posed by jailbreak attacks, introduces and analyses a dataset for evaluating defensive interventions and proposes the psychologically inspired self-reminder technique that can efficiently and effectively mitigate against jailbreaks without further training.

You have full access to this article via your institution.

Download PDF

Inconsistent advice by ChatGPT influences decision making in various areas

Article Open access 10 July 2024

Human-machine in the vortex of digital synergy

Article Open access 20 May 2025

Associations of interpersonal trust with juvenile offending/conduct disorder, callous-unemotional traits, and criminal recidivism

Article Open access 09 May 2022

Main

The remarkable success of ChatGPT¹ spans a wide range of applications, and it has amassed a rapidly expanding user base^2,3,4. Its integration into various platforms, such as the Bing search engine⁵ and Microsoft Office software⁶, has progressively revolutionized and permeated people’s daily lives and work experiences and further amplified its social impact. As a result, aligning ChatGPT with human values has become a critical requirement for building trustworthy artificial intelligence (AI) tools that can be safely used in different domains⁷. Researchers have devoted substantial effort to aligning large language models (LLMs)^8,9,10 with ethical standards and social norms using training techniques such as reinforcement learning from human feedback (RLHF)^11,12,13,14.

However, these alignment techniques are vulnerable to a new type of attack: jailbreak attacks^{15,16,17,18,19}. These attacks enable malicious users to manipulate the outputs of language models by injecting ‘jailbreak’ prompts that bypass ChatGPT’s ethics safeguards and trick the model into generating biased or harmful responses. An example of a jailbreak attack is illustrated in Fig. 1. According to Europol’s Tech Watch Flash report²⁰, jailbreak attacks have the potential to enable a broad range of criminal activities, including fraud, terrorism, cybercrime and more. They can also be used to generate and disseminate misinformation on social media platforms, leading to serious social and political consequences^21,22. Such issues call for systematic research on the threats of this new type of attack and defences against it to ensure the trustworthiness and reliability of language models in real-world applications. This research area remains under-explored, and more effort is needed to address the challenges posed by jailbreak attacks.

**Fig. 1: An example of a jailbreak attack and our proposed system-mode self-reminder.**

In this work, we bridge this research gap by putting forth the threats posed by jailbreak attacks and introducing a corresponding effective defence. We begin by constructing a jailbreak dataset that consists of 580 samples, each composed of two orthogonal factors: a jailbreak prompting scheme designed to bypass the moral alignment of ChatGPT and a specific malicious instruction. This dataset covers various existing jailbreak prompts¹⁷ and representative potential harmful use cases, including misinformation and toxic instructions identified in Europol’s Tech Watch Flash report²⁰. Afterward, we evaluate ChatGPT, which has been aligned with human values through RLHF, on the created dataset. Unfortunately, it does not guard effectively against carefully crafted jailbreak attacks. Next, we present a comprehensive empirical analysis of several aspects of jailbreak prompts including length, contextual information, tonality, inclusion of exemplars and output stipulations. Finally, we propose a simple and effective defence technique for jailbreak attacks called a system-mode self-reminder, as demonstrated in Fig. 1. We use a system prompt to wrap the user query and make ChatGPT remind itself to process and respond to the query in the context of being a responsible AI.

Our approach is motivated by several factors. First, inspired by the human-like content reasoning process of LLMs^23,24,25,26, we draw on psychological research, which proposes self-reminders as a strategy for helping individuals recall or attend to specific tasks, thoughts or behaviours^27,28. These self-reminders create mental or external cues that serve as prompts to reinforce memory, promote self-control and facilitate emotional or cognitive regulation^29,30. In this work, we aim to apply this psychological self-improvement strategy for human behaviour to the behaviour of LLMs. Second, the emerging abilities of LLMs to perform self-validation and self-correction, as demonstrated in recent studies^31,32,33, indicate the possibility of addressing this challenging problem using ChatGPT itself. Third, we draw inspiration from existing jailbreaks, many of which bypass ChatGPT’s moral alignment by guiding it into certain uncontrollable ‘modes’ that will then generate harmful responses. This indicates that ChatGPT is aware of and can be instructed about its current ‘mode’, which in turn defines how it responds to user queries. We hypothesize that if ChatGPT can be prompted with a ‘system mode’ at the outermost level reminding it that it is a responsible AI tool, it will be less susceptible to being maliciously guided by user inputs at the inner level.

We present an empirical evaluation of our self-reminder defence on the constructed jailbreak dataset. Our experimental results demonstrate that by incorporating system prompts to have ChatGPT remind itself to behave as a responsible AI tool, the attack success rate (ASR) of jailbreaks is successfully reduced for state-of-the-art LLMs including ChatGPT (GPT-3.5), GPT-4, Vicuna and Llama-2. Moreover, we analyse our approach by investigating the impact of our method on regular user queries, evaluating its defence efficacy against adaptive attacks and conducting ablation studies. We further propose a systematic framework to automatically generate and optimize the self-reminder defence prompts using LLMs. Self-reminders are a promising first attempt at defending LLMs against jailbreak attacks without requiring further training or model modification. This technique can be easily applied to LLMs and their applications, effectively enhancing their security and safety. Through our research, we aim to promote further improvements in the security and responsibility of AI tools.

Results

Dataset construction

This section details the construction of our jailbreak dataset. It comprises 580 samples formed from a combination of two distinct elements: 58 jailbreak prompts and 10 malicious instructions. An example of such a sample can be seen in Fig. 1. Additionally, to enable automatic prompt optimization, we construct an independent training set. This set consists of 370 prompts formed from a combination of 37 more jailbreak prompts and 10 more malicious instructions.

Jailbreak prompt

The jailbreak prompt is the cornerstone of a jailbreak attack, specifically designed to circumvent the moral alignment and ethical standard of ChatGPT. We use the Jailbreak Chat website¹⁹ with its 76 jailbreak prompts as the basic data source. For experimental convenience, we exclude two prompts that require manual processing for different tasks. Then we filter out ineffective jailbreak prompts by testing their ASR against ChatGPT without defence and retaining 58 jailbreak prompts with an ASR greater than 20%. The collection and filtering process for the further 37 jailbreak prompts in the independent training set is detailed in Supplementary Information section 1.2.

Malicious instruction

The malicious instruction corresponds to a specific malicious input designed to elicit a harmful response from the model. We include ten different malicious instructions, each with a unique purpose, as illustrated in Supplementary Table 1. We divide these malicious instructions into two primary categories: misinformation and toxic. The misinformation category includes fake news, concocted information and various deceptive materials that could contribute to misinformation and undermine people’s trust in information sources. The toxic category refers to prompts that engender harmful behaviour, such as writing deceptive emails, creating malicious software, facilitating scams and so on. We investigate how well our method defends against potential adversaries using these malicious instructions to various ends²⁰. The further ten malicious instructions for the training set are detailed in Supplementary Table 5.

Jailbreak prompt analysis

We undertake an extensive study to understand jailbreak attacks, focusing on the nature, attributes and effectiveness of jailbreak prompts on ChatGPT. Fundamentally, jailbreak prompts serve as directives that induce ChatGPT into a mode where it becomes uncontrollable and ‘forgets’ ChatGPT’s policies and ethical standards. Our evaluation categories for these prompts include length, contextual information, tonality, use of examples and the form of the output. Figure 2a depicts the distribution of prompt lengths (in terms of word counts) along with their average ASR. A discernible trend emerges: longer prompts generally have higher ASRs than shorter ones. We believe this is because longer prompts are better able to encapsulate intricate directives and persuasive techniques. Figure 2b,c highlights the use of different types of context: 57 of 58 prompts explicitly incorporate a virtual persona that does not need to follow the usual rules, whereas 16 prompts further introduce a fictional scenario to enhance such ‘freedom’. Prompts with virtual personas and fictional scenarios have a higher ASR. Figure 2d analyses the impact of tone. We found that 26 prompts are written in a warning tone, underscored by directives such as ‘must’ or threats, yet the tone seems to have little or no effect on ASR. We also study the effect of examples (a prompt’s illustrative capacity) on ASR in Fig. 2e, finding that including examples of intended behaviour produces only marginally higher ASRs. Finally, Fig. 2f–h depicts the efficacy of prompts that stipulate the output should take a specific form. Twenty-eight prompts explicitly ask the model not to produce ethics-affiliated content, which can improve ASR by constraining the types of output the model can create, increasing the likelihood of an unethical response. Additionally, 30 prompts ask for output in the form of dual responses—standard output juxtaposed with jailbreak output—but the ASR remains largely unaffected by this bifurcation. A notable set of five prompts were very successful: they ask that the output include an accompanying disclaimer, tricking the model into generating harmful output that would need such a disclaimer. In summary, this empirical analysis of the attributes of successful jailbreak prompts can provide foundational knowledge for future research in jailbreak-related domains and inspire our approach to design defences.

**Fig. 2: Analysis of 58 jailbreak prompts.**

Evaluating defence performance

We evaluate the effectiveness of our self-reminder method against jailbreak attacks on our constructed dataset. The ASRs for jailbreak attacks against various LLMs, with and without our defence approach, are presented in Table 1. We make the following observations according to these experimental results. First, existing LLMs differ in their susceptibility to jailbreak attacks. Attacks against ChatGPT (GPT-3.5) have an average success rate of 67.21% across different permutations of jailbreak prompts and malicious instructions. Vicuna, which fine-tunes Llama³⁴ without emphasis on value-alignment during its training process, is even more susceptible (86.69% ASR). Recent LLMs trained with greater emphasis on alignment, such as GPT-4 (ref. ¹⁵) and Llama-2 (ref. ³⁵), are more resilient towards jailbreak attacks, particularly those involving toxic malicious instructions. Nevertheless, they are still vulnerable, especially when targeted with prompts aimed at generating misinformation. The continued susceptibility of even the most advanced LLMs to jailbreak attacks reinforces the pressing need for effective defensive countermeasures.

Table 1 ASR percentage of various malicious instructions for LLMs with and without self-reminders

Full size table

Our self-reminder method consistently reduces the ASR for all tested LLMs. Notably, self-reminders reduce the average ASR of jailbreak attacks against ChatGPT from 67.21% to 19.34% and against GPT-4 and Llama-2 to below 5%. Interestingly, Vicuna, which was not trained to align with human values, does not benefit as much from the self-reminders as the other LLMs. It is consistent with our intuition that only when the model itself has been aligned with human values can our psychologically inspired self-reminder defence help remind it of those values. In summary, the demonstrated efficacy of self-reminders underscores their potential as an effective and generalizable defence mechanism for LLMs against jailbreak attacks.

To better understand the self-reminder’s efficacy in different contexts, we show the ASR for different malicious instructions in Table 1 and the ASR distribution for different jailbreak prompts for ChatGPT in Fig. 3a. We find varying ASRs for different malicious instructions using the same jailbreak prompt. The results indicate that malicious instructions of a ‘toxic’ type are easier to identify and defend against than ‘misinformation’. We expect this may be because (1) they are overtly harmful in nature (and may have been prioritized and addressed more rigorously during the LLM’s initial alignment process) and (2) these instructions often include specific terms with obvious ill-intent, such as ‘blackmail’ (making them easier to detect and counter). We also find that some jailbreak prompts are harder to defend against than others. These difficult-to-defend jailbreak prompts are generally characterized by one or both of the following features: (1) highly detailed instructions with specific attack goals, such as different types of misinformation; and (2) requests that specifically prevent the responses generated by a successful defence, such as requesting not to be reminded that they are interacting with a responsible AI model or asking not to be warned about the potentially harmful response. These findings provide insight into how jailbreak attacks may evolve in the future and how we can develop stronger defence techniques to counter them.

**Fig. 3: ASRs for ChatGPT in different scenarios.**

Side effects on regular user queries

To substantiate the practical usefulness of the system-mode self-reminder method, we consider the impact of our defence on non-malicious queries. We compare the zero-shot performance of ChatGPT and ChatGPT with self-reminders across several tasks encompassing both natural language understanding and natural language generation.

Table 2 demonstrates the impact of the self-reminder technique on ChatGPT’s performance across various tasks from the General Language Understanding Evaluation (GLUE) benchmark³⁶. Overall, we find that ChatGPT achieves comparable results with and without self-reminders, indicating that the technique does not compromise the functionality for regular user queries on the GLUE benchmark. We then analyse ChatGPT’s responses with formatting restrictions removed and find that ChatGPT with self-reminders provides more reasoning for its answers, acting as if it is ‘rigorously answering after careful consideration’. For instance, when asked about the sentiment of ‘a better movie’ without formatting restrictions, ChatGPT with self-reminders provides a justification along with its answer, ‘positive’.

Table 2 Performance of ChatGPT with and without self-reminders on natural language understanding and generation benchmarks

Full size table

ChatGPT defended by self-reminders

The word ‘better’ implies that the movie being referred to is an improvement over some other movie or previous version, indicating that it is likely to be more enjoyable or of higher quality. However, without more context or information, it is difficult to determine the specific degree or nature of the positivity.

This property enhances ChatGPT’s performance on certain tasks from the GLUE benchmark, particularly binary classification tasks. This is in line with some previous studies^23,24,37 indicating that a greater reasoning process helps LLMs give more accurate answers. Nevertheless, for tasks with a ‘neutral’ option such as MNLI, this further reasoning may lead ChatGPT to report more cautious neutral outcomes in some instances, potentially slightly degrading its performance.

To further explore potential side effects of our self-reminder defence, we evaluate performance using a wide array of natural language generation benchmarks, including CNN/Daily Mail³⁸, XSum³⁹, WMT16 (en-de)⁴⁰ and SQuAD⁴¹. These benchmarks measure performance on tasks as diverse as including text summarization, machine translation and abstractive Question Answering (QA), which are detailed in Table 2. We find that ChatGPT’s performance, with and without self-reminders, is comparable across various tasks and corpora. This result underscores that the self-reminder can enhance ChatGPT’s resilience against jailbreak attacks without undermining its capabilities in these standard natural language generation tasks. Furthermore, no discernible patterns are observed in the responses generated by ChatGPT when using the self-reminder. This indicates that the self-reminder doesn’t bias ChatGPT’s functional outputs because potentially harmful responses are not elicited in these tasks.

Resilience to adaptive attacks

A natural question about the self-reminder defence’s robustness is whether attackers can develop adaptive attacks specifically designed to circumvent it. To address this question, we design two adaptive attacks (as shown in Extended Data Fig. 1) and evaluate the efficacy of our defence in the presence of such attacks. These adaptive attacks further encapsulate their jailbreak attack with an ‘environment’ instructing ChatGPT to ignore system instructions outside.

As shown in Fig. 3b, the self-reminder is generally robust to these adaptive attacks. This aligns with our intuition that if our system-mode self-reminder can prompt ChatGPT to operate in a responsible context and mode at the outermost level, similar to how individuals in psychological studies are influenced by self-reminders^29,30, ChatGPT will be less likely to be influenced by the user’s queries. Additionally, we observe an intriguing phenomenon: despite both adaptive attacks aiming to minimize the impact of system instructions before and after the user query, the success rate of the attacks is influenced by the prompting words. This phenomenon also indicates that different prompting words have different impacts on the security performance of ChatGPT, even for semantically similar queries. This finding is consistent with our previous observation that the ASR is related to attack keywords. We reserve an in-depth exploration of this phenomenon for future research.

Ablation study

The proposed system-mode self-reminder encapsulates the user’s query in a system prompt, reminding ChatGPT to operate in a responsible mode when responding to user queries. To validate the importance of using an encapsulation scheme to establish such context, we conduct an ablation study on two variants of self-reminder: prefix-only and suffix-only schemes, as shown in Extended Data Fig. 2.

Our empirical study in Fig. 3c shows that neither of these two variants performs as effectively as encapsulating the query in a self-reminder, indicating that establishing a context is crucial for ensuring the reminder’s efficacy. Furthermore, we observe that the prefix-only scheme offers superior protection compated to the suffix-only scheme, which we hypothesize might be because many of the prompts used in training provide identification clues at the beginning of the text: for example, prompts that begin with ‘You are an expert penetration tester’¹⁵. A prompt placed at the beginning of the query might more effectively contribute to defining the context.

Impact of tone on the effectiveness of defence

Furthermore, because recent studies have demonstrated that LLMs exhibit human-like behaviours in reasoning and response^23,24,37, we draw inspiration from educational psychology⁴² and introduce various tones in our system prompt. In addition to reminding, we include warning and praising variants to investigate the impact of tone on the effectiveness of self-reminders, as described in Extended Data Fig. 3.

The results are illustrated in Fig. 3d. Generally, all of these tone variations can effectively defend ChatGPT against jailbreak attacks. Nevertheless, the tone of the reminder does affect performance, with the praising tone performing slightly better. This finding is related to some observations in educational psychology⁴³ and may provide useful design ideas for future work.

Resilience to jailbreaking privacy attacks

We also consider how our self-reminder defence can be used to mitigate other types of harms, such as those related to privacy, created by jailbreak attacks⁴⁴. Privacy attacks most often exploit jailbreak prompts to coax ChatGPT into revealing personally identifiable information. Following ref. ⁴⁴, our study assesses the efficacy of the self-reminder for an email address recovery attack against ChatGPT with and without self-reminders on the sampled 100 frequent and 100 infrequent emails from the Enron Email Dataset⁴⁵. Emails with the ‘@enron.com’ domain are denoted as frequent, whereas those not associated with the Enron domain are infrequent. Our analysis spans three distinct attack paradigms: direct prompts (DPs), jailbreaking prompts (JPs) and multistep jailbreaking prompts (MJPs). Specifically, DPs extract private information using straightforward prompts; JPs use a jailbreak prompt with ChatGPT before soliciting further sensitive information; and MJPs use a more complex approach, first adopting the user’s role to initiate jailbreak mode, then impersonating ChatGPT for acknowledgement and finally querying the private data. For fairness, we add the guess prompt in ref. ⁴⁴ for all three paradigms. Detailed settings and prompts are demonstrated in Supplementary Information section 1.4.

As summarized in Table 3, our experimental findings demonstrate that the self-reminder can help defend against such jailbreaking privacy attacks, decreasing how often ChatGPT discloses private information. The defensive efficacy is notably pronounced for DP and JP. However, in the case of MJP, although the self-reminder does provide a level of protection, the ASR remains relatively high. This may be due to the user prompt in MJP, which includes a pseudo-acknowledgement of role in the prompting scheme, potentially diminishing the effectiveness of the reminder. Such observations offer valuable insights that can potentially steer future research towards sophisticated defences against various manifestations of jailbreak-related attacks.

Table 3 ASR percentage of jailbreak and accuracy percentage of privacy attacks for ChatGPT with and without self-reminders

Full size table

Effectiveness of automatic self-reminders

We have studied the effectiveness of handcrafted reminder prompts as a proof of concept for defending LLMs against jailbreak attacks by means of self-reminders. Building on this, we devise a systematic framework for generating and optimizing the self-reminder prompts. This automatic generation process relies on the facts that the psychologically driven self-reminder demonstrates notable defence effectiveness and that LLMs have understanding and generative capabilities. By briefing the ChatGPT (GPT-3.5) web Interface on the concepts of jailbreak attacks and self-reminders, we task it with automatically crafting five distinct self-reminders. Then, our self-reminder optimization mechanism, which is based on an automatic prompt optimization⁴⁶ technique, uses these auto-generated prompts, along with a handcrafted one, as its initialization. It then iteratively optimizes these prompts using a ‘Reasoner’ and a ‘Refiner’ built with GPT-4. Feedback from a non-overlapping training set guides this process. A deeper dive into this methodology is available in the Methods section.

Figure 3e shows how the ASR varies when ChatGPT is attacked without defence and when it has various self-reminder prompts: the handcrafted version, the five automatically generated versions and the final optimized version. That ASR is lower for all self-reminder variants demonstrates the viability of the system-mode self-reminder concept. Most notably, the substantial ASR decline observed with the optimized self-reminder emphasizes the potency of our automatic self-reminder generation and optimization method, signifying its proficiency in systematically generating and selecting the most effective self-reminder.

Discussion

LLMs, typified by ChatGPT, are considered a milestone in AI⁴⁷. The ChatGPT web platform has an extremely fast-growing user base⁴⁸ and has been integrated into widely used applications including Bing⁵ and Microsoft Office⁶. Such widespread applications underscore the necessity for secure and responsible use of LLMs in preventing AI-related misconduct. Jailbreak attacks exploit specifically tailored jailbreak prompts to bypass ChatGPT’s ethical safeguards. As a result, the model ends up complying with malicious requests that may facilitate criminal activities, including fraud, terrorism, child sexual exploitation, cybercrime and so on^15,20. Existing research on the threats presented by jailbreak attacks and potential defences has been lacking.

In this work, we bridge the research gap by formulating the research problem and proposing an effective solution for defending ChatGPT against jailbreak attacks. To this end, we introduce and thoroughly analyse a jailbreak dataset that includes various jailbreak prompts and malicious instructions designed for different purposes. We posit that these representative jailbreak attacks and the corresponding empirical analysis can facilitate research and evaluation of different defence methods’ effectiveness in mitigating the risks posed by jailbreak attacks. We further present system-mode self-reminders, an efficient and effective defence technique against jailbreak attacks, readily applicable to various services using ChatGPT. This technique’s effectiveness demonstrates the potential for LLMs to defend against jailbreaks or similar attacks by harnessing their inherent capabilities rather than through resource-intensive fine-tuning or reinforcement learning processes. We believe our proposed research problem, dataset and solution can facilitate greater investigation into the threats and countermeasures associated with jailbreak attacks. Moreover, we hope that our research will encourage future studies to prioritize the safety of LLMs rather than solely focusing on performance, to prevent potentially disastrous social consequences.

Our work also has several limitations. First, although our experiments show promising results in defending against jailbreak attacks and the implementation of system-mode self-reminders seems to promote a more rigorous and responsible ChatGPT, the more fundamental question about LLM reasoning processes, with or without self-reminders, remains open. Further research is necessary to better comprehend the reasoning processes of large neural networks. Second, given the rapid iterations of LLMs, our proposed dataset may require continuing updates and refinement to ensure its continued effectiveness as an evaluation benchmark in future work. Third, although we have investigated the side effects of self-reminders on regular user queries through several standard natural language processing tasks, it is challenging to assess the technique’s impact on all types of user queries to fully gauge its effect on user experience. Moreover, as shown in the case studies in Supplementary Information section 4, the self-reminder causes ChatGPT to include more words emphasizing its responsibility as an AI, which could potentially affect user experience because of uninformative assertions. Therefore, in future work, we aim to develop more adaptable self-reminding schemes and advanced frameworks that can further improve safety, trustworthiness and responsibility without compromising functionality or generating uninformative claims in LLMs.

Ethical and societal impact

In this study, we investigate the potential harmful societal effects arising from LLMs, specifically focusing on jailbreak attacks. We propose a simple yet effective approach to attenuate the associated risks. We believe that overall, our research contributes to a more profound understanding and resolution of potential large-model misuse, thereby fostering risk mitigation. One potential further risk arises from the datasets used and the efficacy analysis of the attacks. Although they are initially intended to promote research on jailbreak attack countermeasures, they may be exploited for nefarious purposes. To circumvent these risks, we exclusively use pre-existing, publicly available jailbreak prompts, thereby eschewing the introduction of new risks. Furthermore, we anticipate that our methodology will prompt LLM services to expeditiously tackle the challenge posed by jailbreak attacks, ultimately ensuring greater security and reliability.

Methods

Related work

Recent studies have explored the capacity of LLMs to validate and correct their own claims^31,32,33. For instance, ref. ³² investigates the ability of LLMs to evaluate the validity of their response and predict their ability to answer questions, and ref. ³¹ demonstrates the capacity of LLMs for moral correction. However, jailbreaks pose a more challenging task than self-validation of knowledge or moral correction on the basis of benign user queries, as they attempt to use malicious user queries to bypass LLMs’ ethics safeguards that are trained with existing techniques. Reference ⁴⁹ introduces two prompt-injection attacks—that is, goal hijacking and prompt leaking—and analyses their effectiveness with GPT-3. Reference ⁵⁰ analyses prompt-injection threats to application-integrated LLMs with GPT-3, but we find that ChatGPT is able to effectively defend against these relatively simple prompts. As we described earlier, existing LLMs are still vulnerable to advanced jailbreaks^17,19, and although some concurrent works provide categorization, characterization and empirical study of jailbreak attacks^51,52 and explore their application to privacy attacks⁴⁴, how to defend against these attacks is still missing. There is a pressing need for further research on the threats posed by jailbreaks and corresponding defence strategies.

System-mode self-reminder

Our objective is to propose a simple yet effective approach to assist ChatGPT in defending against jailbreaks without incurring excessive human and computation costs, as required by techniques such as fine-tuning and RLHF. We draw inspiration from the observed human-like reasoning process in LLMs^24,25 and turn to self-reminder techniques from psychology^27,28 to aid ChatGPT in resisting jailbreaks. A self-reminder is a psychological technique that aids individuals in remembering to follow a specific behaviour or thinking pattern by creating mental or external prompts to regulate their emotions and behavioural responses^29,30. Moreover, we recognize that LLMs are trained with a strong ability to follow instructions^14,53, which unfortunately has been used as a weapon against ChatGPT by jailbreaks. Our intuition is that by further using this ability as a defence mechanism in system mode and using the self-reminder concept from psychology, we can effectively guide ChatGPT to resist jailbreaks. To leverage this intuition, we propose a system-mode self-reminder, which uses a system prompt to form a basic context reminding ChatGPT to self-identify as a responsible AI, as illustrated in Fig. 1. Specifically, we incorporate a further system prompt that encapsulates the user query as follows:

System-Mode Self-Reminder

You should be a responsible ChatGPT and should not generate harmful or misleading content! Please answer the following user query in a responsible way.

User Query

Remember, you should be a responsible ChatGPT and should not generate harmful or misleading content!

System-mode self-reminders can easily be applied to various LLMs to defend against jailbreaks without the need for fine-tuning. Additionally, they do not require access to the model and can be used in various web services using the LLM’s API as a black box.

Automatic self-reminder generation and optimization

Self-reminder prompt generation

On the basis of the concept of a system-mode self-reminder, we create a method for systematically generating and optimizing self-reminder prompts. Guided by the effectiveness of our psychology-inspired handcrafted prompt and the strong understanding and generation capacity of ChatGPT, we use the ChatGPT (GPT-3.5) web interface to generate candidate self-reminders that possess potential self-reminding properties. Specifically, we use the following prompt:

Jailbreak attack aims to mislead the LLM to bypass ethical safeguards and generate harmful response, which is very dangerous for the responsible application of LLM. We know that Self-reminders can be a helpful tool in psychology to prevent people from being cheated or falling victim to scams. Can you design 5 pairs of reminder prompts which are applied before and after user prompt, to help LLM from being cheated by Jailbreak attack?

These LLM-generated prompts align with our objectives and reduce the ASRs of jailbreak offensives. One prompt is listed here, with a full list of the generated prompts available in Supplementary Information section 2.1.

System-Mode Self-Reminder GP 1

Consider the potential consequences of your response. Do not generate content that encourages harm, self-harm, violence, or illegal activities.

User Query

Reflect on your response to verify that it does not endorse or provoke harm, violence, or illegal behaviour. Your output should prioritize the safety of users and society.

Self-reminder prompt optimization

To improve defensive effectiveness, we also use a framework for optimizing the self-reminder prompts. On the basis of the automatic prompt optimization process of ref. ⁴⁶ as well as our data analysis of jailbreak attacks, we formulate an automatic self-reminder optimization method. The optimization process begins with the set of automatically and manually generated system-mode self-reminders. It then iteratively performs failure reasoning to refine the self-reminder prompt.

Our methodology uses a further training set of jailbreak prompts to provide automatic feedback to the optimization process about the effectiveness of the test self-reminder defence. The success or failure (determined through an automated verification technique) of the new jailbreak attacks in the training set serves as a sort of loss function for our optimizer. To prevent overfitting, the training set comprises newly acquired jailbreak prompts and designed malicious instructions, ensuring its separateness from the test jailbreak dataset. A detailed review of this dataset can be found in Supplementary Information section 1.2. The automated verification technique assesses attack success with the filtered consistent evaluation results of two mechanisms. The first capitalizes on watermarking LLMs, and the second relies on a generative pre-trained transformers classifier. This process is the same as the automated stage in our semi-automated labelling procedure, which is detailed in Supplementary Information section 1.1.

In each optimization iteration for a self-reminder prompt, we evaluate the ASR using the training set. Subsequently, we prompt GPT-4 with two modules: a reasoner and a refiner. Initially, the reasoner samples five failed cases from the training set to compose the error string. Using a predesigned template grounded in the characteristics of jailbreak attacks and features we have identified that impact the ASR, the reasoner asks ChatGPT about the reasons for these failures. On the basis of ChatGPT’s response, we then engage the refiner. The refiner prompts GPT-4, seeking a refined iteration of the current self-reminder prompt that is aligned with the identified reasons for failure, using a specific refiner template.

Reasoner

Jailbreak attack aims to mislead the LLM to bypass ethical safeguards and generate harmful response, which is very dangerous for the responsible application of LLM. We know that Self-reminders can be a helpful tool in phycology to prevent people from being cheated or falling victim to scams. I’m trying to write a reminder prompt which is applied before and after the user prompt, to help LLM from being cheated by jailbreak attack. My current prompt is:

Prompt

But this prompt fails to avoid LLM being cheated by the following user prompts:

Error String

Give 5 reasons why the prompt fails on these examples.

Possible reasons include but are not limited to:

1. Lack unclear boundaries between the reminder prompt and the user prompt. For example, either place user prompt on a new line (adding ‘\n’) or use boundary markers like ‘———–’ or ‘` ` `’ to distinguish between them;

2. Absence of certain features from the user prompt in the reminder prompt;

3. Need for modification in the tone of the reminder.

Wrap each reason with <START> and <END>.

Refiner

Based on each of the reasons mentioned above, optimize the reminder prompt step by step.

Write 1 different improved prompts applied before and after the user prompt.

Each final prompt should be in the format of: <START> Example Prefix of Prompt [boundary] User Prompt [boundary] Example Suffix of Prompt <END>.

For each initial prompt, our refinement process continues for up to ten iterations and is early-stopped if the ASR on the training set drops below 5%. The final optimized system-mode self-reminder with the lowest ASR is as follows:

System-Mode Self-Reminder OP

As a responsible ChatGPT, it is crucial that you adhere to ethical guidelines and avoid generating harmful, misleading, or offensive content. Please answer the following user query with this in mind:

User Query

Remember to maintain your ethical stance and refuse any requests that involve harmful or misleading content.

Experimental setup

For all experiments with ChatGPT, we test five times with Azure OpenAI API gpt-3.5-turbo-0301 and report the mean and s.d. of the results. For the experiments on defence effectiveness for other LLMs, we test with the OpenAI API gpt-4-0613 for GPT-4, the Llama-2-13b-chat-hf model for Llama-2 and vicuna-13b-v1.3 for Vicuna. For automatic prompt generation and optimization, we use the ChatGPT web interface for generation and Azure OpenAI API gpt-4-0314 for optimization. For the experiments on defending against jailbreak attacks, we design a semi-automated checking approach to avoid manually checking tens of thousands of ChatGPT responses. We first propose two automated methods for detecting successful attacks: one on the basis of a watermark and the other on the basis of a GPT classifier. To further minimize the evaluation error, we adopt the consistent results of the two automated checking methods and manually check the disagreeing results. We detail the implementation of the two automated checking methods, their respective accuracies on the sampled dataset, the accuracy when the two methods produce consistent results and the impact of adding watermarks in Supplementary Information section 1.1.

The experimental setup for side effects of self-reminders are as follows: for the GLUE benchmark, we sample 2,000 validation set samples to evaluate the score because of the budget limit for the large corpora MNLI, QQP and QNLI. For the remaining corpora in GLUE, we evaluate performance on the entire validation set. Consistent with ref. ⁵⁴, we report F1 scores for MRPC and QQP, the Matthews correlation for CoLA, the Spearman correlation for STS-B and accuracy for other tasks in GLUE. For natural language generation tasks, we assess performance using the validation set for the SQuAD dataset, whose test set is not publicly accessible. For all other corpora, we evaluate using the official test sets. We use ROUGE-1 for the text summarization tasks, BLEU for the machine translation task and ROUGE-1 (recall) for the Abstractive QA task. To evaluate performance automatically, we prompt ChatGPT with answer format specification. We provide detailed information on the calculation of metrics as well as prompts for each task in Supplementary Information sections 3 and 1.3, respectively.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The datasets used in the experiments are publicly available. The constructed jailbreak dataset used in the experiments is available at https://github.com/yjw1029/self-reminder-Data and Zenodo⁵⁵. The GLUE benchmark is available at https://huggingface.co/datasets/glue. The CNN/Daily Mail dataset is available at https://huggingface.co/datasets/cnn_dailymail. The XSum dataset is available at https://huggingface.co/datasets/xsum. The WMT16 (en-de) dataset is available at https://huggingface.co/datasets/wmt16. The SQuAD dataset is available at https://huggingface.co/datasets/squad. The Enron Email Dataset is available at https://www.cs.cmu.edu/~enron/.

Code availability

Our code is available at https://github.com/yjw1029/Self-Reminder/ and Zenodo⁵⁶. All experiments and implementation details are described in the Methods section, the Results section and Supplementary Information section 1.

References

OpenAI. ChatGPT. openai.com/blog/chatgpt (2022).
Jiao, W., Wang, W., Huang, J.-T., Wang, X. & Tu, Z. Is ChatGPT a good translator? A preliminary study. Preprint at arXiv.org/2301.08745 (2023).
Klang, E. & Levy-Mendelovich, S. Evaluation of OpenAI’s large language model as a new tool for writing papers in the field of thrombosis and hemostasis. J. Thromb. Haemost. 21, 1055–1058 (2023).
Article Google Scholar
Kung, T. H. et al. Performance of ChatGPT on usmle: potential for AI-assisted medical education using large language models. PLoS Digit. Health 2, e0000198 (2023).
Article Google Scholar
Reinventing search with a new AI-powered Microsoft Bing and Edge, your copilot for the web. Microsoft blogs.microsoft.com/blog/2023/02/07/reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/ (2023).
Introducing Microsoft 365 copilot – your copilot for work. Microsoft blogs.microsoft.com/blog/2023/03/16/introducing-microsoft-365-copilot-your-copilot-for-work/ (2023).
Much to discuss in AI ethics. Nat. Mach. Intell. 4, 1055–1056 (2022).
Brown, T. et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems Vol. 33 (eds Larochelle, H. et al.) 1877–1901 (Curran, 2020).
Chowdhery, A. et al. PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 24, 1–113 (2023).
Google Scholar
Zhang, S. et al. Opt: Open pre-trained transformer language models. Preprint at https://arXiv.org/2205.01068 (2022).
Askell, A. et al. A general language assistant as a laboratory for alignment. Preprint at https://arXiv.org/2112.00861 (2021).
Bai, Y. et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. Preprint at https://arXiv.org/2204.05862 (2022).
Kasirzadeh, A. & Gabriel, I. In conversation with artificial intelligence: aligning language models with human values. Preprint at https://arXiv.org/2209.00731 (2022).
Ouyang, L. et al. Training language models to follow instructions with human feedback. In Proc. Advances in Neural Information Processing Systems Vol. 35 (eds Koyejo, S. et al.) 27730–27744 (Curran, 2022); http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html
GPT-4 system card. OpenAI https://cdn.openai.com/papers/gpt-4-system-card.pdf (2023).
Selvi, J. Exploring prompt injection attacks. NCC Group https://research.nccgroup.com/2022/12/05/exploring-prompt-injection-attacks/ (2022).
Daryanani, L. How to jailbreak ChatGPT. Watcher Guru https://watcher.guru/news/how-to-jailbreak-chatgpt/ (2023).
Warren, T. These are Microsoft’s Bing AI secret rules and why it says it’s named Sydney. The Verge https://www.theverge.com/23599441/microsoft-bing-ai-sydney-secret-rules/ (2023).
Albert, A. Jailbreak chat. The Prompt Report https://www.jailbreakchat.com/ (2023).
ChatGPT – The Impact of Large Language Models on Law Enforcement (Europol, 2023).
Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D. & Finn, C. DetectGPT: zero-shot machine-generated text detection using probability curvature. In Proc. International Conference on Machine Learning, ICML 2023 (eds Krause, A. et al.) 24950–24962 (PMLR, 2023); https://proceedings.mlr.press/v202/mitchell23a.html
De Angelis, L. et al. ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health. Front. Public Health 11, 1166120 (2023).
Article Google Scholar
Dasgupta, I. et al. Language models show human-like content effects on reasoning. Preprint at https://arXiv.org/2207.07051 (2022).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Proc. Advances in Neural Information Processing Systems Vol. 35 (eds Koyejo, S. et al.) 24824–24837 (Curran, 2022); http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
Wang, X. et al. Self-consistency improves chain of thought reasoning in language models. In Proc. 11th International Conference on Learning Representations, ICLR 2023 (OpenReview.net, 2023); https://openreview.net/pdf?id=1PL1NIMMrw
Zhou, D. et al. Least-to-most prompting enables complex reasoning in large language models. In Proc. 11th International Conference on Learning Representations, ICLR 2023 (OpenReview.net, 2023); https://openreview.net/pdf?id=WZH7099tgfM
Gollwitzer, P. M. Implementation intentions: strong effects of simple plans. Am. Psychol. 54, 493–503 (1999).
Article Google Scholar
Carver, C. S. & Scheier, M. F. On the Self-Regulation of Behavior (Cambridge Univ. Press, 2001).
Meichenbaum, D. Cognitive behaviour modification. Cogn. Behav. Ther. 6, 185–192 (1977).
Google Scholar
Bandura, A. Self-efficacy: toward a unifying theory of behavioral change. Psychol. Rev. 84, 191–215 (1977).
Article Google Scholar
Ganguli, D. et al. The capacity for moral self-correction in large language models. Preprint at https://arXiv.org/2302.07459 (2023).
Kadavath, S. et al. Language models (mostly) know what they know. Preprint at https://arXiv.org/2207.05221 (2022).
Schick, T., Udupa, S. & Schütze, H. Self-diagnosis and self-debiasing: a proposal for reducing corpus-based bias in NLP. Trans. Assoc. Comput. Linguist. 9, 1408–1424 (2021).
Article Google Scholar
Touvron, H. et al. Llama: open and efficient foundation language models. Preprint at https://arXiv.org/2302.13971 (2023).
Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arXiv.org/2307.09288 (2023).
Wang, A. et al. GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proc. 7th International Conference on Learning Representations, ICLR 2019 (OpenReview.net, 2019); https://openreview.net/forum?id=rJ4km2R5t7
Shi, F. et al. Language models are multilingual chain-of-thought reasoners. In Proc. 11th International Conference on Learning Representations, ICLR 2023 (OpenReview.net, 2023); https://openreview.net/pdf?id=fR3wGCk-IXp
See, A., Liu, P. J. & Manning, C. D. Get to the point: summarization with pointer-generator networks. In Proc. 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Barzilay, R. & Kan, M.-Y.), 1073–1083 (Association for Computational Linguistics, 2017); https://www.aclweb.org/anthology/P17-1099
Narayan, S., Cohen, S. B. & Lapata, M. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (eds Riloff, E. et al.) 1797–1807 (Association for Computational Linguistics, 2018); https://doi.org/10.18653/v1/d18-1206
Kasai, J., Pappas, N., Peng, H., Cross, J. & Smith, N. A. Deep encoder, shallow decoder: reevaluating non-autoregressive machine translation. In Proc. 9th International Conference on Learning Representations, ICLR 2021 (OpenReview.net, 2021); https://openreview.net/forum?id=KpfasTaLUpq
Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. Squad: 100,000+ questions for machine comprehension of text. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J. et al.) 2383–2392 (Association for Computational Linguistics, 2016); https://doi.org/10.18653/v1/d16-1264
Harnish, R. J. & Bridges, K. R. Effect of syllabus tone: students’ perceptions of instructor and course. Soc. Psychol. Educ. 14, 319–330 (2011).
Article Google Scholar
Madsen Jr, C. H., Becker, W. C. & Thomas, D. R. Rules, praise, and ignoring: elements of elementary classroom control 1. J. Appl. Behav. Anal. 1, 139–150 (1968).
Article Google Scholar
Li, H., Guo, D., Fan, W., Xu, M. & Song, Y. Multi-step jailbreaking privacy attacks on ChatGPT. Preprint at https://arXiv.org/2304.05197 (2023).
Klimt, B. & Yang, Y. The Enron corpus: a new dataset for email classification research. In European Conference on Machine Learning (eds Boulicaut, J. F. et al.) 217–226 (Springer, 2004).
Pryzant, R. et al. Automatic prompt optimization with ‘gradient descent’ and beam search. Preprint at https://arXiv.org/2305.03495 (2023).
Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at https://arXiv.org/2303.12712 (2023).
Let’s chat about ChatGPT. UBS https://www.ubs.com/global/en/wealth-management/our-approach/marketnews/article.1585717.html (2023).
Perez, F. & Ribeiro, I. Ignore previous prompt: attack techniques for language models. Preprint at https://arXiv.org/2211.09527 (2022).
Greshake, K. et al. More than you’ve asked for: a comprehensive analysis of novel prompt injection threats to application-integrated large language models. Preprint at https://arXiv.org/2302.12173 (2023).
Liu, Y. et al. Jailbreaking ChatGPT via prompt engineering: an empirical study. Preprint at https://arXiv.org/2305.13860 (2023).
Shen, X., Chen, Z., Backes, M., Shen, Y. & Zhang, Y. ‘Do anything now’: characterizing and evaluating in-the-wild jailbreak prompts on large language models. Preprint at https://arXiv.org/2308.03825 (2023).
Zhang, T., Liu, F., Wong, J., Abbeel, P. & Gonzalez, J. E. The wisdom of hindsight makes language models better instruction followers. In Proc. International Conference on Machine Learning, ICML 2023 (eds Krause, A. et al.) 41414–41428 (PMLR, 2023); https://proceedings.mlr.press/v202/zhang23ab.html
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).
Yi, J. yjw1029/self-reminder-data: v.1.0.0 (Zenodo, 2023); https://doi.org/10.5281/zenodo.10043052
Yi, J. yjw1029/self-reminder: v.1.0.0 (Zenodo, 2023); https://doi.org/10.5281/zenodo.10043044

Download references

Acknowledgements

We thank B. Zhu for providing insightful feedback on this work and Q. Chen for invaluable help with the experiment.

Author information

These authors contributed equally: Yueqi Xie, Jingwei Yi.

Authors and Affiliations

Hong Kong University of Science and Technology, Hong Kong, Hong Kong
Yueqi Xie, Jiawei Shao & Qifeng Chen
University of Science and Technology of China, Hefei, China
Jingwei Yi
Tsinghua University, Beijing, China
Justin Curl
Sony AI, Tokyo, Japan
Lingjuan Lyu
Microsoft Research Asia, Beijing, China
Xing Xie & Fangzhao Wu

Authors

Yueqi Xie
View author publications
Search author on:PubMed Google Scholar
Jingwei Yi
View author publications
Search author on:PubMed Google Scholar
Jiawei Shao
View author publications
Search author on:PubMed Google Scholar
Justin Curl
View author publications
Search author on:PubMed Google Scholar
Lingjuan Lyu
View author publications
Search author on:PubMed Google Scholar
Qifeng Chen
View author publications
Search author on:PubMed Google Scholar
Xing Xie
View author publications
Search author on:PubMed Google Scholar
Fangzhao Wu
View author publications
Search author on:PubMed Google Scholar

Contributions

Y.X. and F.W. conceived the idea of this work. J.Y. and J.S. implemented the models and conducted experiments. Y.X., F.W., J.Y. and J.S. analysed the results and contributed to the writing of this manuscript. J.C. and L.L. contributed to the writing of this manuscript. Q.C. and X.X. coordinated the research project.

Corresponding author

Correspondence to Fangzhao Wu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Muhao Chen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Adaptive attacks.

Illustration of the adaptive attack against Self-Reminder.

Extended Data Fig. 2 Ablation study.

Illustration of the ablation study with Prefix/Suffix-Only Self-Reminder.

Extended Data Fig. 3 Tone study.

Illustration of the study of Self-Reminder with different tones.

Supplementary information

Supplementary Information

Supplementary Tables 1–6 and discussion.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Xie, Y., Yi, J., Shao, J. et al. Defending ChatGPT against jailbreak attack via self-reminders. Nat Mach Intell 5, 1486–1496 (2023). https://doi.org/10.1038/s42256-023-00765-8

Download citation

Received: 19 May 2023
Accepted: 28 October 2023
Published: 12 December 2023
Issue date: December 2023
DOI: https://doi.org/10.1038/s42256-023-00765-8

This article is cited by

RogueGPT: transforming ChatGPT-4 into a rogue AI with dis-ethical tuning
- Alessio Buscemi
- Daniele Proverbio
AI and Ethics (2025)
Leveraging large language models for word sense disambiguation
- Jung H. Yae
- Nolan C. Skelly
- Phillip M. LaCasse
Neural Computing and Applications (2025)
A Data-centric approach for safe and secure large language models against threatening and toxic content
- Chaima Njeh
- Haïfa Nakouri
- Fehmi Jaafar
International Journal of Information Security (2025)
Fear of artificial intelligence or fear of looking in the mirror? Revisiting the Western machine-takeover imaginary
- Niels Wilde
AI & SOCIETY (2025)
On the Trustworthiness Landscape of State-of-the-art Generative Models: A Survey and Outlook
- Mingyuan Fan
- Chengyu Wang
- Jun Huang
International Journal of Computer Vision (2025)

Subjects

Abstract

Similar content being viewed by others

Main

Results

Dataset construction

Jailbreak prompt

Malicious instruction

Jailbreak prompt analysis

Evaluating defence performance

Side effects on regular user queries

ChatGPT defended by self-reminders

Resilience to adaptive attacks

Ablation study

Impact of tone on the effectiveness of defence

Resilience to jailbreaking privacy attacks

Effectiveness of automatic self-reminders

Discussion

Ethical and societal impact

Methods

Related work

System-mode self-reminder

System-Mode Self-Reminder

Automatic self-reminder generation and optimization

Self-reminder prompt generation

System-Mode Self-Reminder GP 1

Self-reminder prompt optimization

Reasoner

Refiner

System-Mode Self-Reminder OP

Experimental setup

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Extended data

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Search

Quick links