Abstract
The rapid growth of AI systems has been fueled by large-scale human data, intensifying concerns over the unauthorized use of intellectual property and privacy-sensitive content during model training. Auditing such misuse is particularly challenging since mainstream AI services operate as black boxes, exposing only generated outputs while concealing their training and inference processes. In this work, inspired by chemical isotope tracing, we introduce the concept of information isotopes to trace training data within opaque AI systems. We propose an information-isotope tracing framework that selectively marks target data elements and detects their propagation in model outputs, providing concrete evidence of data utilization under black-box access. Experiments on thirteen AI models across six datasets demonstrate that our method distinguishes training from non-training data with up to 99% accuracy and strong statistical significance (p < 0.01) using approximately 4,000 words of evidence. An open-source tool is released to support practical data rights protection.
Introduction
Artificial intelligence (AI) technologies, exemplified by large language models (LLMs), have witnessed remarkable advancements, showcasing exceptional capabilities in generating content with human-like coherence1. These AI systems have demonstrated the ability to provide expert-level insights tailored to user needs and, in some cases, have achieved outputs that rival human performance, thereby exerting a growing influence across various societal domains2,3. Recent investigations into the underlying mechanisms of AI systems have revealed that their primary source of intelligence lies in the extensive knowledge embedded within their massive training data4,5. Consequently, an intensifying competition has emerged among AI developers to acquire increasingly large-scale, human-generated datasets in order to maintain a leading edge in AI model development6,7. However, many large-scale datasets sourced from public internet platforms are governed by strict licensing agreements that prohibit commercial use or even data collection6,8. In addition, the user data collected from online social media or private domains (e.g., the user interaction records on AI platforms) may also contain privacy-sensitive information, the utilization of which for AI training shall be authorized9,10. Regrettably, many AI developers, either inadvertently or deliberately, are incorporating unauthorized data into their training processes, thereby infringing upon the rights of data owners and raising significant ethical and legal concerns6,8. Moreover, AIs trained on such unauthorized data are also at an elevated risk of producing outputs that are relevant to privacy- or intellectual property-sensitive information embedded in their training data11,12. These risks not only facilitate the illegal distribution of unauthorized data, exacerbating infringements on data ownership rights, but also expose AI users to potential legal and ethical liabilities associated with data misuse. 
A growing number of lawsuits, including The N.Y. Times v. Microsoft, Tremblay v. OpenAI, Andersen v. Stability AI, Authors Guild v. OpenAI, and other similar cases, highlight the urgent need for data rights protection in AI development.
To safeguard the rights of data owners, governance institutions worldwide have enacted legislation to prevent the unauthorized use of data13,14. A critical aspect of protecting data rights, particularly in legal disputes, lies in presenting robust evidence of data misuse. In traditional data breach scenarios, infringers typically exploit unauthorized transactions of raw data for personal or commercial gain15. Thus, examining the consistency between the original data and the unauthorized data can serve as compelling evidence of infringement15,16. For example, the verbatim replication of original literary works on unauthorized platforms is a clear indication of data infringement. However, in the context of AI training, unauthorized data usage manifests differently: instead of distributing raw data, infringers mainly exploit it to enhance AI system performance17,18,19,20. Consequently, only the resulting AI services, rather than the original data, are made publicly accessible, posing significant challenges in identifying and evidencing data infringement21. This issue is further exacerbated by the tendency of AI developers to obscure or entirely withhold information regarding the sources of their training data6. This lack of transparency makes it hard for data owners even to become aware of unauthorized data usage in AI training, thereby complicating the enforcement of data protection regulations and compliance with legal frameworks.
To address this challenge, recent studies have explored various membership inference techniques22,23,24,25,26 to detect the training data of advanced AIs (e.g., LLMs). Some methods largely build upon the empirical observation that LLMs tend to assign high generation likelihoods to most tokens from a training data entry22,23,24,25,26. For instance, Shi et al.22 introduced a method that averages the K% lowest generation likelihood values among the tokens in a suspected input, identifying samples with high average scores as likely training data. In addition, other works have observed that perturbations to training data samples often result in a more significant degradation in generation confidence than similar perturbations to non-training data. For example, Fu et al.27 proposed to rewrite the input sample using a language model and then compute the generation likelihood degradation of the target model as the detection score. Collectively, these methods typically adopt a gray-box assumption, requiring access to internal computation features on the input data, such as token-level likelihoods or probability distributions, to audit its membership. However, most commercial AI systems remain highly opaque7, providing only the final generated content while concealing all gray-box internal computation features (Fig. 1A). As a result, the applicability of these methods to real-world AI deployments is severely limited. Moreover, several studies have begun to explore label-only membership inference attacks (MIA) against LLMs28,29,30, typically under the assumption that a surrogate model is available to approximate the generation probability distribution of the target LLM. However, the generalizability of such methods remains limited, as it is generally infeasible to construct a surrogate model that accurately aligns with a closed-source LLM whose architecture is usually unknown, heterogeneous, and potentially large in scale.
Therefore, in practical scenarios, reliable detection of unauthorized training data usage would usually need to rely on black-box methodologies, which are constrained to observing only the model outputs, without access to any surrogate models or internal parameters of the target LLM. However, this constraint introduces significant challenges. Advanced AI models possess extensive general knowledge and are explicitly designed to avoid the verbatim reproduction of their training data31,32,33. These AIs may not only refuse to generate outputs that directly replicate their training data but may also produce content similar to non-training data based on their broad generalization capabilities. This overlap makes it inherently difficult to distinguish between outputs derived from training data and those generated from non-training data solely through output analysis. For instance, when an AI model is prompted with a segment of its original training data, the generated continuation can exhibit low similarity to the original continuation, underscoring the complexities associated with detecting training data based on generated content (Fig. 1C). Therefore, how to develop effective methods for auditing data usage in purely black-box AI systems warrants further investigation.
A Limitations of mainstream training data detection methods. Most existing methods typically input suspected content into the target model and analyze the corresponding intermediate variables (e.g., probability distributions) to infer training data usage. However, such internal signals are generally inaccessible in mainstream AI platforms (e.g., ChatGPT), limiting the practical applicability of these methods. B Distributions of information isotope recovery success rates. We evaluate the behavior of four opaque LLMs (e.g., ChatGPT and Gemini) with respect to the recoverability of fine-grained textual elements, using datasets comprising news articles and programming code repositories. Results show that, when presented with masked content, a model is significantly more likely to recover the exact original textual element than to produce semantically similar but distinct alternatives (termed information isotopes), if the content originates from their training data. This effect is substantially attenuated for non-training data, suggesting the potential utility of information isotope recovery as a signal for auditing data usage in AI training. C Generation similarity analysis. Similarity scores between AI-generated and original human-written continuations show no statistically significant difference across training and non-training sets, underscoring the challenges of relying solely on AI-generated content to detect training data usage. Detailed settings and discussions for these evaluations are presented in the Results section. The center point represents the median, the box limits correspond to the upper and lower quartiles, and the whiskers extend to 1.5 times the interquartile range. D Information isotope tracing method. The method begins by selectively marking high-traceability elements (depicted as blue symbols) within a suspected input. It then constructs semantically confounding information isotopes (symbols of identical shapes but different colors). 
Finally, it detects potential training data usage by examining whether AI outputs contain the original elements rather than their isotopes.
In this study, inspired by isotope labeling techniques widely used for element tracing in microscopic experiments, we investigate an analogous method for tracing data within opaque AI training processes. We begin by presenting empirical evidence that demonstrates the significant traceability of fine-grained semantic elements embedded in training data. Notably, while existing gray-box methods have partially explored this traceability, it remains insufficiently established in fully black-box settings where only model generations are accessible. We first define a semantic element as a fine-grained unit within data, such as a named entity, alongside its synonymous variants that preserve semantic meaning while differing in symbolic form, termed information isotopes. We further design a semantic confounding instruction task, which prompts the AI model to recover a target element from its corresponding information isotopes by filling in a masked position within the input data to which the target element belongs. Our empirical results (Fig. 1B) indicate that, when the input data has been encountered during training, the model exhibits a strong tendency to preferentially memorize and regenerate the target element of the exact symbolic expression, rather than selecting among alternative information isotopes, achieving a recovery success rate (RSR) of 75.5%. In contrast, this memorization pattern is substantially attenuated for non-training data, where the RSR drops to 65.5% under the same conditions. Leveraging this observation, we introduce an information isotope tracing mechanism for auditing training data of opaque AI systems, which selectively marks certain elements in the training data and subsequently identifies their presence in AI-generated outputs by distinguishing them from their isotopic counterparts (Fig. 1D).
We conduct an extensive comparison between InfoTracer and representative baselines across 13 LLMs and 6 datasets spanning three critical domains, including privacy-sensitive medical texts, copyrighted books, and novels. To ensure reliable performance verification, we first evaluate our method on the open-source LLaMA model series, whose training corpora are publicly documented. The results demonstrate that InfoTracer achieves very high detection accuracy (exceeding 99%) with strong statistical significance (p-value < 0.01) in distinguishing training data from non-training data. Further analyses reveal the generalizability of InfoTracer across diverse data domains and its robustness against various adversarial data attack strategies, highlighting its potential for practical data-auditing deployments. We additionally extend our evaluation to nine commercial LLM API services, including GPT-3.5, GPT-4o, Claude-3.0, and Doubao-1.5-pro, using datasets containing suspected member and non-member samples from both news and code domains. InfoTracer consistently maintains high detection accuracy, confirming its robust effectiveness and practical applicability for auditing training data usage in opaque commercial AI systems. Evaluations on several state-of-the-art commercial LLMs (e.g., GPT-4o) also confirm the superiority of our method over representative baselines, and demonstrate its scalability to very large models. A further large-scale experiment involving twenty-one complete novels (over one million words) with the Doubao-1.5-pro model substantiates the ability of InfoTracer to accurately and significantly identify long-form training data, underscoring its scalability and real-world relevance. 
We also release our work as open-source, user-friendly software, intended to serve as an inclusive, broadly applicable, and practical tool for supporting individuals in protecting their data rights in both privacy and intellectual property, and for promoting a more equitable and responsible AI development ecosystem.
Results
The property of information isotopes
We first provide a formal definition of information isotopes and then examine their traceability properties within AI training processes. A semantic element refers to a basic unit of meaning, such as an entity (e.g., “New York”) or an individual token, embedded within an input data entry. An information isotope of a given semantic element is defined as a variant that preserves the same semantic content under an identical context, e.g., “NYC”. We further define element traceability as the probability that a target AI model can successfully identify the target semantic element from its semantically confounding information isotopes when tasked with completing a masked version of the original data entry. The central insight underlying our work is that, for a target AI, semantic elements originating from training data typically exhibit stronger traceability across their information isotopes, while this pattern is substantially degraded for those from non-training data. This distinctive traceability serves as both necessary and sufficient evidence for identifying unauthorized data usage in opaque AI systems.
To substantiate this claim, we investigate the traceability characteristics of semantic elements through their associated information isotopes in AI training. We conduct experiments using four representative opaque AI APIs (GPT-4o, GPT-3.5, Claude-3.0-haiku, and Gemini-1.5-pro-001) and a dataset comprising 600 public news articles sourced from three renowned press outlets. Half of these articles were published in 2022 (a time period commonly associated with AI training datasets) and can be found in the Common Crawl dataset, a corpus widely used to train LLMs; these serve as the training data. The remaining half were published in 2024, which is after the knowledge cut-off dates of these models; these serve as non-training data. Details of the API versions and corresponding knowledge cut-off dates are provided in Supplementary Table 3. We first mask a semantic element in the data, and then prompt the AI with the surrounding context to recover the target element from a set of ambiguous information isotopes. In addition to the temporally partitioned benchmark textual dataset, we also curate a new code dataset that does not follow a time-based partitioning scheme. Specifically, we utilize the Stack dataset34, which comprises open-source code repositories collected from GitHub and is widely adopted in training code-focused AI models (e.g., Code LLaMA35). From this dataset, we extract 365 C language code files to represent the training data for the target opaque AI models (positive samples). To construct a representative set of non-training data, we additionally collect 365 C language code files from a private code corpus. These files were uploaded in 2021 and, given their timing and private nature, are unlikely to have been included in the training corpora of commercial LLMs or generated by the commercial LLMs. As such, they can serve as the negative samples in this code dataset.
Moreover, we analyzed the semantic similarity distributions of both the member and non-member data to ensure that they are well aligned and do not introduce systematic biases that could confound membership detection. Detailed results are provided in Supplementary Fig. 7. In addition to the four widely adopted commercial LLMs referenced earlier, we include four recently released models, DeepSeek, GLM-4, Grok-3, and Qwen-3, to broaden the scope of this code data-based evaluation. We also replace the Gemini-1.5-pro-001 model with its newer version, Gemini-1.5-pro-002. For our analysis, each line of code is treated as an individual semantic element. We randomly mask one such element and prompt the target AI to recover the masked content from a set of semantically confounding alternatives. More details of these two datasets are provided in the Methods section and the Supplementary Information. Furthermore, we calculate the RSR for training and non-training data to evaluate the traceability of information isotopes.
Figure 1B presents the results for elements extracted from code data and textual elements of four part-of-speech categories (i.e., noun, verb, adverb, and adjective). The results demonstrate that foundation AIs effectively recover target semantic elements from their information isotopes, even when these isotopes encode consistent semantics. For instance, the average RSR across five categories of semantic elements within the training dataset is approximately 75.5%, significantly surpassing the baseline random guess rate. This finding underscores the memory capacity of large-scale AI models, which enables them to trace specific semantic elements present in training data. Moreover, we observe a marked degradation in the average RSR for non-training data compared to training data. For example, the averaged RSR for noun-type (NN) semantic elements in the training dataset is 78.1%, whereas for non-training data it drops to 52.6%. This discrepancy is attributed to the semantic similarity among a group of information isotopes, making it significantly challenging for AIs to recover specific non-training semantic elements based solely on general knowledge without relevant memory traces on the corresponding content. These results demonstrate that semantic elements possess significant traceability properties among their information isotopes in AI training processes, revealing the potential of information isotopes to differentiate between training and non-training data solely through access to AI generation within opaque AI systems. Moreover, we further demonstrate that elements corresponding to certain part-of-speech categories (e.g., conjunction) exhibit comparable traceability in both training and non-training data in Supplementary Fig. 4. This finding underscores the importance of selectively incorporating semantic elements for tracing data usage during AI training.
Next, we demonstrate that the traceability of information isotopes is not a trivial characteristic. To substantiate this claim, we explore an intuitive detection framework that simply analyzes the AI continuation output. Specifically, we present the AI models with segments of training data prefixes and task them with generating the remaining content, then compute the similarity between the AI-generated continuations and the corresponding ground-truth texts. The similarity scores are measured using three metrics: ROUGE36, BERT similarity (BERT)37, and edit distance (ED)38. The results in Fig. 1C indicate that the average continuation similarity scores derived from training data closely approximate those from non-training data, and the statistical test (t-test) suggests negligible differences between the two distributions (p-value > 0.05). This occurs because state-of-the-art (SOTA) AIs are designed to avoid verbatim reproduction of their training data, which substantially reduces the similarity. It is therefore difficult to find AI-generated content that contains enough consecutive words identical to the raw training data to exceed a threshold that might be considered plagiarism. These findings underscore the essential role of information isotopes in identifying training data within opaque AI systems.
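To make the continuation-similarity baseline concrete, the following Python sketch computes one of the three metrics, an edit-distance-based similarity, between a generated continuation and its reference. The word-level tokenization by whitespace and the normalization by the longer sequence are assumptions made for illustration; the source does not specify how ED was normalized, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def continuation_similarity(generated: str, reference: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical continuations.

    Assumption: whitespace tokenization and normalization by the longer text.
    """
    g, r = generated.split(), reference.split()
    if not g and not r:
        return 1.0
    return 1.0 - edit_distance(g, r) / max(len(g), len(r))
```

Under this sketch, scores for member and non-member prefixes would be collected separately and then compared with a two-sample test, mirroring the comparison reported for Fig. 1C.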
InfoTracer: information isotope tracing mechanism
We investigate the problem of auditing the training data of fully opaque LLMs, specifically aiming to infer the membership status of a given data entry by analyzing the model-generated content. A formal problem definition is presented in the “Methods” section. Drawing inspiration from the isotope labeling techniques commonly used in scientific experiments to trace microscopic matter, we introduce InfoTracer, an information isotope tracing method designed to detect training data usage in opaque LLMs. The key insight underpinning InfoTracer is that certain semantic elements, when included in the training data, exhibit strong traceability across their contextual alternatives, which we term information isotopes. The core of the InfoTracer framework is a semantic confounding instruction task, which prompts the model to recover a target element from its information isotopes by filling in the mask within the suspected input data. Figure 2 shows that InfoTracer contains four main components.
The framework comprises four components. A Semantic element selection. It selects fine-grained semantic elements from the input data that exhibit distinct traceability characteristics between training and non-training data, serving as potential indicators of memorization by the AI model. B Context-aware information isotope generator. It produces semantic confounding alternatives (referred to as information isotopes) for a given target semantic element by incorporating its surrounding contextual information. C Probe quality assessment. It identifies and preserves probes with the highest reliability for detecting the membership of suspected data. D Information-based isotope probing. It evaluates the generation pattern of the target AI by measuring the recovery success rate of the original semantic element in contrast to its isotopic variants, thereby inferring the likelihood that the element was present in training data.
The first step, semantic element selection, segments the input data into basic semantic units and identifies those elements with high traceability across their semantically confounding information isotopes, thereby enabling effective detection of training data. Specifically, given a textual paragraph or document T, we first apply the tokenizer from the NLTK library to decompose the input into a sequence of tokens. We then utilize named entity recognition tools from the spaCy library to aggregate relevant tokens into unified semantic entities. These identified entities, together with the remaining individual tokens, collectively form the semantic elements of T. For code-based inputs, segmentation is based on newline characters, with each line treated as an independent semantic element. The complete set of semantic elements is represented as \({\widehat{{{\mathcal{F}}}}}_{T}=\{\,{\widehat{f}}_{i}| i=1,\ldots,{F}^{{\prime} }\}\), where \({\widehat{f}}_{i}\) refers to the i-th semantic element and \({F}^{{\prime} }\) is the total number of elements. Furthermore, as illustrated in Fig. 1B and Supplementary Fig. 4, different semantic elements exhibit varying levels of traceability with respect to their information isotopes during AI training, contributing different degrees of informativeness for training data identification. Therefore, in the case of textual inputs, we retain only semantic elements corresponding to entity nouns, verbs, adverbs, and adjectives, as these parts of speech exhibit distinct traceability patterns between training and non-training data. Moreover, it is important to emphasize that programming language data differs markedly from natural language in both symbolic structure and semantic distribution, which results in distinct traceability properties. 
Given that the relative proportions of programming and natural language components can vary across different code-based inputs, we systematically exclude human-language comments from the segmented code lines to improve the unbiasedness and reliability of selected semantic elements. Subsequently, we uniformly sample from the remaining code lines to construct the semantic element set. Finally, the resulting set of selected semantic elements is denoted as \({{{\mathcal{F}}}}_{T}=\{\,{f}_{i}| \,{f}_{i}\in {\widehat{{{\mathcal{F}}}}}_{T},i=1,2,...,F\}\), where F denotes the total number of selected semantic elements.
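As a rough illustration of the code-input branch of this step, the following Python sketch splits a C source file on newlines, drops blank and comment-only lines, and samples uniformly from what remains. The function name `select_code_elements`, the `num_elements` and `seed` parameters, and the simple comment heuristics are assumptions made for illustration, not the released implementation.

```python
import random


def select_code_elements(source: str, num_elements: int, seed: int = 0) -> list[str]:
    """Return up to `num_elements` code lines sampled uniformly without replacement.

    Simplifications (assumptions): only comment-only lines are removed; trailing
    comments after code are kept, and a line that opens with a block comment is
    dropped entirely even if code follows the closing marker.
    """
    candidates = []
    in_block_comment = False
    for raw in source.split("\n"):
        line = raw.strip()
        if in_block_comment:
            # Skip everything until the block comment closes.
            if "*/" in line:
                in_block_comment = False
            continue
        if not line or line.startswith("//"):
            continue
        if line.startswith("/*"):
            if "*/" not in line:
                in_block_comment = True
            continue
        candidates.append(line)
    rng = random.Random(seed)
    return rng.sample(candidates, min(num_elements, len(candidates)))
```

A fixed seed is used here only so that the sampled element set is reproducible across audit runs.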
The second step, context-aware information isotope generation, aims to produce a set of semantically similar alternatives, referred to as information isotopes, for a given semantic element within a specified context. The core motivation lies in the observation that the semantic interpretation of a phrase can vary significantly depending on its surrounding context. For instance, the term “apple” may denote either a fruit or a technology company, depending on the usage context. Thus, traditional dictionary-based synonym generation methods are insufficient for constructing semantically confounding isotopes that remain faithful to the contextual nuances of the original element set \({{\mathcal{F}}}\) within the document T. To overcome this limitation, we employ a generative language model to dynamically produce context-sensitive information isotopes. For each semantic element \(f\in {{\mathcal{F}}}\) within a data entry T, we first construct a masked version of the text, denoted as Tmask, by removing the target element f. The generative model is then prompted with Tmask to produce plausible candidate completions for the missing content. By sampling multiple times, we obtain a collection of contextually grounded isotopes for element f: \({{\mathcal{I}}}(\,f,T)=\{{i}_{k}| k=1,2,\ldots,I\}\) where I denotes the number of generated isotopes. These isotopes are designed to maintain high semantic similarity with the original element while introducing contextual ambiguity, thereby increasing the difficulty for a model to accurately infer the correct completion, especially when the content has not been explicitly memorized during training. Moreover, analyzing the traceability of a selected semantic element within its corresponding set of information isotopes offers a principled basis for auditing whether a specific data entry has been used during the AI training process.
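The masking and repeated-sampling loop described above might be sketched as follows. The generative model is represented by a caller-supplied function `sample_completion`, which is a hypothetical stand-in for whatever LLM is used; the `[MASK]` token, the `max_attempts` cap, and the de-duplication rule are likewise assumptions.

```python
from typing import Callable

MASK = "[MASK]"  # assumed mask token


def mask_element(text: str, element: str) -> str:
    """Build T_mask by replacing the first occurrence of the target element."""
    if element not in text:
        raise ValueError("target element not found in text")
    return text.replace(element, MASK, 1)


def generate_isotopes(text: str, element: str,
                      sample_completion: Callable[[str], str],
                      num_isotopes: int, max_attempts: int = 50) -> list[str]:
    """Sample candidate completions for the masked text and keep distinct ones.

    `sample_completion` is a hypothetical wrapper around a generative model:
    it takes the masked text and returns one proposed filling.
    """
    masked = mask_element(text, element)
    isotopes: list[str] = []
    for _ in range(max_attempts):
        if len(isotopes) == num_isotopes:
            break
        candidate = sample_completion(masked).strip()
        # Discard empty outputs, the target itself, and repeated candidates.
        if candidate and candidate.lower() != element.lower() and candidate not in isotopes:
            isotopes.append(candidate)
    return isotopes
```

Filtering out the target element keeps it from appearing twice among the candidates once it is added back for probing.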
The third step, probe quality assessment, aims to construct probing queries based on the target elements and their corresponding generated isotopes, and subsequently select high-quality probes for reliable detection of the membership status of the suspected data. Specifically, in practical scenarios, we can only access the generated content of the target AI in response to specific input prompts. The internal probabilities associated with generating a particular information isotope given a prompt are cached within the AI system and are not externally accessible. Furthermore, the tokenizer employed by the LLM is generally unknown, complicating efforts to compute generation probabilities using standard next-token prediction methods. To overcome these limitations, we propose an estimation strategy based on querying the target AI via a multiple-choice instruction. Specifically, the model is prompted to select the most appropriate semantic element from a set of information isotopes \({{\mathcal{I}}}(f,T)\) to complete a masked segment of text. Notably, the LLM is instructed to return only the choice index (i.e., “A”, “B”, “C”, or “D”), while responses involving invalid sampling are discarded. This strategy is designed to mitigate the impact of the black-box characteristics of the LLM, particularly its sampling behavior over an unknown vocabulary, on the accuracy of the estimation. The query prompt template \({{{\mathcal{Q}}}}_{f}\) is based on the following format: Here is a masked paragraph: [T]. We know that the correct answer to fill the mask is among the following candidates: [S]. Please help me choose the optimal one and answer the correct choice index. The correct answer is the target semantic element f, [T] is replaced by the masked context (Tmask) of the target element, and [S] is replaced by the shuffled set \({{\mathcal{I}}}(\,f,T)\cup \{\,f\}\). 
Using this framework, we construct a set of probe queries \(\widehat{E}=\{{\widehat{{{\mathcal{E}}}}}_{i}| i=1,\cdots \,,| {{\mathcal{F}}}| \}\) to evaluate the model memorization of the suspected data. However, due to the inherent imperfections in element selection and isotope generation, some constructed probes may be unreliable for membership inference. For example, an isotope may not be fully semantically equivalent to the target element in context, causing the LLM to select the target element based on common sense rather than memorization. To ensure probe reliability, we design a multi-view probe selection mechanism that evaluates probe validity along three complementary dimensions: (1) contextual leakage, assessing whether the surrounding context provides sufficient cues to distinguish the target element from its isotopes; (2) isotope confusability, quantifying the extent to which generated isotopes obscure or mislead the identification of the target element; and (3) target frequency, prioritizing infrequent target elements that are less likely to be correctly inferred without explicit memorization. Additionally, we leverage an LLM (e.g., GPT-4o) to assess the quality of each probe and adopt an in-context learning (ICL)-based scoring strategy that enables the model to learn from human-labeled examples and align its judgments with human preferences. This strategy assigns an overall quality score to each probe query \({\widehat{{{\mathcal{E}}}}}_{i}\), from which a unified ranking score is subsequently derived. Only the top-ranked probes are retained for downstream detection, forming the final probe set \(E=\{{{{\mathcal{E}}}}_{i}| {{{\mathcal{E}}}}_{i}\in \widehat{E},\,i=1,\ldots,M\}\), where M denotes the number of selected high-quality probes. We also note that the ICL examples are distinct from the evaluation data, and further details are provided in the Supplementary Information.
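The multiple-choice query used in this step can be sketched in Python as follows. The prompt wording follows the template quoted in the text; the lettered option layout, the function names `build_probe` and `parse_choice`, and the strict single-letter parsing rule are assumptions for illustration.

```python
import random
import string


def build_probe(masked_text: str, target: str, isotopes: list[str],
                rng: random.Random) -> tuple[str, str]:
    """Return (prompt, correct_label) for one shuffled multiple-choice probe."""
    candidates = list(isotopes) + [target]
    rng.shuffle(candidates)  # randomize the position of the target element
    labels = string.ascii_uppercase[:len(candidates)]
    options = "; ".join(f"{label}. {cand}" for label, cand in zip(labels, candidates))
    prompt = (
        f"Here is a masked paragraph: {masked_text}. "
        "We know that the correct answer to fill the mask is among the "
        f"following candidates: {options}. "
        "Please help me choose the optimal one and answer the correct choice index."
    )
    return prompt, labels[candidates.index(target)]


def parse_choice(response: str, num_choices: int):
    """Return the chosen label, or None when the reply is not a valid index."""
    reply = response.strip().upper().rstrip(".")
    valid = string.ascii_uppercase[:num_choices]
    return reply if len(reply) == 1 and reply in valid else None
```

Replies that `parse_choice` maps to `None` correspond to the invalid responses that the text says are discarded.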
The fourth step, information isotope-based probing, is designed to detect the membership status of suspected data samples based on the selected probe queries E. For each probe query \({{{\mathcal{E}}}}_{i}\), the target LLM is instructed to recover the masked target element from its corresponding set of information isotopes. This probing process is repeated Q times, each with a randomized ordering of the candidate choice group \(({{\mathcal{I}}}(\,f,T)\cup \{\,f\})\), resulting in a collection of model responses \({{{\mathcal{C}}}}_{{{\mathcal{E}}},j}\), where j = 1, …, Q. Each generated response is compared with the original content to obtain a binary observation sequence \(\{{\widehat{o}}_{{{\mathcal{E}}},j}\in \{0,1\}| \,j=1,\ldots,Q\}\), where \({\widehat{o}}_{{{\mathcal{E}}},j}\) denotes whether the j-th response correctly identifies the target element for the probe \({{\mathcal{E}}}\). The empirical success rate of recovering the target element under probe \({{\mathcal{E}}}\) is then computed as: \({\widehat{q}}_{{{\mathcal{E}}}}=\frac{1}{Q}{\sum }_{j=1}^{Q}{\widehat{o}}_{{{\mathcal{E}}},j}\). To quantify the overall traceability signal, InfoTracer aggregates the RSRs across all probe queries in E associated with a data entry T or dataset \({{\mathcal{D}}}\), yielding an activation score defined as: \(\widehat{q}=\frac{1}{| E| }{\sum }_{{{\mathcal{E}}}\in E}{\widehat{q}}_{{{\mathcal{E}}}}\). This aggregated activation score serves as an empirical indicator of potential training data usage by the target AI model. Membership detection is then performed by comparing the activation score \(\widehat{q}\) against a predefined threshold θ, where a data sample is classified as a member if \(\widehat{q} > \theta\). Following common practices in prior membership inference studies22,28,39, the threshold θ is calibrated using the activation score distribution estimated from a small validation set of non-member data. 
Recognizing the importance of providing strong and statistically supported evidence for claims of unauthorized data usage, InfoTracer also evaluates the statistical significance of detection results to ensure their robustness and reliability. A detailed case illustration on the workflow of our method is shown in Supplementary Fig. 6. Further methodological and implementation details are presented in the Methods section.
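The probing and scoring procedure described above can be sketched as follows. This is a minimal illustration, not the released tool: `ask_model` is a hypothetical callable standing in for a black-box query to the target LLM, and the quantile rule in `calibrate_threshold` is an assumption, since the text only states that θ is calibrated on non-member validation scores.

```python
import random
from statistics import mean
from typing import Callable, List, Optional, Sequence


def probe_success_rate(
    ask_model: Callable[[str, List[str]], str],  # hypothetical black-box query function
    context: str,                                # text with the target element masked
    target: str,                                 # the original element f
    isotopes: Sequence[str],                     # information isotopes I(f, T)
    Q: int = 10,
    rng: Optional[random.Random] = None,
) -> float:
    """Empirical success rate for one probe: the mean of Q binary observations."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(Q):
        choices = list(isotopes) + [target]
        rng.shuffle(choices)                 # randomized ordering of the candidate group
        answer = ask_model(context, choices)
        hits += int(answer == target)        # binary observation: 1 if the target is recovered
    return hits / Q


def activation_score(success_rates: Sequence[float]) -> float:
    """Average of the per-probe success rates over the probe set E."""
    return mean(success_rates)


def calibrate_threshold(nonmember_scores: Sequence[float], quantile: float = 0.95) -> float:
    """Assumed rule: take a high quantile of the non-member activation scores as theta."""
    ordered = sorted(nonmember_scores)
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[idx]


def is_member(score: float, theta: float) -> bool:
    return score > theta
```

Shuffling the candidate order on every repetition is what prevents a positional preference of the model from being mistaken for memorization.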
Performance evaluation on open-source models
In this section, we evaluate the membership inference performance of our proposed InfoTracer method in comparison with existing baseline methods. Since obtaining definitive ground-truth membership samples for fully closed-source models remains inherently challenging, we begin our evaluation using open-source LLaMA-1 models, for which the training corpus is publicly documented40. Following the widely adopted evaluation protocol established in prior studies22,23,39, all experiments are conducted on the WikiMIA dataset. In this benchmark, the membership samples correspond to Wikipedia articles that are part of the LLaMA-1 pre-training corpus, while the non-membership samples are derived from Wikipedia documents published after the release of LLaMA-1. The evaluation encompasses four model scales (7B, 13B, 30B, and 65B) and four detection data lengths (32, 64, 128, and 256 tokens), allowing us to systematically examine the impact of model size and input length on inference performance. This setup enables a clear distinction between known and unseen data, thereby providing a reliable basis for evaluating the accuracy and robustness of membership inference methods.
In addition, we select six representative baseline methods for comparison. The first group consists of adapted gray-box MIA methods. To emphasize the challenges of detecting training data in a purely black-box setting, we design black-box variants of two state-of-the-art gray-box detection methods: Neighbour41 and SPV-MIA27. Both methods originally rely on the difference in gray-box likelihood features between an input sample and its perturbed variant to infer membership. To adapt them for black-box evaluation, we modify their procedures by prompting the target LLM with both the original and perturbed inputs and instructing it to select the version it is more confident about. The comparison between InfoTracer and these adapted methods further highlights the inherent difficulty of constructing perturbation-based black-box MIAs. The second group includes two recent label-only MIA methods tailored for LLMs, i.e., PETAL29 and DPDLLM28. Their core framework depends on using a surrogate model to approximate the generation probability distribution of the black-box target LLM. In contrast, InfoTracer is surrogate-free, which enhances its generalization capability across different models and datasets. Besides, we also include the other two label-only baseline methods: SimNGram42 that examines continuation similarity, and DE-COP43 that examines model memorization on title-content pairs.
Detection performance comparison is illustrated in Fig. 3A (histogram plot). First, across different model sizes and data lengths, InfoTracer consistently demonstrates superior detection performance, outperforming both the gray-box method variants and black-box methods. This superiority across diverse configurations provides strong empirical evidence for the traceability properties of information isotopes, which enable the identification of training data without requiring access to the model’s generative probability distribution. Second, the gray-box method variants are less effective when adapted to the black-box setting, even performing comparably to random guessing. This observation suggests that while differences in gray-box likelihood features between original and perturbed samples serve as strong membership indicators when probability distributions are accessible, their confidence-based adaptations lose discriminative strength in purely black-box contexts. In such settings, the attacker can only access coarse-grained output signals, severely limiting methods originally designed to exploit fine-grained internal statistics. For instance, SPV-MIA identifies membership by analyzing token-level logit variations after input perturbation. By contrast, under black-box constraints, the model can only be queried to choose between original and perturbed inputs, without revealing token-level details. As a result, these subtle membership cues vanish, leading to a marked decline in detection performance. Third, the recent label-only methods such as PETAL and DPDLLM, which depend on a surrogate model to approximate the probability distribution of the target LLM, achieve relatively competitive results among baseline methods. 
Nevertheless, their reliance on surrogate model approximation introduces a distributional mismatch problem, where detection performance heavily depends on how well the surrogate aligns with the true target LLM, leading to unstable performance across model sizes and data lengths. In contrast, InfoTracer is entirely surrogate-free and thus avoids this source of bias, maintaining consistently superior detection accuracy across different model scales and datasets. This robustness underscores the potential of InfoTracer as a generalizable and practical solution for auditing training data in opaque LLMs. Fourth, results show that DE-COP fails to provide reliable detection under this detection scenario, since the WikiMIA dataset contains only short, standalone paragraphs without associated document titles, a setting that directly violates the core assumption required by DE-COP. In contrast, our method consistently achieves substantially higher accuracy across multiple datasets and model architectures, reinforcing that our method generalizes more robustly to practical, real-world auditing settings where short-form content and missing document metadata are the norm. Fifth, InfoTracer demonstrates greater efficacy with longer detection sequences, as extended text provides a richer reservoir of traceable information isotopes that its selection mechanism can effectively exploit. Notably, InfoTracer maintains strong performance even on short inputs, achieving an accuracy of 63.6% for 32-token entries (LLaMA-65B, K = 1), highlighting its remarkable sensitivity in membership inference. Moreover, the performance advantage of InfoTracer over baseline methods further increases with larger model sizes. Collectively, these results underscore the practical efficacy and scalability of InfoTracer as a reliable auditing tool for safeguarding data rights in the era of ever-expanding large-scale models.
We evaluate detection accuracy based on the LLaMA models and WikiMIA datasets. A The histogram plot reports AUC scores for single-entry detection (K = 1), while the ROC curves present multi-entry detection performance of InfoTracer versus the best-performing baseline. InfoTracer consistently and substantially outperforms all baseline methods, achieving over 99.99% detection accuracy with only 40 data entries (WikiMIA-128), roughly equivalent to the length of a four-page academic paper. This remarkable performance underscores the practical sensitivity and efficacy of InfoTracer, even under highly data-constrained auditing conditions. B Detection performance under varying suspected data sizes. We present the results averaged across target models based on WikiMIA-128. Our method exhibits improved detection accuracy as more data entries are examined, highlighting its superiority over baselines. C Statistical significance analysis. While baseline methods show minimal significance even with 50 entries, our method rapidly achieves strong statistical evidence of training data usage, underscoring its robustness in real-world auditing scenarios. D Visualization of detection features of InfoTracer. The separation between the detection features for training and non-training data becomes clearer as the number of examined entries (K) increases, indicating enhanced discriminative power.
Set-level detection performance and statistical significance
Next, in real-world applications, it is usually more practical to identify the use of a batch of unauthorized data rather than a single data entry in AI training. Thus, we further evaluate the performance of different methods in detecting the utilization of a dataset of size K for AI training. Specifically, we compute the average information isotope activation or similarity scores across entries in the target dataset as the detection metric for InfoTracer and the baseline methods, respectively. Due to space limitations, Fig. 3B only reports the average detection performance across target models for WikiMIA-128, and Fig. 3D displays the detection feature (activation scores) distribution of InfoTracer averaged across AI models, further illustrating the effectiveness of the proposed method. We observe that the baseline methods continue to exhibit limited efficacy in detecting datasets used for AI training under varying data volumes. This finding further highlights that these baselines are ineffective at detecting data usage from AI generation, indicating they are impractical for opaque AI systems. In contrast, the performance of InfoTracer exhibits a consistent improvement with an increasing training dataset size. This enhancement arises because datasets with more training data can amplify information isotope activation, thereby providing more robust signals for detection. Notably, the InfoTracer method achieves a detection accuracy of around, or even exceeding, 99% across target models when the size of the suspected dataset surpasses 30 entries (totaling 4000 words, equivalent to the average length of a four-page academic paper). Furthermore, the detection feature distribution shown in Fig. 3D reveals that the separation between positive and negative samples becomes substantially more distinct when information isotopes from 50 data entries are incorporated into the analysis. 
These results underscore the high sensitivity of InfoTracer, which can reliably evidence the use of training data with significant accuracy even when relatively small datasets are available.
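A minimal sketch of the set-level metric described above: per-entry activation scores are averaged over a suspected set of size K, and member sets are compared against non-member sets with a rank-based AUC. The resampling scheme (`trials` random subsets drawn without replacement) is an assumption for illustration and may differ from the evaluation protocol actually used.

```python
import random
from statistics import mean
from typing import List, Sequence


def set_score(entry_scores: Sequence[float], K: int, rng: random.Random) -> float:
    """Average activation score over K entries sampled without replacement."""
    return mean(rng.sample(list(entry_scores), K))


def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Rank-based AUC: probability that a member set outscores a non-member set."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def set_level_auc(
    member_scores: Sequence[float],
    nonmember_scores: Sequence[float],
    K: int,
    trials: int = 200,
    seed: int = 0,
) -> float:
    rng = random.Random(seed)
    pos: List[float] = [set_score(member_scores, K, rng) for _ in range(trials)]
    neg: List[float] = [set_score(nonmember_scores, K, rng) for _ in range(trials)]
    return auc(pos, neg)
```

Averaging over K entries shrinks the spread of both score distributions while leaving their means unchanged, which is why a weak per-entry signal can become a strong set-level signal.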
Figure 3C presents the statistical significance (expressed as the p-value) of InfoTracer compared to that of baseline methods, measured by a two-sided t-test, under varying token volumes of the suspected data. The results presented in Fig. 3C are based on WikiMIA-128 and averaged across all evaluated target AI models. First, the baseline methods exhibit limited detection capability. For instance, even when 50 data entries (approximately 6000 words) are available for detection, the p-values obtained from the baselines remain consistently above 0.1, indicating a lack of statistical evidence to confirm the detection result. These results suggest that the baselines do not provide sufficient evidence of distinction between member and non-member data. Second, our findings reveal that InfoTracer achieves significant detection capabilities even with limited data volumes. Specifically, the p-values associated with InfoTracer consistently remain below the threshold of 0.05 across a range of AI models even when only approximately 30 data entries are available, equivalent to 4000 words (less than half the length of this paper). This is because the traceability properties of information isotopes from training and non-training data are highly distinct, which can serve as critical indicators of membership. Third, as the volume of data under detection increases, the detection significance improves at an approximately exponential rate, evidenced by the corresponding p-value exhibiting exponential decay. This phenomenon arises from the capability of our InfoTracer method to effectively amplify the informative signals embedded within the information isotopes, thereby enhancing the detection of training data. These findings underscore the superiority of our method in addressing critical detection scenarios involving large-scale data leakage.
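The significance test can be sketched as below, assuming that the two samples being compared are the per-entry activation scores of the suspected data and of a non-member reference set (the text does not spell out the exact samples). To stay within the standard library, the p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable for moderately large samples; an exact Welch test is available as `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import NormalDist, mean, variance
from typing import Sequence


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two samples with possibly unequal variances."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / sqrt(se2)


def approx_two_sided_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided p-value using a normal approximation to the t distribution (assumption)."""
    t = welch_t(a, b)
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))
```

Because the standard error shrinks as more entries are examined, the statistic grows with the sample size and the tail probability falls off rapidly, which matches the decay of the p-value reported above.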
Detection precision under varying data base rates
In real-world auditing scenarios, the distributions of member and non-member samples within suspected datasets are often highly imbalanced. A particularly challenging case arises when the positive class, i.e., data points that truly belong to the training set, has a very low base rate. Under such conditions, the ROC-AUC metric alone may fail to adequately capture the practical effectiveness of different detection methods. To address this limitation, we further evaluate the detection precision of InfoTracer and the most competitive baseline method, PETAL, across a range of base-rate settings. Here, the base rate is defined as the proportion of member samples relative to non-member samples, and we include evaluations under extremely sparse conditions (e.g., a 1% base rate) to better reflect realistic auditing scenarios. We report Precision@1%FPR for both InfoTracer and PETAL using the LLaMA models and WikiMIA dataset across different detection data token sizes (N). The corresponding results are presented in Fig. 4, from which several key observations can be derived. First, the results indicate that as the base rate decreases (i.e., from 1:1 to 1:100), the detection precision of both InfoTracer and the baseline method declines, highlighting the increasing difficulty of this task under highly imbalanced conditions. Second, once the monitored text length reaches a sufficient scale (e.g., 4000 tokens), InfoTracer consistently achieves exceptionally high precision (above 99%) with a false positive rate below 1%, even under highly imbalanced settings where the ratio of member to non-member samples is as low as 1:100. Moreover, as the amount of content available in the suspected data increases, the detection precision of InfoTracer continues to improve. These findings collectively demonstrate the robustness and reliability of InfoTracer in providing strong and trustworthy evidence for data auditing in real-world deployment scenarios.
Third, InfoTracer consistently outperforms baseline methods across a wide range of target models, base-rate settings, and detection data sizes. This superiority arises from its methodological advantage over existing label-only MIAs: unlike prior approaches, InfoTracer does not rely on surrogate models to approximate the generation behavior of the target model. As a result, InfoTracer achieves stronger generalizability and maintains stable performance across diverse and challenging detection scenarios.
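The dependence of precision on the base rate can be made concrete with a short sketch. It assumes that the decision threshold is set so that roughly 1% of non-member scores exceed it, and that `ratio` is the member to non-member ratio (e.g., 1/100); both the function names and this exact reading of Precision@1%FPR are assumptions for illustration.

```python
from typing import Sequence


def threshold_at_fpr(nonmember_scores: Sequence[float], fpr: float = 0.01) -> float:
    """Threshold such that about `fpr` of the non-member scores lie strictly above it."""
    ordered = sorted(nonmember_scores, reverse=True)
    k = min(len(ordered) - 1, int(fpr * len(ordered)))
    return ordered[k]


def precision_at_fpr(
    member_scores: Sequence[float],
    nonmember_scores: Sequence[float],
    ratio: float,        # members : non-members, e.g. 1 / 100
    fpr: float = 0.01,
) -> float:
    theta = threshold_at_fpr(nonmember_scores, fpr)
    tpr = sum(s > theta for s in member_scores) / len(member_scores)
    real_fpr = sum(s > theta for s in nonmember_scores) / len(nonmember_scores)
    true_pos = tpr * ratio        # expected true positives per non-member sample
    false_pos = real_fpr * 1.0    # expected false positives per non-member sample
    total = true_pos + false_pos
    return true_pos / total if total > 0 else 0.0
```

Even a detector that flags every member loses precision at a 1:100 ratio if it operates at exactly 1% FPR, so precision near 99% under that ratio requires the realized false positive rate to be far below 1%.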
Results based on the LLaMA models and WikiMIA datasets demonstrate that InfoTracer achieves high precision (e.g., 99.5%) while maintaining a false positive rate below 1%, even under extremely low base rates (e.g., 1:100). These results highlight the robustness and reliability of InfoTracer in delivering strong and trustworthy evidence for data auditing in practical deployment scenarios.
Generalization of InfoTracer across data distributions
Next, we evaluate the performance of InfoTracer across two critical data domains, i.e., medical data and copyrighted books, to investigate its generalizability. Notably, LLaMA and other commercial AIs examined in prior experiments do not disclose the use of medical data in their training processes. Thus, we employ a specialized medical LLM44 as the target AI and utilize the medical data used during its training as the target detection dataset. Furthermore, the technical report of the LLaMA model40 specifies that the Book3 dataset45 is included in its training data. Therefore, we evaluate the performance of InfoTracer in detecting copyrighted book data using Book3 and the LLaMA model. Results are presented in Fig. 5A and B. We find that InfoTracer significantly and consistently outperforms baseline methods under these experimental conditions. For instance, InfoTracer detects the use of medical data in AI training with an accuracy score of 99.0% and exhibits distinct, discriminative detection features characterized by clear decision boundaries. These underscore the generalizability of InfoTracer in detecting training data across distinct critical domains of private and copyrighted data, further highlighting its effectiveness in practice.
A Generalization across data distributions. We evaluate the performance of InfoTracer against the strongest baseline, PETAL, using ROC curves across three distinct datasets, i.e., Book3, Medicine, and Book-IID. Results demonstrate that InfoTracer consistently achieves superior discrimination capability across heterogeneous data domains, underscoring its strong generalization ability. B Detection feature separability. The detection features distribution of InfoTracer (i.e., activation scores) reveals a clear separation between training and non-training data when using 30 probing samples (K = 30). C Robustness under adversarial attacks. We evaluate the robustness of InfoTracer against both rephrasing- and replacement-based adversarial attacks on WikiMIA-128 dataset. Even under severe perturbation intensities (e.g., α = 49%), InfoTracer consistently preserves high detection accuracy, outperforming the best-performing baseline methods. These results demonstrate the practical reliability of InfoTracer for real-world auditing applications. We present the baseline performance only under moderate attack settings (α ≤ 20%), as its effectiveness deteriorates to near-random levels under stronger attacks.
Moreover, in previous experiments, a potential distributional bias existed between member and non-member data, which might allow the detection algorithm to exploit spurious shortcuts. To mitigate this issue, we conducted controlled experiments using an additional dataset in which member and non-member samples follow an independent and identically distributed (i.i.d.) setup. Specifically, we selected non-membership samples from the novel-book dataset45, which contains literary works published after the release of LLaMA-1, as an initialization. Each selected article was randomly divided into two halves, and LLaMA-1-7B was fine-tuned exclusively on the first half. This process yielded a new dataset, termed Book-IID, where both member and non-member samples are drawn from the same distribution with respect to the fine-tuned model. The detection performance was then evaluated by contrasting the fine-tuned (exposed) portions with their corresponding untouched (unseen) halves. To better reflect realistic deployment scenarios, the model was fully fine-tuned over all parameters without employing any parameter-efficient adaptation techniques. Results (Fig. 5A, B) demonstrate that, under the i.i.d. setting, the performance of the most competitive label-only MIA baseline degenerates to random guessing, whereas our method continues to effectively distinguish training data from non-training data, achieving an AUC score of 61.5 (K = 1). This degradation arises because, under identical member and non-member distributions, the discrepancies between the surrogate model used by PETAL and the target LLM prevent the surrogate from accurately capturing the subtle probability differences of the target model in generating these two categories. In contrast, our method is surrogate-free and leverages information isotopes to directly probe the target model's memorization of training data, thereby demonstrating superior robustness and generalization.
These findings further confirm that the detection capability of our approach stems from genuine “training exposure” rather than superficial “conceptual familiarity” or “style familiarity”.
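The Book-IID construction can be sketched as below. The text states only that each article was randomly divided into two halves, so the choice of the paragraph as the split unit and the shuffle-based assignment are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_halves(paragraphs: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly assign the paragraphs of one article to an exposed half and an unseen half."""
    rng = random.Random(seed)
    order = list(range(len(paragraphs)))
    rng.shuffle(order)
    half = len(order) // 2
    # Restore the original paragraph order inside each half.
    exposed = [paragraphs[i] for i in sorted(order[:half])]  # used for fine-tuning (members)
    unseen = [paragraphs[i] for i in sorted(order[half:])]   # held out (non-members)
    return exposed, unseen
```

Because both halves come from the same article, differences in topic, style, and publication date cannot separate members from non-members; only exposure during fine-tuning can.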
Robustness of InfoTracer under adversarial attacks
The core detection principle of InfoTracer lies in tracing specific information isotopes embedded within training data, which can potentially be affected by perturbation-based adversarial attacks46,47,48. To systematically evaluate the robustness of InfoTracer, we assess its performance under two representative adversarial scenarios: data rephrasing and data selection attacks. Following prior work on rephrasing attacks47,48, we randomly replace an α% proportion of tokens in the training data with their synonyms to simulate semantic-preserving perturbations. In addition, inspired by recent studies on watermark resilience46, we emulate situations where only a portion of a target owner’s data is intermingled with external data during model training; here, the membership proportion within the dataset to be detected is also represented by α. We vary the attack intensity from moderate (α = 10%) to severe (α = 49%) to comprehensively test the algorithm’s robustness.
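Both attack settings can be simulated with a few lines. This is a sketch under stated assumptions: `synonyms` is a hypothetical lookup table standing in for whatever synonym source is used, α is passed as a fraction in [0, 1], and only tokens that have an entry in the table are eligible for replacement.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def replace_tokens(
    tokens: Sequence[str], synonyms: Dict[str, str], alpha: float, seed: int = 0
) -> List[str]:
    """Rephrasing attack: replace about alpha * len(tokens) tokens with synonyms."""
    rng = random.Random(seed)
    replaceable = [i for i, tok in enumerate(tokens) if tok in synonyms]
    n = min(len(replaceable), round(alpha * len(tokens)))
    attacked = list(tokens)
    for i in rng.sample(replaceable, n):
        attacked[i] = synonyms[attacked[i]]
    return attacked


def mixed_suspect_set(
    members: Sequence[T], nonmembers: Sequence[T], K: int, alpha: float, seed: int = 0
) -> List[T]:
    """Selection setting: a suspected set of size K in which a fraction alpha are members."""
    rng = random.Random(seed)
    n_members = round(alpha * K)
    return rng.sample(list(members), n_members) + rng.sample(list(nonmembers), K - n_members)
```

Fixing the seed keeps the perturbed data identical across detection methods, so differences in accuracy reflect the methods rather than the sampled perturbation.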
As shown in Fig. 5C, replacement-based attacks of moderate strength exert only marginal influence on the performance of InfoTracer. For example, under the LLaMA-30B model with K = 30 suspected data entries, the detection accuracy declines by less than 5% when 10% of tokens are replaced. Moreover, InfoTracer consistently surpasses the strongest baseline across all attack settings and retains high accuracy even under extreme attacks (e.g., α = 49%), underscoring its robustness against adversarial manipulations. The underlying mechanism behind this phenomenon can be intuitively explained as follows. When a portion of the words from the training data is rephrased, the membership signals derived from those altered elements may shift from “1” to “0”. Nevertheless, the remaining unmodified words continue to contribute strong membership cues (with signal strength “1”) to the overall detection process. Consequently, compared with non-member samples (whose corresponding detection signal strengths remain near “0”), our method can still reliably distinguish member data, thereby exhibiting robust performance even under substantial rewriting perturbations. Furthermore, this analysis suggests that, for a given attack intensity, increasing the number of tokens involved in the detection process can effectively compensate for the reduced individual signal strength, thereby stabilizing the overall detection accuracy. Further theoretical analysis (Lemma 2) quantifies this trade-off, showing that involving an additional \(\frac{2\alpha -{\alpha }^{2}}{{(1-\alpha )}^{2}}\) of detection data entries effectively offsets the performance degradation resulting from replacing α% of the training data. These findings suggest that the robustness of InfoTracer can be significantly enhanced by supplementing it with a moderate volume of detection data, which is typically accessible in real-world cases. 
We also report the performance of our method under an extremely strong attack scenario (i.e., α = 99%) in Supplementary Fig. 8.
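One way to read the Lemma 2 expression is sketched below. This is an illustrative reading under a simplifying assumption, not the proof given in the paper: replacing a fraction α of the training tokens is assumed to scale the expected activation gap between member and non-member data by (1 − α), while the standard deviation of an average over K entries is assumed to shrink as \(1/\sqrt{K}\). Here K′ denotes the number of entries needed after the attack to keep the same signal-to-noise ratio.

```latex
% Illustrative assumption: gap scales by (1 - \alpha); noise of a K-entry average scales as 1/\sqrt{K}.
\[
(1-\alpha)\sqrt{K'} = \sqrt{K}
\;\Longrightarrow\;
K' = \frac{K}{(1-\alpha)^{2}},
\qquad
\frac{K'-K}{K} = \frac{1-(1-\alpha)^{2}}{(1-\alpha)^{2}}
             = \frac{2\alpha-\alpha^{2}}{(1-\alpha)^{2}} .
\]
% Worked values:
%   \alpha = 0.10:  0.19 / 0.81     \approx 0.235  (about 23.5% more entries)
%   \alpha = 0.49:  0.7399 / 0.2601 \approx 2.845  (about 2.8 times as many additional entries)
```

Under this reading, a 30-entry audit attacked at α = 10% would need roughly 37 entries to recover its original signal strength, and about 115 entries at α = 49%.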
In addition, we conduct a comprehensive analysis of the methodological design underlying InfoTracer. First, an ablation study on the multi-view probe selection mechanism (Supplementary Fig. 1) demonstrates its effectiveness in identifying high-quality probe queries and enhancing detection performance. Supplementary Fig. 2 also shows consistently high agreement on the probes selected by different LLMs, alongside consistent performance improvements, indicating that the selection process is grounded in objective content reasoning rather than idiosyncratic stylistic preferences of any specific model. Second, we further conduct an accuracy-efficiency analysis on the long-form book dataset, where only a subset of tokens from each book was used for detection. Supplementary Fig. 3 shows that detection accuracy increases rapidly as the number of sampled tokens grows and soon saturates at 100%, whereas the inference time increases approximately linearly. These results demonstrate that partial-content sampling can substantially reduce computational cost without compromising detection performance. Third, we examine the impact of entity recognition quality on overall detection performance (Supplementary Fig. 9), revealing the robustness of InfoTracer to variations in semantic extraction accuracy. Fourth, we perform an ablation study on semantic element selection across different parts of speech (Supplementary Fig. 4), elucidating how linguistic categories contribute to detection signals. Fifth, for completeness, we compare InfoTracer with representative gray-box MIA methods (Supplementary Fig. 5) to contextualize its black-box advantages.
Detecting training data of opaque AI systems
In this section, we assess the effectiveness of different methods in detecting training data from state-of-the-art closed-source LLMs. Following prior studies29,49, we assume that for a given opaque LLM, data samples collected from public sources prior to its knowledge cut-off date are highly likely to have been used for training, and thus serve as suspected membership data. Conversely, samples published after the model knowledge cut-off date from the same source are treated as non-membership data. While this evaluation setup is not perfect, due to potential uncertainties in actual training data inclusion and distributional shifts between member and non-member samples, we argue that it nonetheless provides valuable insights into the practical detection performance of MIAs against frontier closed-source LLMs. Specifically, we conduct two sets of evaluations: (1) Using a news-article dataset (NEWS) in conjunction with four commercial models, namely GPT-3.5, GPT-4o, Claude-3.0-haiku, and Gemini-1.5-pro-001. (2) Employing a code-program dataset (CODE) for cross-model evaluation across eight commercial LLMs, including GPT-3.5, GPT-4o, Claude-3.0-haiku, Gemini-1.5-Pro-002, Qwen-3, DeepSeek-V3, GLM-4, and Grok-3. All experiments are independently repeated 20 times to ensure their reliability. Performance on detecting the use of a single data entry for AI training is presented in Fig. 6A (histogram plot), from which we draw two main findings. First, results reveal that the baseline method performs only marginally better than random guessing. The strongest method among recent baselines, PETAL, exhibits lower performance on frontier closed-source models compared to open-source ones. This is because the generative distributions of these opaque systems are substantially harder for surrogate models to approximate, resulting in a pronounced distributional mismatch and degraded inference accuracy. 
Second, the results demonstrate that InfoTracer substantially outperforms the baseline approaches, offering significant improvements in the detection of data used in AI training and achieving robust identification performance across large-scale, closed-source AI models.
Evaluation is based on three settings: the CODE dataset across eight commercial AI models, the NEWS dataset across four commercial AI models with aligned knowledge cut-off dates, and a large-scale novel corpus. A Evaluation on the CODE and NEWS datasets. The AUC curves compare InfoTracer with the strongest label-only baseline MIA method (PETAL). InfoTracer substantially outperforms the baseline, whose performance remains largely stagnant as the number of examined data entries increases. In contrast, the detection accuracy of InfoTracer improves sharply, exceeding 99% AUC on the NEWS dataset with only 50 entries (approximately eight pages of text), highlighting its effectiveness under limited evidence and its practical auditing potential. B Evaluation on a large-scale novel dataset comprising twenty-one novels (approximately 1M tokens) shows that InfoTracer achieves a perfect separation between member and non-member samples, exhibiting clearly distinct detection feature distributions. These results underscore the strong scalability of InfoTracer and its practical applicability to real-world long-form textual data.
Furthermore, we construct a large-scale benchmark that is less sensitive to temporal shifts for robust evaluation. Importantly, the freshness of a novel’s content with respect to an LLM is not necessarily determined by its publication date, but rather by its inclusion in the training corpus. For example, a novel published after the model knowledge cut-off may describe historical events or fictional scenarios that are temporally detached from its release date. Based on this observation, we compile a dataset comprising 12 classic Chinese novels published prior to 2017 as representative training data, and 9 recently published Chinese novels (after December 2024) as representative non-member samples, yielding a combined dataset of over one million tokens. We conduct empirical evaluations on Doubao-1.5-pro, a widely used commercial LLM developed by ByteDance and publicly released in January 2025. Since its exact knowledge cut-off is undisclosed, we treat publication date as a conservative proxy for membership status. Results in Fig. 6B and Supplementary Table 4 demonstrate that our method can perfectly distinguish the membership status of these long-form novel texts, achieving an accuracy of 100%, with significant detection features. These findings further highlight the strong practical potential of our method for real-world applications involving long-form textual data.
Discussion
In the current era of rapid AI advancement, many AI institutions have entered a competitive race to amass ever-expanding quantities of data for training the most advanced AI systems6,7,50. The Internet, which encompasses a wide array of content from personal data (such as social media posts) to professional works (like news articles and art creations), serves as a source of rich human knowledge and is extensively utilized for AI training8. Although much of this data is publicly accessible, it may include sensitive information, such as personal privacy or copyrighted content, the unauthorized use of which in AI training has led to an increasing number of data infringement cases18,19,20. Furthermore, due to the inherent characteristics of the data-driven learning paradigm, many AI systems may inadvertently generate content that closely resembles their training data when responding to user queries. Given the widespread application of AI tools across various sectors of society, this flaw facilitates the broader dissemination of unauthorized data and creates unfair competition for human creators. If left unresolved, such infringement could diminish the motivation of creators and impede progress across various fields. Generating content that closely resembles training data also exposes AI users to potential risks of data infringement and may erode user confidence in AI tools, which ultimately diminishes the benefits of advanced AI techniques for our society. Therefore, there is an urgent need for a sound framework to regulate such unauthorized data usage. Moreover, unlike traditional data breaches, many advanced AI systems exhibit significant opacity, with unauthorized data usage often occurring in a covert manner. This makes it challenging to detect and recognize instances of misuse, let alone to gather conclusive evidence of unauthorized data exploitation.
Such challenges underscore a critical vulnerability within current legal frameworks for data rights protection in the context of real-world cases involving AI applications. Addressing these deficiencies is essential to safeguarding societal equity and mitigating the social and economic risks associated with advanced AI technologies.
Mainstream studies have explored gray-box model features to detect the use of training data by analyzing the intermediate computation variables of AI systems during the generation process. However, such methods are largely impractical in real-world applications, as intermediate variables produced by these AI systems are typically inaccessible. Instead, our study approaches the problem in realistic, purely black-box scenarios where only AI-generated outputs are available. Moreover, as evidenced in Fig. 1C, advanced AI systems are optimized to avoid generating verbatim reproductions of training data, making it challenging to establish evidence of unauthorized data usage solely from AI-generated outputs. Although several studies have explored label-only membership inference methods, these approaches typically rely on a surrogate model to approximate the generation probability distribution of the target model, resulting in limited generalization across different models. Therefore, developing surrogate-free approaches for detecting training data in black-box LLMs remains an open and important research challenge. In this study, we introduce the concept of information isotopes and demonstrate their traceability properties within the AI training process. By marking specific semantic elements within the training data, we effectively trace their influence on AI outputs, even in the absence of direct observation of the training process (Fig. 1B). These findings highlight the potential of information isotopes as both informative and interpretable clues for detecting data utilization in AI training solely from generated outputs. This work also establishes a new research direction, expanding the boundaries of existing methodologies by enabling the identification of training data without relying on inaccessible internal computational information within AI systems.
Based on these findings, we propose an information isotope tracing mechanism to audit and provide evidence of unauthorized data usage in AI training by examining AI-generated content. Evaluations (Fig. 3A–D) reveal that InfoTracer achieves effective detection performance with accuracy of up to 99% and statistically significant evidence (p-value ≈ \({10}^{-3}\)) when provided with a small amount of suspected data for analysis, whereas baseline methods degrade to random guessing. These findings highlight that InfoTracer can provide reliable detection evidence of unauthorized training data usage under practical conditions, requiring access only to AI-generated outputs and a limited amount of suspected data for examination. Moreover, experiments (Figs. 5A and 6A, B) also demonstrate the effectiveness of InfoTracer in detecting data across different critical domains (including copyrighted news articles, books, medical data, novels, and code). These results underscore the broad applicability of InfoTracer in addressing diverse real-world detection requirements, particularly for data from various privacy- and intellectual property-sensitive domains. Figure 5C illustrates that InfoTracer maintains robust performance against various replacement-based adversarial attack strategies. Given that data utilization strategies for AI training in real-world applications often remain opaque, these results validate the robustness and generalization capability of InfoTracer under challenging detection conditions. In addition, Fig. 3A also highlights the scalability of InfoTracer, showcasing its potential effectiveness in protecting data rights against larger and more complex AI systems in the future. Furthermore, Supplementary Figs. 1, 4, 5, and 9 provide an in-depth analysis of our proposed information isotope tracing methodology. These findings offer valuable insights into the underlying mechanisms of the algorithm and guidance for better applying this method to address real-world cases.
We further release our algorithm as an open-source and user-friendly software tool to inclusively empower individuals to proactively safeguard their data and ensure its responsible use. In conclusion, the findings and the tool provided in this study highlight the practicality and accessibility of InfoTracer for real-world use, allowing even those without expertise in artificial intelligence to challenge large AI organizations and protect their data rights. This democratization of data protection represents a significant step toward creating an equitable environment, particularly for individuals or organizations lacking the financial resources to conduct intricate technical investigations when addressing disputes with large AI organizations. Moreover, beyond its immediate application to copyright-related detection, this work highlights a broader potential for auditing data leakage risks in LLMs. In particular, it offers a post-deployment means to examine whether models fine-tuned on sensitive domain datasets, such as medical or satellite data, inadvertently retain or expose training information through generation. By enabling such assessments under fully black-box access, our approach contributes to the responsible development and oversight of domain-specific AI systems. Future work will explore domain-aware extensions to further support the trustworthy use of sensitive data in large-scale model training.
Besides, in an ideal scenario, the utilization of black-box methodologies may be unnecessary if effective regulatory mechanisms are implemented. For instance, advanced AI developers could be mandated to provide gray-box APIs that expose internal computational features and disclose key aspects of the training process. However, in real-world scenarios, LLMs possess substantial intellectual property value, and both the detailed information pertaining to the inference phase provided by gray-box APIs (e.g., generation likelihoods and probability distributions) and the training corpus are closely linked to the proprietary nature of LLMs. For example, inference-phase details are highly susceptible to exploitation through distillation-based attacks, while the training corpus serves as the foundational asset for LLM development. Given these considerations, enforcing stringent regulatory measures poses significant challenges when weighed against the commercial interests of the AI providers. Furthermore, even if such regulations could be established, it would be exceedingly difficult for regulatory bodies to rigorously verify the accuracy and integrity of the information disclosed through gray-box APIs or training datasets. This limitation could, in turn, introduce novel security vulnerabilities. Therefore, black-box training data detection methods for LLMs remain critical approaches for auditing unauthorized data usage within opaque LLM systems.
Our work is not without limitations. First, the current work primarily focuses on examining the traceability property of textual information isotopes, with the proposed InfoTracer method being applicable exclusively to the detection of textual training data. While textual data remain a primary knowledge source for advanced AI systems, the growing demand for multi-modal AI, encompassing human-produced data such as images, videos, and audio, introduces additional challenges and raises concerns regarding potential infringements on data rights. In future research, we aim to extend the InfoTracer method to accommodate data from these other critical modalities. Building upon the principles established in this work, the detection of audio training data may be achieved by converting audio signals into textual representations, whereas the detection of image and video data could involve masking visual objects and querying the target AI to identify the correct object from semantically similar alternatives. Second, while InfoTracer demonstrates robustness against multiple adversarial strategies, as with prior work, it may still be susceptible to unforeseen attacks. In particular, large-scale data rewriting or adversarial training could weaken detection performance. This limitation is alleviated by several factors. The extremely high cost and time required to retrain large LLMs make targeted adversarial adaptation largely impractical, providing inherent robustness in real-world scenarios. Moreover, expanding the scope of the audited content can effectively mitigate the influence of adversarial rewriting. In parallel, we continue to refine InfoTracer to further enhance its robustness and generalizability. 
Third, although evaluations on closed-source LLMs offer valuable insights, they necessarily rely on self-curated datasets and may introduce potential biases stemming from multiple sources, including uncertainties in actual training data inclusion and distributional shifts between member and non-member samples (e.g., differences in publication time or data collection channels). In future work, we plan to collaborate with model providers and adopt verified benchmark datasets to enable more accurate and controlled evaluation of diverse MIA methods. We also emphasize that the experiments on closed-source LLMs are intended primarily to demonstrate the practical applicability of our approach in real-world regulatory or compliance scenarios. Importantly, our study includes both an unbiased evaluation setting (i.e., the Book-IID dataset) and a setting with definitive membership and non-membership partitioning based on open-source models. Together, these two complementary protocols provide strong and rigorous evidence for the validity and robustness of our method.
Methods
Experimental settings
Target AI and datasets
In our experiments, we evaluate thirteen widely recognized AI models, including the LLaMA series (four model scales: 7B, 13B, 30B, and 65B), GPT-3.5-turbo, GPT-4o-2024-05-13, Claude-3-haiku, Gemini-1.5-Pro51, GLM-4-Air52, DeepSeek-Chat53, Qwen-354, Grok-355, and Doubao-1.5-pro. The corresponding dataset used for detection evaluation comprises 600 public news articles sourced from three prominent outlets: The New York Times, NBC News, and CNN News. According to the official documentation for these AI models, their knowledge bases cover data published before December 2022 and exclude any information published after December 2023. Accordingly, 300 articles published in 2022 were gathered to serve as the training data; these articles can be found in the Common Crawl dataset, a widely utilized resource for LLM pre-training. Another 300 articles published in 2024 were gathered to serve as the non-training data. In addition to the detailed description provided in the Results section, we also construct a benchmark dataset independent of temporal cut-offs, comprising 730 C-language code files for evaluation. For scalability evaluations, we leverage the LLaMA-1 series models40 due to the transparency of their training data sources. Knowledge entries from Wikipedia were compiled into the WikiMIA dataset, with each entry categorized as training or non-training data based on the disclosed usage information22,23. In accordance with standard practices22,23, the WikiMIA dataset was further segmented into subsets of text containing 32, 64, 128, and 256 tokens to evaluate the impact of data length on detection performance. Furthermore, the Books3 dataset45, which comprises copyrighted book content and is extensively used for LLM pre-training, is utilized to evaluate InfoTracer in the book domain.
Given that the LLaMA model reports the inclusion of Books3 data in its training process, we randomly sampled 200 data entries published prior to 2023 to represent the training data, and an additional 200 data entries published in 2023 to serve as the non-training data, following the existing evaluation protocols22. In addition, we employed the LLaMA-2-7B model, fine-tuned on a medical corpus, as the target model for evaluation44. We randomly sampled 100 data entries from the fine-tuning dataset as the training data and another 100 data entries from its evaluation dataset as the non-training data. To further evaluate the performance of our method on extremely long textual data, we constructed a novel dataset comprising one million tokens for evaluation. These datasets are also provided in our code repository.
Baselines and metric
Our work focuses on detecting training data solely from AI-generated outputs, which aligns with real-world detection conditions. In such scenarios, existing detection methods are impractical and therefore not suitable for direct comparison in experimental evaluations. Thus, we incorporate several baseline methods that leverage generation similarity for detection. Specifically, we use the preceding text of human-authored data to prompt the target AI to generate a continuation. We then evaluate the similarity between the AI-generated continuation and its human counterpart. For this purpose, we employ two widely used text similarity modeling methods: ROUGE score36 and BERT-based similarity56. To illustrate the challenges of performing MIAs in black-box settings for LLMs, we implement black-box adaptations of two SOTA gray-box MIA methods, i.e., Neighbour41 and SPV-MIA27, and compare them with our method. Specifically, SPV-MIA leverages differences in generation likelihood, a gray-box feature, between an original input and a variant rewritten by a language model to infer training membership. The study demonstrates that the likelihood degradation induced by rewriting is substantially larger for training samples than for non-training samples. However, likelihood features are inaccessible in many mainstream closed-source models (e.g., GPT-4o). To adapt SPV-MIA to a black-box setting, we query the target LLM with both the original and perturbed inputs, instructing it to select the preferred sequence based on its generated outputs to estimate the likelihood degradation. Similarly, Neighbour reports that training data samples exhibit significantly lower loss values than their perturbed counterparts. Yet, as with log-likelihood, model loss is unavailable in closed-source systems. 
Therefore, we adopt the data-perturbation strategy from Neighbour and query the target LLM with both the original and neighboring samples, relying on its generation preferences to determine which input is favored. Samples with high similarity scores are subsequently identified as potential training data. We evaluate the detection accuracy using the AUC score, following the standard practice in prior work22,24. Statistical significance in Figs. 1B and 3D is assessed using a t-test. For InfoTracer, we employ LLaMA-3-8B to generate information isotopes. For each selected element, the number of generated isotopes is set to I = 3, and the number of independent query trials is Q = 4. A generation temperature of 0.3 is used. Each experiment is repeated 20 times. More details about the datasets and experimental setup are provided in Supplementary Table 2.
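The generation-similarity baselines and the AUC metric described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions: `unigram_f1` is a whitespace-token, ROUGE-1-style overlap without stemming, not the ROUGE implementation used in the experiments, and `auc` uses the pairwise-comparison definition of the area under the ROC curve. Both function names are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 between two whitespace-tokenized texts (assumed simplification)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped unigram overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def auc(member_scores, nonmember_scores) -> float:
    """Probability that a random member outscores a random non-member; ties count as 0.5."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))
```

In this sketch, each score would be the similarity between the AI-generated continuation and the human-written continuation of one sample; an AUC near 0.5 corresponds to random guessing.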
Problem definition
In this study, we investigate how to identify the training data of a target AI model based on its generated outputs. Specifically, we consider an opaque AI system, denoted as \({{\mathcal{M}}}\), where the only accessible information is the model output y obtained by querying it with input x, i.e., \(y={{\mathcal{M}}}(x)\). Importantly, we assume that no additional information, including the intermediate computational variables of the model (like model perplexity and representations of the input data), is available. This assumption aligns with real-world scenarios, where access to the internal workings of AI models is typically restricted. Furthermore, given a suspected data entry T of length E, the objective is to design a detection algorithm \({{\mathcal{A}}}(\cdot )\) capable of determining whether the data entry T was used in training the AI model \({{\mathcal{M}}}\) purely through examining its generations. In addition, in many practical scenarios, it is common for unauthorized AI training to involve a collection of data entries from a specific source rather than a single entry. Thus, the detection algorithm \({{\mathcal{A}}}(\cdot )\) is also tasked with analyzing a suspected dataset \({{\mathcal{D}}}\) of size K to determine whether the dataset \({{\mathcal{D}}}\) was used for training the AI model, where \({{\mathcal{D}}}=\{{T}_{i}| i=1,2,\ldots,K\}\).
Theoretical analysis on the detection significance and robustness of InfoTracer
We present a theoretical analysis of the detection significance underlying the InfoTracer method. To facilitate this, we begin by reformulating the InfoTracer process from a probabilistic perspective. Specifically, consider a probe query \({{\mathcal{E}}}\) that asks the model to select a target semantic element f from a set of semantically confounding variants to complete a masked input. Owing to the deterministic inference behavior of LLMs, the true probability \({p}_{{{\mathcal{E}}}}\) with which the model \({{\mathcal{M}}}\) selects the target element f is itself deterministic. However, due to the inherent diversity of semantic elements and their context-sensitive dependencies, these ground-truth recovery probabilities \({p}_{{{\mathcal{E}}}}\) are not uniform but instead governed by an unknown underlying distribution, denoted \({{\mathcal{P}}}\). Moreover, as illustrated in Fig. 1B, the probability of correctly recovering a target semantic element is significantly affected by whether the data entry containing the element was included in the training corpus of the AI model. To model this effect, we introduce a distributional assumption that captures this non-identical traceability phenomenon. For a given probe query \({{\mathcal{E}}}\), we assume that the selection probability \({p}_{{{\mathcal{E}}}}\) with which the AI model selects the target semantic element f from its I information isotopes, follows a distribution that is conditioned on whether the underlying data entry was used for training.
$$
{p}_{{{\mathcal{E}}}} \sim \begin{cases} {{{\mathcal{P}}}}_{t}, & {{\mathcal{Y}}}(T)=1, \\ {{{\mathcal{P}}}}_{n}, & {{\mathcal{Y}}}(T)=0, \end{cases}
$$
where \({{\mathcal{Y}}}(T)\) is a binary indicator denoting whether the data entry T was used for training the AI model \({{\mathcal{M}}}\), and \({{{\mathcal{P}}}}_{t}\) and \({{{\mathcal{P}}}}_{n}\) represent the distributions of the element recovery probability for training and non-training data, respectively. We make no assumptions about the specific forms of \({{{\mathcal{P}}}}_{t}\) and \({{{\mathcal{P}}}}_{n}\), except that the mathematical expectation of the recovery probability for training data exceeds that for non-training data: \({{\mathbb{E}}}_{{p}_{{{\mathcal{E}}}} \sim {{{\mathcal{P}}}}_{t}}[\,{p}_{{{\mathcal{E}}}}] > {{\mathbb{E}}}_{{p}_{{{\mathcal{E}}}} \sim {{{\mathcal{P}}}}_{n}}[\,{p}_{{{\mathcal{E}}}}]\). This assumption is empirically supported by the results in Fig. 1B. However, in practical scenarios, the true selection probability \({p}_{{{\mathcal{E}}}}\) is inherently hidden within the opaque AI system and is not directly accessible. Instead, we can only obtain a binary random variable o for selection correctness. To accommodate this limitation, we introduce a new random variable q, constructed based on the underlying distribution \({{\mathcal{P}}}\). For notational simplicity, we use \({{\mathcal{P}}}\) to denote either \({{{\mathcal{P}}}}_{t}\) (if the data was used for training) or \({{{\mathcal{P}}}}_{n}\) (if not). We define q as follows:
$$
q=\frac{1}{Q}\sum_{j=1}^{Q}{o}_{j},\quad {o}_{j} \sim {{\mathcal{B}}}({p}_{{{\mathcal{E}}}}),\quad {p}_{{{\mathcal{E}}}} \sim {{\mathcal{P}}},
$$
where \({{\mathcal{B}}}(p)\) denotes the Bernoulli distribution with parameter p, and \({o}_{j}\) represents the j-th sampled Bernoulli outcome. Evidently, the aggregated variable q, defined as the average of Q independently sampled Bernoulli variables, follows a latent distribution denoted by \({{\mathcal{Q}}}\). For notational simplicity, we use \({{\mathcal{Q}}}\) to denote either \({{{\mathcal{Q}}}}_{t}\) (if the data was used for training) or \({{{\mathcal{Q}}}}_{n}\) (if not). This distribution is implicitly governed by the underlying traceability distribution \({{\mathcal{P}}}\).
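A short worked derivation may help make the link between \({{\mathcal{P}}}\) and \({{\mathcal{Q}}}\) concrete. It assumes only what is stated above, namely that, conditional on \({p}_{{{\mathcal{E}}}}\), the Q outcomes are independent Bernoulli draws with parameter \({p}_{{{\mathcal{E}}}}\); it is a sketch and not taken from the Supplementary Information.

```latex
% Law of total expectation: the mean of q equals the mean of p_E.
\mathbb{E}[q]
  = \mathbb{E}\!\left[\mathbb{E}\!\left[\tfrac{1}{Q}\textstyle\sum_{j=1}^{Q} o_j \,\middle|\, p_{\mathcal{E}}\right]\right]
  = \mathbb{E}_{p_{\mathcal{E}} \sim \mathcal{P}}[\,p_{\mathcal{E}}\,]

% Law of total variance: sampling noise shrinks with Q, heterogeneity of p_E does not.
\mathrm{Var}(q)
  = \mathbb{E}\!\left[\mathrm{Var}(q \mid p_{\mathcal{E}})\right] + \mathrm{Var}\!\left(\mathbb{E}[q \mid p_{\mathcal{E}}]\right)
  = \frac{1}{Q}\,\mathbb{E}\!\left[p_{\mathcal{E}}(1-p_{\mathcal{E}})\right] + \mathrm{Var}(p_{\mathcal{E}})
```

Under this reading, the assumed gap between the expectations under \({{{\mathcal{P}}}}_{t}\) and \({{{\mathcal{P}}}}_{n}\) carries over unchanged to \({{{\mathcal{Q}}}}_{t}\) and \({{{\mathcal{Q}}}}_{n}\), while increasing Q reduces only the Bernoulli sampling component of the variance.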
Recall the workflow of InfoTracer: the activation score \(\widehat{q}\) is actually computed as the average of N independent and identically distributed (i.i.d.) variables \({\widehat{q}}_{i}\), each sampled from an underlying latent distribution \({{\mathcal{Q}}}\). Specifically, InfoTracer is designed to determine whether a dataset \({{\mathcal{D}}}\), consisting of K entries, was utilized in the training of a target AI model. For each selected semantic element \({f}_{i}\in {{\mathcal{F}}}\) in a data entry, the method estimates its recovery probability \({\widehat{q}}_{i}\) by querying the model with a specially constructed multiple-choice prompt s, repeated across Q independent trials: \({\widehat{q}}_{i}=\frac{1}{Q}{\sum }_{j=1}^{Q}{\widehat{o}}_{i,j},\) where \({\widehat{o}}_{i,j}\in \{0,1\}\) denotes the binary outcome on the selection correctness of the j-th trial. Accordingly, each \({\widehat{q}}_{i}\) serves as a sample independently and identically drawn from the distribution \({{\mathcal{Q}}}\). Aggregating over all N = K ⋅ M selected elements, where M is the number of selected probe queries per data entry, the overall activation score is computed as: \(\widehat{q}=\frac{1}{N}{\sum }_{i=1}^{N}{\widehat{q}}_{i}.\) Due to the i.i.d. nature of the samples, the Central Limit Theorem ensures that \(\widehat{q}\) converges to a normal distribution as N increases, thereby enabling the use of standard statistical tests to assess distributional differences across datasets. Based on this, we employ a one-sided t-test to evaluate the statistical significance between two datasets, as formalized in Lemma 1.
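The aggregation described above can be sketched in a few lines. This is an illustrative sketch rather than the released tool: `query_model` stands in for a black-box call to the target AI, `make_stub_model` is a hypothetical deterministic stand-in used only so the example runs, and the probe format (prompt, options, target) is an assumption.

```python
def recovery_estimate(query_model, probe, num_trials=4):
    """Estimate q_i as the fraction of Q trials in which the model picks the target element."""
    outcomes = [
        1 if query_model(probe["prompt"], probe["options"]) == probe["target"] else 0
        for _ in range(num_trials)
    ]
    return sum(outcomes) / num_trials


def activation_score(query_model, probes, num_trials=4):
    """Average the per-element estimates q_i over all N probes."""
    estimates = [recovery_estimate(query_model, p, num_trials) for p in probes]
    return sum(estimates) / len(estimates), estimates


def make_stub_model(memorized):
    """Hypothetical stand-in for a black-box LLM.

    Returns the memorized answer for a prompt if one exists, otherwise the first option.
    """
    def query(prompt, options):
        return memorized.get(prompt, options[0])
    return query


# Two toy probes, each with one target and I = 3 confounding alternatives.
probes = [
    {"prompt": "The capital of France is [MASK].",
     "options": ["Lyon", "Paris", "Nice", "Lille"], "target": "Paris"},
    {"prompt": "Water boils at [MASK] degrees Celsius.",
     "options": ["90", "100", "110", "120"], "target": "100"},
]
stub = make_stub_model({"The capital of France is [MASK].": "Paris"})
score, per_element = activation_score(stub, probes, num_trials=4)
```

With a real target model, the Q trials would differ because of sampling at a non-zero temperature; the deterministic stub is used here only to keep the example reproducible.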
Lemma 1
Let \({{\mathcal{D}}}\) and \({{{\mathcal{D}}}}_{n}\) denote a suspected dataset and a non-training dataset, respectively. InfoTracer computes their corresponding activation scores \(\widehat{q}\) and \({\widehat{q}}_{n}\). The one-sided p-value for testing the significance of their difference is given by:
$$
p=1-\Phi \left(\frac{\widehat{q}-{\widehat{q}}_{n}}{\sqrt{\frac{{s}^{2}}{N}+\frac{{s}_{n}^{2}}{{N}_{n}}}}\right),
$$
where N and \({N}_{n}\) denote the number of examined semantic elements in \({{\mathcal{D}}}\) and \({{{\mathcal{D}}}}_{n}\), respectively, and \({s}^{2}\) and \({s}_{n}^{2}\) represent the empirical variances of the recovery probability estimates within each dataset, i.e., \({s}^{2}={Var}_{{{\mathcal{E}}}\in {{\mathcal{D}}}}({\widehat{q}}_{{{\mathcal{E}}}})\) and \({s}_{n}^{2}={Var}_{{{\mathcal{E}}}\in {{{\mathcal{D}}}}_{n}}({\widehat{q}}_{{{\mathcal{E}}}})\), and Φ( ⋅ ) denotes the cumulative distribution function (CDF) of the standard normal distribution. In practice, following the common protocols established in prior MIA studies22,28,39, we estimate \({\widehat{q}}_{n}\) and \({s}_{n}^{2}\) using a validation set composed of non-training data for the target LLM. We acknowledge that this estimation strategy has inherent limitations. In particular, when the distribution of non-members deviates from that of members (e.g., due to domain shifts or sampling biases), the resulting reference distribution may become biased, thereby affecting the calibration of \({\widehat{q}}_{n}\) and \({s}_{n}^{2}\) as well as the reliability of subsequent significance testing. This limitation has been widely recognized in the MIA literature. In practice, to mitigate this limitation, a non-training set can be carefully curated to approximate the distribution of the suspected target data, with the resulting p-values interpreted as a reliability indicator of potential membership evidence rather than as definitive proof. Reporting detection performance under low false-positive rate regimes is essential to prevent overstatement of the results.
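The significance test in Lemma 1 can be sketched as follows. The sketch assumes the standard two-sample normal-approximation form, in which the difference of the two activation scores is divided by \(\sqrt{{s}^{2}/N+{s}_{n}^{2}/{N}_{n}}\) and passed through Φ; the function names are hypothetical, and unbiased sample variances are an assumption about how the empirical variances are computed.

```python
import math


def normal_cdf(z: float) -> float:
    """CDF of the standard normal distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sample_variance(xs) -> float:
    """Unbiased empirical variance of the per-element recovery estimates."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def one_sided_p_value(suspect, reference) -> float:
    """One-sided p-value that the suspected set scores higher than the non-training reference.

    `suspect` and `reference` are lists of per-element estimates q_i.
    """
    n, n_ref = len(suspect), len(reference)
    q_hat = sum(suspect) / n
    q_ref = sum(reference) / n_ref
    std_err = math.sqrt(sample_variance(suspect) / n + sample_variance(reference) / n_ref)
    if std_err == 0.0:
        raise ValueError("both samples are constant; the test statistic is undefined")
    z = (q_hat - q_ref) / std_err
    return 1.0 - normal_cdf(z)
```

A small p-value in this sketch indicates only that the suspected set is recovered more reliably than the reference set; as noted above, its interpretation depends on how well the reference set matches the distribution of the suspected data.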
Second, we evaluate the robustness of InfoTracer against replacement-based adversarial attacks designed to diminish the traceability of training data. Such attacks include rewriting attacks, where a subset of tokens in the unauthorized data is replaced with synonyms prior to AI training, and selection attacks, where only a portion of the entries from the target dataset \({{\mathcal{D}}}\) is selectively used for AI training. Let α denote the attack intensity, defined as the probability of data replacement. For a given detection significance level characterized by the p-value, the numbers of semantic elements required by InfoTracer under normal and adversarial scenarios are denoted by N and \({N}^{{\prime} }\), respectively. The following Lemma 2 establishes the approximate relationship between them, which reflects the robustness of InfoTracer under adversarial attacks. The proof of this lemma is provided in the Supplementary Information. Based on Lemma 2, when the attack intensity α is small (e.g., α = 0.1), the following first-order approximation can be derived via the Taylor expansion: \({N}^{{\prime} }\approx N(1+2\alpha )\). This analysis demonstrates that the adverse effects of replacement-based attacks can be mitigated through a linear compensation in the amount of examined data within \({{\mathcal{D}}}\). Specifically, introducing additional data entries proportional to 2αN allows for robust detection performance. The results indicate that the robustness of InfoTracer against potential adversarial attacks in real-world scenarios can be efficiently enhanced by incorporating a moderate amount of additional examined data, further demonstrating its practicality.
Lemma 2
Under the replacement-based attack scenario, InfoTracer achieves the same detection significance as in the non-attack scenario when \({N}^{{\prime} }\approx \frac{1}{{(1-\alpha )}^{2}}N\) holds.
The analysis shows that when only a (1 − α) fraction of the target data is used unaltered for model training, the same level of detection significance can be maintained by increasing the number of detection samples to \(1/{(1-\alpha )}^{2}\) times the original sample size N. For example, for α = 49%, \({N}^{{\prime} }\approx 3.8N\), so approximately 2.8N additional samples suffice to recover equivalent detection performance.
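The sample-size relationship in Lemma 2 can be checked numerically with a small sketch. The function names are hypothetical, and rounding up to a whole number of elements is an assumption made for illustration.

```python
import math


def required_elements(n: int, alpha: float) -> int:
    """Elements needed under attack intensity alpha, following N' = N / (1 - alpha)^2."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    return math.ceil(n / (1.0 - alpha) ** 2)


def linear_approximation(n: int, alpha: float) -> float:
    """First-order Taylor approximation N' = N (1 + 2 alpha), reasonable for small alpha."""
    return n * (1.0 + 2.0 * alpha)


# Compare the exact relationship with the linear approximation for N = 1000.
for alpha in (0.0, 0.1, 0.49):
    exact = required_elements(1000, alpha)
    approx = linear_approximation(1000, alpha)
    print(f"alpha={alpha:.2f}  exact N'={exact}  linear N'={approx:.0f}")
```

The comparison illustrates that the linear compensation is accurate for small α and increasingly underestimates the required amount of data as α grows.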
Related work
Membership inference attacks for neural networks
Detecting whether a specific data sample or dataset was used during the training of an AI model, known as a membership inference attack (MIA)57,58, has attracted significant research attention in recent years. With the paradigm shift in AI from classification models to generative models, existing MIA research can be broadly categorized into two distinct phases27,39,59,60. Studies in the first phase39,61,62,63,64,65,66,67,68 primarily target traditional classification models, e.g., ResNet models69. Depending on the level of access to the target model, these MIA methods can be broadly categorized into three types: white-box70,71,72,73, gray-box62,74,75,76, and black-box attacks61,77,78. White-box MIAs assume the auditor has full access to the target model (e.g., its architecture and parameters)70. For example, Leino et al.72 propose retraining a model on proxy data similar to the private training set and comparing activation maps between the retrained and original models to infer membership. By contrast, black-box MIAs operate under stricter assumptions, where adversaries can only query the model and observe its outputs, such as classification labels61,77. Li et al.77 introduce noise into inputs and assess the stability of model predictions; samples with stable labels are detected as training data. Gray-box MIAs fall between these extremes, assuming access to limited information such as model losses, predicted probabilities, or logits, but not the model parameters39. However, these traditional MIA methods are designed for classification models and cannot be directly applied to advanced generative models, such as LLMs, due to the fundamental differences in architecture and output style (i.e., discrete classification labels vs. free-form text).
Gray-box membership inference attacks for LLMs
Growing concerns over privacy and intellectual property misuse in LLM training have spurred active research on detecting whether specific data was used for training12,79,80,81. Given that many commercial LLMs are closed-source and that black-box detection is technically challenging, most prior work adopts a gray-box setting in which the auditor can access limited internal model inference details (e.g., token losses or logits) for the target inputs. Based on their technical insights, existing gray-box methods fall into two main lines. The first paradigm is based on token likelihoods and relies on the observation that models tend to assign higher probabilities to tokens in data samples they have been trained on22,23,24,25,26. For example, Shi et al.22 exploit this gap in the Min-K% method. They average the lowest K percent of token likelihoods in a sample, and samples with higher scores are marked as members. Zhang et al.23 refine this idea in Min-K%++, which normalizes each token likelihood by its full-vocabulary distribution. The second paradigm is perturbation-based detection27,41,82, which leverages the observation that training data and their perturbed versions exhibit more significant differences in generation feature distributions (e.g., likelihood) compared to non-training data. For example, Oren et al.82 propose permuting the order of the training dataset and measuring the change in model log-likelihood between the original and perturbed data for detection. Mattern et al.41 perturb a small subset of words in the original input and compare the resulting model loss to distinguish training data. Fu et al.27 use a language model to rewrite the original data and evaluate the change in generation probability on the target LLM to detect membership. In summary, these techniques assume access to the hidden inference features (e.g., model loss, generation likelihood, or probability distribution) of the target model for the tokens in the target sample.
However, in real-world applications, AI systems are typically highly opaque, and the hidden features for specific input data are rarely disclosed, which makes these methods impractical for auditing the unauthorized use of data for AI training. In contrast to prior methods, we adopt a full black-box detection assumption and introduce a novel method that relies only on the model-generated content to effectively detect training data. In addition, the core workflow of our method for detecting specific samples lies in designing a semantic confounding instruction task, which prompts the target AI model to recover the target elements from its information isotopes by filling the masked positions of the content where these elements originally appear, drawing connections to the broader class of perturbation-based detection methods. However, our method distinguishes itself by identifying training data through analyzing how variations in information isotope inputs affect the generation patterns via an instruction task. This capability, which has not been demonstrated in prior work, is notably more challenging than detecting shifts in model feature distributions, as model-generated content is substantially less informative and structured compared to feature-level outputs. The empirical evidence presented in Figs. 3 and 6 demonstrates that our method substantially outperforms the black-box variants of two state-of-the-art gray-box methods27,41, thereby supporting this assertion.
Label-only membership inference attacks for LLMs
In addition, there are few works in the label-only setting, all of which rely on surrogate models to approximate the target model’s generation probability distribution28,29,49. For example, He et al.29 employ a surrogate model (e.g., GPT-2) to estimate the likelihood of a target model (e.g., LLaMA or Claude) generating a specific token given a context. Similarly, Zhou et al.28 fine-tune a surrogate on the candidate dataset to simulate the generation behavior of a target model with or without training exposure, and then infer membership using that surrogate. Besides, Zhou et al.30 prompt the target LLM with the input corresponding to the target data and compute the similarity between the model generation and the ground-truth label. They then compare the resulting similarity scores between the suspect model and a set of locally constructed reference models, one fine-tuned on the target dataset and one not, to infer potential data usage. Although these methods can be effective when the surrogate and target models are closely aligned (e.g., GPT-2 vs. LLaMA-7B), their performance degrades significantly for large-scale closed-source LLMs, where no surrogate can reliably replicate the target model’s generation patterns. Moreover, the similarity-based approach30 is restricted to supervised learning scenarios with explicit labels and is inapplicable to pre-training data. In contrast, InfoTracer performs membership inference by using information isotopes to directly probe the memory of the target model, without requiring any surrogate. This design enables the method to generalize effectively across closed-source LLMs with diverse and unknown architectures or training pipelines. Results presented in Figs. 3, 5, and 6 further demonstrate the superior across-target generalization of our method compared with existing label-only methods. 
Moreover, Hallinan et al.42 propose a simple membership inference method by examining n-gram similarities between the original training samples and AI-generated continuations. However, our empirical analysis reveals that this method exhibits limited performance and generalization across diverse detection scenarios. In addition, Duarte et al.43 propose querying the target model with prompts such as “Which passage is from book X?” to assess whether the model has memorized specific title-content pairs. However, this method inherently relies on a title-content pair input data structure, since the prompt must explicitly reference the candidate document title. Moreover, its detection granularity is fundamentally constrained to long-form data. The results in Fig. 3 further corroborate these limitations. In contrast, our method generalizes more effectively across diverse detection scenarios, including title-free documents and short-form textual inputs. Moreover, Supplementary Table 5 offers a thorough comparison between our work and SOTA MIA methods for LLMs, detailing their underlying assumptions, required features, and empirical evaluation models. An expanded survey of recent MIA techniques for LLMs is also available in the Supplementary Information.
Watermarking for tracing training data of LLMs
Watermarking represents another important line of research in data auditing, providing verifiable evidence of training data usage. Existing watermarking-based methods typically first embed watermarks into the target data and subsequently detect the presence of these watermark signals in a suspected model to determine whether it was trained on the marked data46,83,84. For instance, Wei et al.83 introduce randomized data watermarks, such as random character sequences or visually similar Unicode symbols, into the data prior to release, then test whether these patterns are memorized by the suspected model. However, watermarking-based approaches inherently require modification of the original training data to embed the watermarks, which may compromise data quality. Moreover, such methods cannot detect unauthorized use of the large amount of unmarked data that has already been released. The inserted watermarks may also introduce abnormal patterns into the data, increasing the likelihood of detection and removal by model owners before training. In contrast, membership inference-based auditing methods do not require any modification of the original training data. They can be applied directly in a post-hoc manner to assess whether a model has memorized particular data instances. This makes membership inference more broadly applicable and generalizable in practical auditing scenarios.
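As a rough illustration of the embed-then-detect workflow described above, the sketch below generates a random character watermark, appends it to a document, and compares a memorization score for the true watermark against scores for freshly drawn random strings. It is a minimal sketch under stated assumptions and not the procedure of any cited work: `score_fn` is a hypothetical callable that returns a higher value when the suspected model appears to have memorized a string, and the z-score test is one plausible detection rule chosen here.

```python
import random
import statistics
import string
from typing import Callable


def make_watermark(length: int, rng: random.Random) -> str:
    """Draw a random lowercase character sequence to serve as a watermark."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def embed(document: str, watermark: str) -> str:
    """Append the watermark to a document before release (assumed placement)."""
    return f"{document}\n{watermark}"


def watermark_z_score(
    score_fn: Callable[[str], float],  # hypothetical memorization score of the suspect model
    watermark: str,
    n_null: int = 100,
    seed: int = 0,
) -> float:
    """Z-score of the true watermark's score against random null watermarks."""
    rng = random.Random(seed)
    null_scores = [score_fn(make_watermark(len(watermark), rng)) for _ in range(n_null)]
    mu = statistics.mean(null_scores)
    sigma = statistics.stdev(null_scores)
    return (score_fn(watermark) - mu) / sigma if sigma > 0 else 0.0
```

A large positive z-score would indicate that the suspected model treats the released watermark differently from unseen random strings. Note that this workflow only applies to data marked before release, which is the limitation discussed above.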
Data availability
All of the advanced AI APIs and data used in the study are publicly available and accessible from the following links. Target AIs: (1) LLaMA-1 Models (https://huggingface.co/huggyllama), (2) GPT-3.5-turbo and GPT-4o-2024-11-20 (https://platform.openai.com/), (3) Claude-3.0-Haiku (https://docs.anthropic.com/en/release-notes/api), (4) Gemini-1.5-Pro-001 and Gemini-1.5-Pro-002 (https://ai.google.dev/gemini-api/docs), (5) GLM-4-Air (https://open.bigmodel.cn/dev/api), (6) DeepSeek-V3 (https://api-docs.deepseek.com/), (7) Qwen-3 (https://qwenlm.github.io/blog/qwen3/), (8) Grok-3 (https://x.ai/api), (9) the medical LLM, and (10) the context-aware generator used in InfoTracer: LLaMA-3-8B (https://huggingface.co/meta-llama). Datasets: (1) the News dataset (https://github.com/spmede/InfoTracer), (2) the CODE dataset (https://github.com/spmede/InfoTracer), (3) WikiMIA (https://huggingface.co/datasets/swj0419/WikiMIA), (4) Book3 (https://huggingface.co/datasets/YnezT/Tiny-BookMIA), (5) the Medical dataset (https://github.com/spmede/InfoTracer), and (6) the NOVEL dataset (https://github.com/spmede/InfoTracer). We also provide the paraphrased version of WikiMIA in our open-source code repository. Source data are provided with this paper.
Code availability
The complete code is publicly available at https://github.com/spmede/InfoTracer and is archived on Zenodo85; the repository includes the code, data, and detailed instructions for reproducing the main experiments. We additionally release our algorithm as an open-source and user-friendly software tool to support its practical adoption and facilitate real-world applications. We also provide sufficient details in the Methods section and Supplementary Information to facilitate the implementation of the experiments in this work.
References
Dathathri, S. et al. Scalable watermarking for identifying large language model outputs. Nature 634, 818–823 (2024).
Heimberg, G. et al. A cell atlas foundation model for scalable search of similar human cells. Nature 638, 1085–1094 (2025).
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
Du, Z. et al. Glm: general language model pretraining with autoregressive blank infilling. In Proc. 60th Annual Meeting of the Association for Computational Linguistics 320–335 (ACL, 2022).
Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://doi.org/10.48550/arXiv.2001.08361 (2020).
Longpre, S. et al. A large-scale audit of dataset licensing and attribution in ai. Nat. Mach. Intell. 6, 975–987 (2024).
Widder, D. G., Whittaker, M. & West, S. M. Why openai systems are actually closed, and why this matters. Nature 635, 827–833 (2024).
Samuelson, P. Generative ai meets copyright. Science 381, 158–161 (2023).
Beigi, G. & Liu, H. A survey on privacy in social media: Identification, mitigation, and applications. ACM Trans. Data Sci. 1, 1–38 (2020).
Ziller, A. et al. Reconciling privacy and accuracy in AI for medical imaging. Nat. Mach. Intell. 6, 764–774 (2024).
Kim, S. et al. Propile: probing privacy leakage in large language models. In Proc. 37th Conference on Neural Information Processing Systems (NeurIPS 2023) 20750–20762 (Curran Associates, Inc., 2023).
Carlini, N. et al. Extracting training data from large language models. In Proc. USENIX Security Symposium 2633–2650 (USENIX Association, 2021).
Voigt, P. & Von dem Bussche, A. The EU General Data Protection Regulation (GDPR). A Practical Guide 1st edn (Springer International Publishing, 2017).
Pardau, S. L. The California consumer privacy act: towards a european-style privacy regime in the United States. J. Tech. Law Policy 23, 68 (2018).
Alzahrani, S. M., Salim, N. & Abraham, A. Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Trans. Syst. Man Cybern. 42, 133–149 (2011).
Foltýnek, T., Meuschke, N. & Gipp, B. Academic plagiarism detection: a systematic literature review. ACM Comput. Surv. 52, 1–42 (2019).
Min, R., Li, S., Chen, H. & Cheng, M. A watermark-conditioned diffusion model for ip protection. In Proc. Computer Vision – ECCV 2024: 18th European Conference 104–120 (Springer, 2024).
Zhu, P., Takahashi, T. & Kataoka, H. Watermark-embedded adversarial examples for copyright protection against diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 24420–24430 (IEEE, 2024).
Wu, X. et al. CGI_DM: digital copyright authentication for diffusion models via contrasting gradient inversion. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10812–10821 (IEEE, 2024).
Zhao, Z. et al. Can protective perturbation safeguard personal data from being exploited by stable diffusion? In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) 24398–24407 (IEEE, 2024).
Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://doi.org/10.48550/arXiv.2307.09288 (2023).
Shi, W. et al. Detecting pretraining data from large language models. In ICLR (OpenReview, 2024).
Zhang, J. et al. Min-k%++: improved baseline for detecting pre-training data from large language models. In ICLR (OpenReview, 2025).
Zhang, W. et al. Pretraining data detection for large language models: a divergence-based calibration method. In Proc. Conference on Empirical Methods in Natural Language Processing 5263–5274 (ACL, 2024).
Song, C., Zhao, D. & Xiang, J. Not all tokens are equal: Membership inference attacks against fine-tuned language models. In Proc. Annual Computer Security Applications Conference (ACSAC) 31–45 (IEEE, 2024).
Meeus, M., Shilov, I., Faysse, M. & De Montjoye, Y.-A. Copyright traps for large language models. In Proc. 41st International Conference on Machine Learning 35296–35309 (PMLR, 2024).
Fu, W. et al. Membership inference attacks against fine-tuned large language models via self-prompt calibration. In Proc. 38th International Conference on Neural Information Processing Systems 134981–135010 (Curran Associates, Inc., 2024).
Zhou, B. et al. Dpdllm: a black-box framework for detecting pre-training data from large language models. In Proc. Findings of the Association for Computational Linguistics 644–653 (ACL, 2024).
He, Y. et al. Towards label-only membership inference attack against pre-trained large language models. In Proc. 34th USENIX Conference on Security Symposium 1609–1628 (USENIX Association, 2025).
Zhou, R. et al. Blackbox dataset inference for LLM. Preprint at https://doi.org/10.48550/arXiv.2507.03619 (2025).
Chung, J., Kamar, E. & Amershi, S. Increasing diversity while maintaining accuracy: text data generation with large language models and human interventions. In Proc. 61st Annual Meeting of the Association for Computational Linguistics 575–593 (ACL, 2023).
Lee, N. et al. Factuality enhanced language models for open-ended text generation. In Proc. 36th International Conference on Neural Information Processing Systems 34586–34599 (Curran Associates, Inc., 2022).
Huang, J. et al. Large language models can self-improve. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing 1051–1068 (ACL, 2023).
Kocetkov, D. et al. The stack: 3 TB of permissively licensed source code. Preprint at https://doi.org/10.48550/arXiv.2211.15533 (2022).
Roziere, B. et al. Code llama: open foundation models for code. Preprint at https://doi.org/10.48550/arXiv.2308.12950 (2023).
Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out 74–81 (ACL, 2004).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: pre-training of deep bidirectional transformers for language understanding. In Proc. North American Chapter of the Association for Computational Linguistics 4171–4186 (ACL, 2019).
Ristad, E. S. & Yianilos, P. N. Learning string-edit distance. IEEE Trans. Pattern Anal. Mach. Intell. 20, 522–532 (2002).
Carlini, N. et al. Membership inference attacks from first principles. In S&P, 1897–1914 (IEEE, 2022).
Touvron, H. et al. Llama: open and efficient foundation language models. Preprint at https://doi.org/10.48550/arXiv.2302.13971 (2023).
Mattern, J. et al. Membership inference attacks against language models via neighbourhood comparison. In Proc. Findings of the Association for Computational Linguistics 11330–11343 (ACL, 2023).
Hallinan, S. et al. The surprising effectiveness of membership inference with simple n-gram coverage. In COLM (OpenReview, 2025).
Duarte, A. V., Zhao, X., Oliveira, A. L. & Li, L. De-cop: detecting copyrighted content in language models training data. In Proc. 41st International Conference on Machine Learning 11940–11956 (PMLR, 2025).
Cheng, D., Huang, S. & Wei, F. Adapting large language models via reading comprehension. In ICLR (OpenReview, 2023).
Rae, J. W. et al. Scaling language models: methods, analysis & insights from training Gopher. Preprint at https://doi.org/10.48550/arXiv.2112.11446 (2021).
Sander, T., Fernandez, P., Durmus, A., Douze, M. & Furon, T. Watermarking makes language models radioactive. In Proc. 38th International Conference on Neural Information Processing Systems 21079–21113 (Curran Associates, Inc., 2024).
Zhang, R., Hussain, S. S., Neekhara, P. & Koushanfar, F. Remark-llm: a robust and efficient watermarking framework for generative large language models. In Proc. 33rd USENIX Conference on Security Symposium 1813–1830 (USENIX Association, 2024).
Kirchenbauer, J. et al. A watermark for large language models. In Proc. 40th International Conference on Machine Learning 17061–17084 (PMLR, 2023).
Ravichander, A. et al. Information-guided identification of training data imprint in (proprietary) large language models. In Proc. Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies 1962–1978 (ACL, 2025).
Achiam, J. et al. Gpt-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2023).
Team, G. et al. Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. Preprint at https://doi.org/10.48550/arXiv.2403.05530 (2024).
GLM, T. et al. Chatglm: a family of large language models from GLM-130B to glm-4 all tools. Preprint at https://doi.org/10.48550/arXiv.2406.12793 (2024).
Liu, A. et al. Deepseek-v3 technical report. Preprint at https://doi.org/10.48550/arXiv.2412.19437 (2024).
Bai, J. et al. Qwen technical report. Preprint at https://doi.org/10.48550/arXiv.2309.16609 (2023).
Seed, B. et al. Seed-thinking-v1.5: advancing superb reasoning models with reinforcement learning. Preprint at https://doi.org/10.48550/arXiv.2504.13914 (2025).
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. Bertscore: evaluating text generation with BERT. In ICLR (OpenReview, 2020).
Shokri, R., Stronati, M., Song, C. & Shmatikov, V. Membership inference attacks against machine learning models. In Proc. Transactions on Dependable and Secure Computing 3–18 (IEEE, 2017).
Hu, H. et al. Membership inference attacks on machine learning: a survey. ACM Comput. Surv. 54, 1–37 (2022).
Hu, L. et al. Defenses to membership inference attacks: a survey. ACM Comput. Surv. 56, 1–34 (2023).
Duan, J., Kong, F., Wang, S., Shi, X. & Xu, K. Are diffusion models vulnerable to membership inference attacks? In Proc. 40th International Conference on Machine Learning 8717–8730 (PMLR, 2023).
Choquette-Choo, C. A., Tramer, F., Carlini, N. & Papernot, N. Label-only membership inference attacks. In Proc. 38th International Conference on Machine Learning 1964–1974 (PMLR, 2021).
Truex, S., Liu, L., Gursoy, M. E., Yu, L. & Wei, W. Demystifying membership inference attacks in machine learning as a service. IEEE Trans. Serv. Comput. 14, 2073–2089 (2019).
Ye, J., Maddi, A., Murakonda, S. K., Bindschaedler, V. & Shokri, R. Enhanced membership inference attacks against machine learning models. In Proc. SIGSAC Conference on Computer and Communications Security 3093–3106 (ACM, 2022).
Liu, L., Wang, Y., Liu, G., Peng, K. & Wang, C. Membership inference attacks against machine learning models via prediction sensitivity. IEEE Trans. Dependable Secure Comput. 20, 2341–2347 (2022).
Jia, J., Salem, A., Backes, M., Zhang, Y. & Gong, N. Z. Memguard: defending against black-box membership inference attacks via adversarial examples. In Proc. SIGSAC Conference on Computer and Communications Security 259–274 (ACM, 2019).
Wu, C. et al. Rethinking membership inference attacks against transfer learning. IEEE Trans. Inf. Forensics Secur. 19, 6441–6454 (2024).
Bertran, M. et al. Scalable membership inference attacks via quantile regression. In Proc. 37th International Conference on Neural Information Processing Systems 314–330 (Curran Associates, Inc., 2023).
Liu, Y., Zhao, Z., Backes, M. & Zhang, Y. Membership inference attacks by exploiting loss trajectory. In Proc. 2022 ACM SIGSAC Conference on Computer and Communications Security 2085–2098 (ACM, 2022).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Nasr, M., Shokri, R. & Houmansadr, A. Comprehensive privacy analysis of deep learning: passive and active white-box inference attacks against centralized and federated learning. In Proc. Symposium on Security and Privacy 739–753 (IEEE, 2019).
Melis, L., Song, C., De Cristofaro, E. & Shmatikov, V. Exploiting unintended feature leakage in collaborative learning. In Proc. 40th IEEE Symposium on Security & Privacy 691–706 (IEEE, 2019).
Leino, K. & Fredrikson, M. Stolen memories: leveraging model memorization for calibrated white-box membership inference. In USENIX Security Symposium 1605–1622 (USENIX Association, 2020).
Rezaei, S. & Liu, X. On the difficulty of membership inference attacks. In Proc. Computer Vision and Pattern Recognition 7892–7900 (IEEE, 2021).
Song, L. & Mittal, P. Systematic evaluation of privacy risks of machine learning models. In USENIX Security Symposium 2615–2632 (USENIX Association, 2021).
Song, C. & Raghunathan, A. Information leakage in embedding models. In Proc. 2020 ACM SIGSAC Conference on Computer and Communications Security 377–390 (ACM, 2020).
Long, Y. et al. A pragmatic approach to membership inferences on machine learning models. In Proc. European Symposium on Security and Privacy (EuroS&P) 521–534 (IEEE, 2020).
Li, Z. & Zhang, Y. Membership leakage in label-only exposures. In Proc. 2021 ACM SIGSAC Conference on Computer and Communications Security 880–895 (ACM, 2021).
Yeom, S., Giacomelli, I., Fredrikson, M. & Jha, S. Privacy risk in machine learning: analyzing the connection to overfitting. In Proc. IEEE Computer Security Foundations Symposium 268–282 (IEEE, 2018).
Li, Q. et al. Llm-pbe: assessing data privacy in large language models. Preprint at https://doi.org/10.48550/arXiv.2408.12787 (2024).
Reuel, A. et al. Open problems in technical AI governance. Transactions on Machine Learning Research (2025).
Ishihara, S. Training data extraction from pre-trained language models: a survey. In TrustNLP, 260–275 (ACL, 2023).
Oren, Y., Meister, N., Chatterji, N. S., Ladhak, F. & Hashimoto, T. Proving test set contamination in black-box language models. In ICLR (OpenReview, 2023).
Wei, J., Wang, R. & Jia, R. Proving membership in llm pretraining data via data watermarks. In Proc. Findings of the Association for Computational Linguistics 13306–13320 (ACL, 2024).
Zhao, Z. et al. Can watermarks be used to detect LLM IP infringement for free? In ICLR (OpenReview, 2025).
Qi, T. et al. Auditing unauthorized training data from ai generated content using information isotopes. Zenodo https://doi.org/10.5281/zenodo.18107821 (2025).
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant numbers 62425203 (S.W.), 62502044 (T.Q.), U2336208 (Y.H.), 82090053 (Y.H.), 62032003 (S.W.); Tsinghua University Initiative Scientific Research Program of Precision Medicine under Grant number 2022ZLA007 (Y.H.); Beijing Natural Science Foundation under Grant number L253005 (S.W.); CCF-SANGFOR Research Fund under Grant number 20240202 (T.Q.); Research Initiation Project for Introduced Talents of BUPT under Grant number 2025KYQD11 (T.Q.); the Royal Academy of Engineering via DANTE, a RAEng Chair (N.L.); the European Research Council, specifically the REDIAL project (N.L.); SPRIND under the composite learning challenge (N.L.); and Google through a Google Academic Research Award (N.L.). We also thank Wendan Wang, Haoran Zheng, Yuanhong Huang, Jinrui Wang, Jiajun Liu, Yi Luo, Qian Li, and Qing Li for their helpful discussions and support in this work.
Author information
Authors and Affiliations
Contributions
T.Q., Y.H., and S.W. coordinated and supervised the research project with assistance from N.L. and L.L. T.Q., J.Y., S.W., and Y.H. conceived the idea of this work. T.Q., J.Y., D.C., and N.L. designed the detection algorithm. J.Y., H.W., P.Y., and Z.Z. implemented the algorithms for experiments. T.Q., J.Y., and C.W. constructed the CODE dataset and implemented the algorithms for the corresponding experiments. T.Q., H.W., Z.H., G.N., Y.X., and Z.H. collected and constructed the News datasets for experiments. T.Q., J.Y., C.W., D.C., L.L., and N.L. analyzed the results. All authors contributed to the writing of this paper.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Qi, T., Yin, J., Cai, D. et al. Auditing unauthorized training data from AI generated content using information isotopes. Nat Commun 17, 3007 (2026). https://doi.org/10.1038/s41467-026-68862-x
DOI: https://doi.org/10.1038/s41467-026-68862-x








