As more powerful generative AI tools appear on the market, legal debates about the use of copyrighted content to develop such tools are intensifying. To resolve these issues, transparency about which copyrighted data have been used, and at which stages of the AI training pipeline, needs to be the starting point.
Last year saw major advances in the generation of artificial intelligence (AI)-based videos, with OpenAI launching Sora and Google unveiling Veo 2. Both tools demonstrate impressive results, creating stunning videos from simple prompts. These video generation tools are a natural next step for generative AI models, following the success of text and image generation tools. However, the transformative and disruptive economic effects, and questions over ownership, could be immediate, with the media, art and entertainment industries directly affected. The release of Sora in the UK last month led creative industry representatives to voice strong legal concerns, as the tool is trained on artists’ output without consent or payment1.
Last month also saw the release of an album of entirely silent tracks, produced by over 1,000 musicians in protest against proposed changes to copyright law in the UK2. These changes would give AI companies more freedom to harvest content from the internet. The album, entitled ‘Is this what we want?’, is part of a wider protest by artists, media and the news industry, making the case that training generative AI on vast amounts of internet data, much of it copyright-protected, is happening unfairly, without consent or payment. On the other side, AI tech companies such as OpenAI argue that they follow the principle of ‘fair use’, which allows data mining to generate new outputs, provided that the original content is not copied and the output does not compete directly with the original work. In practice, however, the fair use doctrine remains a legal grey area, as mentioned even on the website of Sora.AI.
With the release and adoption of video generation tools, the number of legal disputes is likely to increase. Several court cases have emerged in the past few years, perhaps the most high-profile being a federal lawsuit for copyright infringement filed against OpenAI by the New York Times together with two other publishers3. The plaintiffs accuse OpenAI of including millions of valuable, copyright-protected works from news outlets in its training data without consent or payment, and argue that their content has been used to train models that reproduce similar works in direct competition with their service. The latter point in particular would directly challenge OpenAI’s claim that its use of copyrighted data in training adheres to fair use.
It could be argued that current legal frameworks are not fit for purpose in the age of generative AI. The legal uncertainty affects all parties involved, including small AI companies trying to develop useful tools and content creators who could use generative AI tools in innovative ways. An important underlying issue is that it is often unclear what training data have been used in the development of generative AI models. The UK copyright law proposal2 suggests that data mining could be made more permissible to encourage economic development in both the AI and the entertainment industries, with the caveat that copyright holders can reserve their rights and opt out. However, this may be near-impossible in practice, given the current lack of openness about how training data were acquired and what data were used. It is common practice to combine, repackage and re-release datasets, leading to unclear or incomplete information on data origins and licensing4. Technical methods to prove that particular data were used in the training of a specific model are unreliable, and might remain so, owing to the stochastic training process of large models.
Therefore, to move legal debates along, serious efforts are needed to increase transparency in the training pipeline of generative AI tools. Contributing to this goal, Longpre et al.4 describe a large-scale audit of over 1,800 text-based datasets in a paper published in this journal last year. They developed tools to trace the origins and licensing details of these datasets, allowing for proper attribution. Another technical challenge that needs solving is how to ‘unlearn’ data from a trained large language model (LLM) when required for copyright protection or other reasons. A recent Review article5 analyses current methods and practices in machine unlearning and highlights that further developments are required before such tools can be reliably deployed in a range of LLM-based applications.
In the current geopolitical climate, countries are focused on competing in AI development and seem keen to loosen restrictions rather than introduce more. However, policy developments are not keeping up with the rapid pace of generative AI tool development, and legal risks could soon hang over companies and content creators who develop and use these tools. At a minimum, insisting on greater transparency from AI companies about the training data they use seems a constructive and reasonable goal.
References
1. Milmo, D. The Guardian https://go.nature.com/41NKdsT (28 February 2025).
2. Copyright and AI: Consultation https://go.nature.com/4bQqTzR (UK Government, 17 December 2024).
3. Allyn, B. New York Times https://go.nature.com/4irQ5iJ (14 January 2025).
4. Longpre, S. et al. Nat. Mach. Intell. 6, 975–987 (2024).
5. Liu, S. et al. Nat. Mach. Intell. 7, 181–194 (2025).