Scientific Data
Scene-level movie data from Amazon X-Ray in the US market combined with IMDb
  • Data Descriptor
  • Open access
  • Published: 20 January 2026

  • Safal Shrestha,
  • Yeonie Heo,
  • Alexander T. J. Barron (ORCID: orcid.org/0000-0002-3761-7683) &
  • Minsu Park (ORCID: orcid.org/0000-0002-9610-2938)

Scientific Data (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Arts
  • Communication
  • Sociology

Abstract

This paper presents a structured, scene-level dataset of movie content that addresses the limitations of previous research relying on small or non-standardized screenplay collections. Such collections often lack consistent scene representations and actor metadata, and they use draft versions that differ from the final cinematic products, limiting both the scale and the accuracy of content-level analysis. To overcome these limitations, we compile scene breakdowns for 3,265 movies from Amazon X-Ray in the US Amazon Prime Video market, detailing the characters appearing in each scene and linking them to their corresponding IMDb IDs. Subtitles are included for a subset of 3,110 movies, providing complementary dialogue-level data, and each title is linked to its corresponding IMDb ID to enable augmentation with additional metadata for extended analyses. Integrating these resources can enable accurate, large-scale analyses of on-screen representation, character interactions, and narrative structure that were not feasible with earlier screenplay-based datasets. This dataset enhances the consistency and accessibility of movie data, providing a reliable stepping stone for quantitative research on films as cultural artifacts.

Data availability

The dataset is available on Zenodo at https://doi.org/10.5281/zenodo.17659734 (ref. 35).

Code availability

Code is available on GitHub at https://github.com/safal312/xray-collector.

References

  1. Belton, J. Movies and mass culture (Bloomsbury Publishing, 1996).

  2. Grindstaff, L. & Turow, J. Video cultures: Television sociology in the “new tv” age. Annual Review of Sociology 32, 103–125 (2006).

  3. Reagan, A. J., Mitchell, L., Kiley, D., Danforth, C. M. & Dodds, P. S. The emotional arcs of stories are dominated by six basic shapes. EPJ data science 5, 1–12 (2016).

  4. Park, M., Park, J., Rojas, F. & Ahn, Y.-Y. Rap music as a social reflection: Exploring the relationship between social conditions and expressions of violence and materialism in rap lyrics. SocArXiv (2024).

  5. Park, M., Thom, J., Mennicken, S., Cramer, H. & Macy, M. Global music streaming data reveal diurnal and seasonal patterns of affective preference. Nature Human Behaviour 3, 230–236 (2019).

  6. Lee, H. et al. Global music discoveries reveal cultural shifts during the war in ukraine. PsyArXiv (2024).

  7. Liu, L., Dehmamy, N., Chown, J., Giles, C. L. & Wang, D. Understanding the onset of hot streaks across artistic, cultural, and scientific careers. Nature Communications 12, 5392 (2021).

  8. Lee, K., Park, J., Goree, S., Crandall, D. & Ahn, Y.-Y. Social signals predict contemporary art prices better than visual features, particularly in emerging markets. Scientific Reports 14, 11615 (2024).

  9. McDonnell, T. E. Cultural objects, material culture, and materiality. Annual Review of Sociology 49, 195–220 (2023).

  10. Park, M., Weber, I., Naaman, M. & Vieweg, S. Understanding musical diversity via online social media. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 9, 308–317 (2015).

  11. Park, M., Park, J., Baek, Y. M. & Macy, M. Cultural values and cross-cultural video consumption on youtube. PLoS ONE 12, e0177865 (2017).

  12. Ramakrishna, A., Martínez, V. R., Malandrakis, N., Singla, K. & Narayanan, S. Linguistic analysis of differences in portrayal of movie characters. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1669–1678 (2017).

  13. Gorinski, P. J. & Lapata, M. Movie script summarization as graph-based scene extraction. In Mihalcea, R., Chai, J. & Sarkar, A. (eds.) Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1066–1076, https://doi.org/10.3115/v1/N15-1113 (2015).

  14. Davies, M. The Corpus of Contemporary American English (COCA). Available online at https://www.english-corpora.org/coca/ (2008).

  15. Kagan, D., Chesney, T. & Fire, M. Using data science to understand the film industry’s gender gap. Palgrave Communications 6, 1–16 (2020).

  16. Tran, Q. D. & Jung, J. E. Cocharnet: Extracting social networks using character co-occurrence in movies. J. Univers. Comput. Sci. 21, 796–815 (2015).

  17. Malik, M., Hopp, F. R. & Weber, R. Representations of Racial Minorities in Popular Movies. Computational Communication Research 4, https://doi.org/10.5117/CCR2022.1.006.MALI (2022).

  18. Agarwal, A., Zheng, J., Kamath, S., Balasubramanian, S. & Dey, S. A. Key female characters in film have more to talk about besides men: Automating the bechdel test. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 830–840 (2015).

  19. Lee, O.-J. & Jung, J. J. Story embedding: Learning distributed representations of stories based on character networks. Artificial Intelligence 281, 103235, https://doi.org/10.1016/j.artint.2020.103235 (2020).

  20. Mourchid, Y. et al. Movienet: a movie multilayer network model using visual and textual semantic cues. Applied Network Science 4, 121, https://doi.org/10.1007/s41109-019-0226-0 (2019).

  21. Kaminski, J., Schober, M., Albaladejo, R., Zastupailo, O. & Hidalgo, C. Moviegalaxies - Social Networks in Movies, https://doi.org/10.7910/DVN/T4HBA3 (2018).

  22. Agarwal, A., Balasubramanian, S., Zheng, J. & Dash, S. Parsing screenplays for extracting social networks from movies. In Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL), 50–58 (2014).

  23. Lee, O.-J., Jo, N. & Jung, J. J. Measuring character-based story similarity by analyzing movie scripts. In Text2Story@ ECIR, 41–45 (2018).

  24. Ju, X. et al. Miradata: A large-scale video dataset with long durations and structured captions. Advances in Neural Information Processing Systems 37, 48955–48970 (2024).

  25. Zhang, Q., Yue, Z., Hu, A., Wang, Z. & Jin, Q. MovieUN: A dataset for movie understanding and narrating. In Goldberg, Y., Kozareva, Z. & Zhang, Y. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2022, 1873–1885, https://doi.org/10.18653/v1/2022.findings-emnlp.135 (Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022).

  26. Chen, L. et al. Sharegpt4video: Improving video understanding and generation with better captions. Advances in Neural Information Processing Systems 37, 19472–19495 (2024).

  27. Kayal, P., Mettes, P., Dehmamy, N. & Park, M. Large language models are natural video popularity predictors. In Che, W., Nabende, J., Shutova, E. & Pilehvar, M. T. (eds.) Findings of the Association for Computational Linguistics: ACL 2025, 11432–11464, https://doi.org/10.18653/v1/2025.findings-acl.597 (Association for Computational Linguistics, Vienna, Austria, 2025).

  28. Selenium wire. https://pypi.org/project/selenium-wire/. Accessed: August 2023.

  29. Selenium. https://www.selenium.dev/. Accessed: August 2023.

  30. Unidecode. https://pypi.org/project/Unidecode/. Accessed: August 2023.

  31. Beautifulsoup. https://beautiful-soup-4.readthedocs.io/en/latest/. Accessed: August 2023.

  32. Poggel, L. & Fischer, F. Automatic extraction of network data from amazon prime videos (using ‘1917’ as an example). https://weltliteratur.net/extracting-network-data-from-amazon-prime-videos/ (2022).

  33. Cinemagoer. https://cinemagoer.github.io/. Accessed: September 2023.

  34. Introducing ‘x-ray for movies,’ powered by imdb and available exclusively on the all-new kindle fire family. Amazon.com press center (2012).

  35. Shrestha, S., Heo, Y., Barron, A. T. & Park, M. Scene-level movie data from Amazon X-Ray in the us market combined with IMDb, https://doi.org/10.5281/zenodo.17659734 (2025).

Acknowledgements

This work was partially supported by the NYUAD Center for Interacting Urban Networks (CITIES), funded by Tamkeen under the NYUAD Research Institute Award CG001.

Author information

Author notes
  1. These authors contributed equally: Safal Shrestha, Yeonie Heo.

Authors and Affiliations

  1. New York University Abu Dhabi, Abu Dhabi, UAE

    Safal Shrestha, Yeonie Heo, Alexander T. J. Barron & Minsu Park

Contributions

S.S. and M.P. conceived of the data. S.S. and Y.H. harvested, processed, and validated the data with M.P.’s help. M.P. and A.T.J.B. supervised the project. M.P., Y.H., S.S., and A.T.J.B. wrote the manuscript. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Alexander T. J. Barron or Minsu Park.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Shrestha, S., Heo, Y., Barron, A.T.J. et al. Scene-level movie data from Amazon X-Ray in the US market combined with IMDb. Sci Data (2026). https://doi.org/10.1038/s41597-026-06602-y

  • Received: 14 November 2024

  • Accepted: 09 January 2026

  • Published: 20 January 2026

  • DOI: https://doi.org/10.1038/s41597-026-06602-y

Scientific Data (Sci Data)

ISSN 2052-4463 (online)

© 2026 Springer Nature Limited
