Abstract
This paper presents a structured, scene-level dataset of movie content that addresses the limitations of previous research relying on small or non-standardized screenplay collections. Such collections often lack consistent scene representations and actor metadata and use draft versions that differ from the final films, limiting both the scale and the accuracy of content-level analysis. To overcome these limitations, we compile scene breakdowns for 3,265 movies from Amazon X-Ray in the US Amazon Prime Video market, detailing the characters appearing in each scene and linking them to their corresponding IMDb IDs. Subtitles are included for a subset of 3,110 movies, providing complementary dialogue-level data, and each title is linked to its IMDb ID so that additional metadata can be attached for extended analyses. Together, these resources allow accurate, large-scale analyses of on-screen representation, character interactions, and narrative structure that were not feasible with earlier screenplay-based datasets. The dataset improves the consistency and accessibility of movie data, providing a reliable foundation for quantitative research on films as cultural artifacts.
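As a rough illustration of the character-interaction analyses that scene-level character lists make possible, the following minimal sketch counts how often pairs of characters share a scene. The record layout, the field names (`scene`, `characters`), and the placeholder IDs are assumptions made for this example only; they are not the dataset's actual schema.

```python
from collections import Counter
from itertools import combinations

# Hypothetical scene records: field names and IDs are illustrative
# assumptions, not the dataset's actual schema.
scenes = [
    {"scene": 1, "characters": ["nm0000001", "nm0000002"]},
    {"scene": 2, "characters": ["nm0000001", "nm0000002", "nm0000003"]},
    {"scene": 3, "characters": ["nm0000003"]},
]


def cooccurrence_counts(scenes):
    """Count how many scenes each unordered pair of characters shares."""
    counts = Counter()
    for scene in scenes:
        # Deduplicate and sort so each pair has one canonical ordering.
        cast = sorted(set(scene["characters"]))
        counts.update(combinations(cast, 2))
    return counts


pair_counts = cooccurrence_counts(scenes)
```

The resulting pair counts could serve as edge weights in a character co-occurrence network, one common starting point for studying interactions and narrative structure.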
Data availability
The dataset is available on Zenodo at https://doi.org/10.5281/zenodo.17659734.
Code availability
Code is available on GitHub at https://github.com/safal312/xray-collector.
References
Belton, J. Movies and mass culture (Bloomsbury Publishing, 1996).
Grindstaff, L. & Turow, J. Video cultures: Television sociology in the “new TV” age. Annual Review of Sociology 32, 103–125 (2006).
Reagan, A. J., Mitchell, L., Kiley, D., Danforth, C. M. & Dodds, P. S. The emotional arcs of stories are dominated by six basic shapes. EPJ Data Science 5, 1–12 (2016).
Park, M., Park, J., Rojas, F. & Ahn, Y.-Y. Rap music as a social reflection: Exploring the relationship between social conditions and expressions of violence and materialism in rap lyrics. SocArXiv (2024).
Park, M., Thom, J., Mennicken, S., Cramer, H. & Macy, M. Global music streaming data reveal diurnal and seasonal patterns of affective preference. Nature Human Behaviour 3, 230–236 (2019).
Lee, H. et al. Global music discoveries reveal cultural shifts during the war in Ukraine. PsyArXiv (2024).
Liu, L., Dehmamy, N., Chown, J., Giles, C. L. & Wang, D. Understanding the onset of hot streaks across artistic, cultural, and scientific careers. Nature Communications 12, 5392 (2021).
Lee, K., Park, J., Goree, S., Crandall, D. & Ahn, Y.-Y. Social signals predict contemporary art prices better than visual features, particularly in emerging markets. Scientific Reports 14, 11615 (2024).
McDonnell, T. E. Cultural objects, material culture, and materiality. Annual Review of Sociology 49, 195–220 (2023).
Park, M., Weber, I., Naaman, M. & Vieweg, S. Understanding musical diversity via online social media. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 9, 308–317 (2015).
Park, M., Park, J., Baek, Y. M. & Macy, M. Cultural values and cross-cultural video consumption on YouTube. PLoS ONE 12, e0177865 (2017).
Ramakrishna, A., Martínez, V. R., Malandrakis, N., Singla, K. & Narayanan, S. Linguistic analysis of differences in portrayal of movie characters. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1669–1678 (2017).
Gorinski, P. J. & Lapata, M. Movie script summarization as graph-based scene extraction. In Mihalcea, R., Chai, J. & Sarkar, A. (eds.) Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1066–1076, https://doi.org/10.3115/v1/N15-1113 (2015).
Davies, M. The Corpus of Contemporary American English (COCA). Available online at https://www.english-corpora.org/coca/ (2008).
Kagan, D., Chesney, T. & Fire, M. Using data science to understand the film industry’s gender gap. Palgrave Communications 6, 1–16 (2020).
Tran, Q. D. & Jung, J. E. Cocharnet: Extracting social networks using character co-occurrence in movies. J. Univers. Comput. Sci. 21, 796–815 (2015).
Malik, M., Hopp, F. R. & Weber, R. Representations of Racial Minorities in Popular Movies. Computational Communication Research 4, https://doi.org/10.5117/CCR2022.1.006.MALI (2022).
Agarwal, A., Zheng, J., Kamath, S., Balasubramanian, S. & Dey, S. A. Key female characters in film have more to talk about besides men: Automating the Bechdel test. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 830–840 (2015).
Lee, O.-J. & Jung, J. J. Story embedding: Learning distributed representations of stories based on character networks. Artificial Intelligence 281, 103235, https://doi.org/10.1016/j.artint.2020.103235 (2020).
Mourchid, Y. et al. Movienet: a movie multilayer network model using visual and textual semantic cues. Applied Network Science 4, 121, https://doi.org/10.1007/s41109-019-0226-0 (2019).
Kaminski, J., Schober, M., Albaladejo, R., Zastupailo, O. & Hidalgo, C. Moviegalaxies - Social Networks in Movies, https://doi.org/10.7910/DVN/T4HBA3 (2018).
Agarwal, A., Balasubramanian, S., Zheng, J. & Dash, S. Parsing screenplays for extracting social networks from movies. In Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL), 50–58 (2014).
Lee, O.-J., Jo, N. & Jung, J. J. Measuring character-based story similarity by analyzing movie scripts. In Text2Story@ ECIR, 41–45 (2018).
Ju, X. et al. Miradata: A large-scale video dataset with long durations and structured captions. Advances in Neural Information Processing Systems 37, 48955–48970 (2024).
Zhang, Q., Yue, Z., Hu, A., Wang, Z. & Jin, Q. MovieUN: A dataset for movie understanding and narrating. In Goldberg, Y., Kozareva, Z. & Zhang, Y. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2022, 1873–1885, https://doi.org/10.18653/v1/2022.findings-emnlp.135 (Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022).
Chen, L. et al. Sharegpt4video: Improving video understanding and generation with better captions. Advances in Neural Information Processing Systems 37, 19472–19495 (2024).
Kayal, P., Mettes, P., Dehmamy, N. & Park, M. Large language models are natural video popularity predictors. In Che, W., Nabende, J., Shutova, E. & Pilehvar, M. T. (eds.) Findings of the Association for Computational Linguistics: ACL 2025, 11432–11464, https://doi.org/10.18653/v1/2025.findings-acl.597 (Association for Computational Linguistics, Vienna, Austria, 2025).
Selenium wire. https://pypi.org/project/selenium-wire/. Accessed: August 2023.
Selenium. https://www.selenium.dev/. Accessed: August 2023.
Unidecode. https://pypi.org/project/Unidecode/. Accessed: August 2023.
Beautifulsoup. https://beautiful-soup-4.readthedocs.io/en/latest/. Accessed: August 2023.
Poggel, L. & Fischer, F. Automatic extraction of network data from Amazon Prime videos (using ‘1917’ as an example). https://weltliteratur.net/extracting-network-data-from-amazon-prime-videos/ (2022).
Cinemagoer. https://cinemagoer.github.io/. Accessed: September 2023.
Introducing ‘X-Ray for Movies,’ powered by IMDb and available exclusively on the all-new Kindle Fire family. Amazon.com Press Center (2012).
Shrestha, S., Heo, Y., Barron, A. T. & Park, M. Scene-level movie data from Amazon X-Ray in the US market combined with IMDb, https://doi.org/10.5281/zenodo.17659734 (2025).
Acknowledgements
This work was partially supported by the NYUAD Center for Interacting Urban Networks (CITIES), funded by Tamkeen under the NYUAD Research Institute Award CG001.
Author information
Contributions
S.S. and M.P. conceived of the data. S.S. and Y.H. harvested, processed, and validated the data with M.P.’s help. M.P. and A.T.J.B. supervised the project. M.P., Y.H., S.S., and A.T.J.B. wrote the manuscript. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Shrestha, S., Heo, Y., Barron, A.T.J. et al. Scene-level movie data from Amazon X-Ray in the US market combined with IMDb. Sci Data (2026). https://doi.org/10.1038/s41597-026-06602-y
DOI: https://doi.org/10.1038/s41597-026-06602-y


