Background & Summary

Movies are one of the most influential forms of cultural expression, playing a critical role in shaping and reflecting societal norms, values, and identities1,2. Despite their global reach and cultural significance, research on films has been largely limited to genre-level or metadata-based analysis, lacking the depth of content-level examination that other art forms have enjoyed. Literature, music, and visual arts have benefited from detailed, large-scale analyses, ranging from textual analysis in literature and lyrics3,4 to acoustic and visual analysis in music5,6 and art7,8. These content-driven methodologies have enabled deeper exploration of themes, narratives, and societal impact across time and space3,5,6,8,9,10,11. However, movies, which are equally or even more widespread and accessible than these other cultural forms, have not received comparable analytical attention. This gap has hindered our ability to fully leverage films as complex social and cultural artifacts, largely due to the limited availability of comprehensive and accurate content-level data sources.

Existing sources for content-level film analysis, such as screenplays12,13 and large-scale subtitle corpora14, have provided useful but incomplete insights. Screenplays offer rich narrative details, including dialogues, scene descriptions, and technical notes, but often exist only as early drafts that diverge from the final cinematic product. Subtitles capture spoken dialogue along with speaker labels and sound effects, but omit the visual and non-verbal dimensions of scenes. Critically, both sources lack the ability to accurately map narrative content to precise character identities and temporal boundaries of scenes, limiting the potential for reliable, character- and scene-centric investigations. As a result, previous computational work on film, such as character network or demographic representation analysis15,16,17,18,19,20,21,22, has often relied on heuristic or error-prone extraction methods from these text-based resources.

More specifically, many recent studies have employed network abstractions of character interaction, using social network analysis to measure differences in demographic representation15,16,17,18 and applying advances in graph embeddings19,20. These studies typically rely on scene co-occurrence networks extracted from scripts and subtitles16,20,23. However, this approach faces compounding challenges. Beyond the fundamental issue of discrepancies between publicly available scripts and their final filmed versions, the network extraction process itself presents significant difficulties. Character disambiguation remains a well-known problem15, making it difficult to reliably construct networks and map characters to rich external metadata, such as actor demographics (e.g., gender, race, age) available on platforms like IMDb.

In this context, Amazon X-Ray provides a unique and reliable source of information that can overcome many of these issues and complement existing sources. It contains curated, scene-level information about all visible characters in a movie, including non-speaking roles, thereby enabling precise reconstruction of character (co-)occurrence within each scene. These data are synchronized with on-screen content rather than inferred from textual cues, allowing for more accurate representation of narrative dynamics. Each character is also linked to an IMDb identifier, providing access to rich, structured metadata (e.g., gender, race, and occupation). However, X-Ray metadata lacks IMDb identifiers at the movie level and remains embedded within Amazon’s proprietary ecosystem, limiting its direct reuse for research.

To overcome these limitations, our work contributes in three primary ways: (1) large-scale harvesting and processing of X-Ray data, including scene-level character information and associated subtitles, from the U.S. Amazon Prime Video platform; (2) accurate mapping of movies to their corresponding IMDb identifiers using an automated and validated title- and cast-based matching pipeline; and (3) systematic validation of data coverage and accuracy, providing useful assessments on reproducibility and representativeness across decades and genres.

The dataset includes 3,265 movies with scene-level breakdowns of character appearances, linked to IMDb IDs at both the character and movie level. A subset of 3,110 movies additionally includes corresponding subtitles. Specifically, the scene breakdowns derived from Amazon X-Ray provide precise start and end timestamps that delineate the temporal boundaries of each scene with character appearances, enabling clear segmentation of the film’s structure. The subtitle data, in turn, contain start and end timestamps for every line of dialogue, making it straightforward to determine the exact scene in which each line was spoken. Researchers can further enrich the dataset by retrieving additional metadata directly from IMDb using the provided identifiers, enabling a range of analyses spanning representation, screen time, language use, and network structure. Note that although we are unable to include screenplay data due to legal restrictions on redistribution, we provide open-source code that allows researchers to independently expand the dataset by collecting or integrating legally permissible materials.

Our dataset has limitations that warrant acknowledgment. It relies on Amazon’s internal, proprietary processes for X-Ray data generation, and Amazon controls which movies are available at any given time (see the Technical Validation section for the representativeness of the data provided in this Descriptor, relative to award-winning movies by decade). Despite these constraints, we believe this dataset offers substantial net improvements over previous methods and sources and can lead to a significant advancement in film analysis that brings it closer to the depth of exploration that literature, music, and art have long enjoyed. Moreover, as video understanding emerges as a key research area in AI and machine learning24,25,26,27, while relatively small, this dataset provides a high-quality, structured resource to help advance new computational models and analyses. By making this comprehensive content-level dataset publicly available, we offer researchers a valuable tool to explore underrepresented areas of analysis in the broader domain of culture and creative work.

Methods

Our data collection pipeline comprised several steps of retrieval and refinement, as illustrated in Fig. 1 and described in this section.

Fig. 1

Data retrieval and processing pipeline. This pipeline processes 3,265 X-Ray movies and matches them with 3,110 associated subtitles.

Fig. 2

Structure of the augmented Amazon X-Ray Dataset. All X-Ray movie data produced and cleaned through our pipeline are organized by this directory and file schema. Locations of specific data are described in the top-level metadata files metadata_with_subtitles_tmdb.csv and final_all_cast_with_duplicates.csv, which use keys to index files in the indicated subdirectories. An explanatory notebook data_query_examples.ipynb, included in the dataset repository, shows how to query the data with several examples. See Tables 1, 2, and 3 for descriptions of all CSV files.

Retrieval of Movie Entries from Amazon US

Defining retrieval scope

Due to intellectual property laws, Amazon Prime Video offers different selections of movies and TV series across various regions. Our data resource focuses on the US market, collecting data from the Amazon US website (https://www.amazon.com/gp/video/storefront) in August 2023. At the time of collection, the US Amazon website featured a catalog of movies and TV series under the Prime Video category. We chose to collect only movies bundled with Prime, which did not incur additional costs beyond the Prime subscription, ensuring the broadest possible audience for the corpus.

Retrieving initial data

We used the selenium-wire browser automation library28, an extension of selenium29 that allows inspection of browser requests and responses, for data collection. The Amazon US website limits pagination, allowing navigation up to page 400. To overcome this limitation, we employed a filtering approach to access all movie entries in successive cohorts. First, we gathered movies marked as “Included with Prime” without applying additional filters. We then expanded this initial collection by filtering movies by their release year in decade-based batches: before 2010, between 2010 and 2020, and after 2020. Although Amazon’s filtering is not always accurate, this approach increased data recall when merging results across cohorts.

Processing entries and removing duplicates

Each entry we retrieved included its page URL and film title. Through manual inspection, we found that multiple films could share the same title, and a single film could have multiple titles. To remove duplicates, we used the film title and a portion of the unique URL from the Prime Video page, as shown in Fig. 3. Entries were identified as duplicates if they shared the same “title and URL portion” pair. This heuristic successfully de-duplicated most of the movies in our dataset, resulting in 11,128 entries with links to their respective Prime Video pages.
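The de-duplication heuristic can be sketched as follows. This is a minimal illustration, not the released implementation: the entry field names (`title`, `url`) and the way the title portion is taken from the URL path are assumptions for demonstration.

```python
# Minimal sketch of the (title, URL-portion) de-duplication heuristic.
# Field names `title` and `url` are illustrative assumptions.
from urllib.parse import urlparse

def url_title_portion(url: str) -> str:
    """Extract the title portion of a Prime Video URL path,
    e.g. '.../Philomena/dp/B00ABC123' -> 'Philomena'."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    return parts[0] if parts else ""

def deduplicate(entries):
    """Keep the first entry seen for each (title, URL portion) pair."""
    seen, unique = set(), []
    for entry in entries:
        key = (entry["title"], url_title_portion(entry["url"]))
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique
```

Two listings with the same title but different title portions in their URLs (e.g., a film and its remake) are kept as distinct entries under this rule.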

Fig. 3

Example of a duplicate movie based on Prime Video URL and title. Since these two entries have identical titles and identical title portions of their URLs, they are considered the same movie.

For each entry, we created a unique identifier using the title collected at this step. This identifier, used as a directory name, was constructed by transliterating any non-ASCII characters in the title to ASCII with the unidecode library30, removing non-alphanumeric characters, and replacing spaces with underscores. Each movie identifier was also prefixed with its index within the batch. For example, a movie titled “12 Days with God” was mapped to “1265_12_Days_with_God,” where 1265 is the index and 12_Days_with_God is the processed title.
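A sketch of this identifier construction is shown below; the exact normalization in our released code may differ in minor details, and a standard-library fallback is included so the snippet runs even where unidecode is not installed.

```python
# Sketch of the movie-identifier construction described above.
import re

try:
    from unidecode import unidecode  # transliterates non-ASCII to ASCII
except ImportError:
    # Fallback for illustration only: drop non-ASCII instead of transliterating.
    def unidecode(s: str) -> str:
        return s.encode("ascii", "ignore").decode()

def make_movie_id(index: int, title: str) -> str:
    """Build an identifier like '1265_12_Days_with_God'."""
    ascii_title = unidecode(title)
    # Keep letters, digits, and spaces; drop all other characters.
    cleaned = re.sub(r"[^A-Za-z0-9 ]", "", ascii_title)
    return f"{index}_{cleaned.replace(' ', '_')}"
```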

Collection of X-Ray Data with IMDb ID Mapping

Filtering movies without X-Ray

Not all collected listings on Amazon Prime Video contain X-Ray data. We used browser automation tools to visit each movie’s Prime Video page and collect additional metadata on X-Ray availability. As shown in Fig. 4, the Prime Video page includes movie details such as title, description, and tags. The presence of an “X-Ray” tag indicates whether X-Ray data is available. After processing these pages with BeautifulSoup31, we excluded movies without the X-Ray tag, reducing the dataset to 3,823 entries.
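The availability check amounts to looking for the “X-Ray” badge text on the parsed page. The sketch below uses the standard-library HTMLParser for self-containment, whereas the pipeline itself used BeautifulSoup31; Amazon’s actual markup and class names are not reproduced here.

```python
# Illustrative check for the "X-Ray" badge in a saved Prime Video page.
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collect all non-empty text nodes from an HTML document."""
    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())

def has_xray_tag(page_html: str) -> bool:
    """True if any text node on the page is exactly the 'X-Ray' badge."""
    parser = _TextCollector()
    parser.feed(page_html)
    return "X-Ray" in parser.texts
```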

Fig. 4

Example of the relevant portion of the Amazon Prime Video page for the movie Philomena.

Retrieving X-Ray data and metadata

We collected details for each entry in our list by intercepting two network requests triggered when clicking the play button on each movie. First, we intercepted the request for “PlaybackResource,” whose JSON response contained metadata, including the title, entity type (movies), runtime, synopsis, ratings, subtitle types (subtitle, narrative, or SDH), descriptions, image links, and links to subtitles (in multiple languages, where available). It also included information such as audio tracks and other metadata uniquely available on the Prime Video platform. Second, we intercepted the request for “X-Ray,” whose JSON response contained timestamp information for characters appearing in different scenes. This approach builds on previous work32. We saved the responses into two JSON files: PlaybackResources.json and Xray.json.
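The interception step can be sketched as a filter over selenium-wire’s `driver.requests` list. The URL marker strings below are assumptions based on the resource names above, and the browser-driving code (loading the page, clicking play) is omitted.

```python
# Sketch of filtering intercepted requests for the two resources of interest.
import json

def extract_responses(requests, markers=("PlaybackResource", "Xray")):
    """Collect the JSON bodies of intercepted requests whose URL contains
    one of the marker strings (e.g., from selenium-wire's driver.requests).
    Returns a dict mapping marker -> parsed JSON response."""
    found = {}
    for req in requests:
        if req.response is None:  # request never completed
            continue
        for marker in markers:
            if marker in req.url:
                found[marker] = json.loads(req.response.body)
    return found
```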

Processing X-Ray data

We extracted metadata from each movie’s PlaybackResource file and compiled it into a single file. We then parsed the X-Ray files into three structured files for each movie, as detailed in Table 1. These X-Ray-derived files include scene boundary timestamps (in scenes.csv) and scene-level character appearance information (in people_in_scenes.csv), providing precise information on which characters appear on screen. Not all X-Ray files included scene-wise cast appearance data, so we removed entries with missing or incomplete scene-level cast data, leaving 3,570 entries.

Table 1 Parsed X-Ray data files.

Note that we compiled subtitle files (.ttml2) for all movies in our metadata list that included them (see Fig. 2). These subtitle files provide start and end timestamps for contiguous subtitle text shown on-screen, allowing researchers to identify the precise time periods when dialogue occurs. When the timing is sufficiently close to or falls within the scene boundaries defined by X-Ray data, these dialogue segments can be associated with the corresponding scene context (see the Jupyter notebook included in the dataset for examples of this).
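Associating a subtitle line with its scene then reduces to an interval-containment check, sketched below with timestamps in seconds; the actual column names and formats should be taken from Table 1 and the example notebook.

```python
# Simplified scene lookup by timestamp containment.
def scene_for_line(line_start, line_end, scenes):
    """Return the index of the first scene whose [start, end] interval
    contains the subtitle line, or None if no scene contains it.
    `scenes` is a list of (start, end) timestamp pairs in seconds."""
    for i, (start, end) in enumerate(scenes):
        if start <= line_start and line_end <= end:
            return i
    return None
```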

Mapping IMDb IDs

Linking each retrieved X-Ray movie to its corresponding IMDb ID provides a way to enrich our data with background cast, alternative titles, user ratings, crew and cast information, awards, nominations, quotes, and more. Unfortunately, the PlaybackResource information does not include IMDb IDs, so we devised an algorithm inspired by Ramakrishna et al.12 to match movies to their IMDb entries. Since X-Ray data already provides accurate information about the actors, along with their IMDb profiles, we used this data to assist in matching.

Accordingly, we used the cinemagoer Python package33 to retrieve data from IMDb. We initiated the search using the movie title from its PlaybackResource file, which returned several matches from IMDb. We then examined the top 5 cast members from the top 5 movie matches. Previous studies have shown that IMDb cast order reflects the importance of cast members15,16, and matching key cast members can clearly distinguish between movies. Additionally, since the X-Ray data is sourced directly from IMDb34, matching the top 5 cast proved effective. If at least one actor from this top-5 list matched an actor from the X-Ray data, we recorded the IMDb ID as a match. This process resulted in 3,129 successful matches out of 3,570 movies, leaving 441 unmatched.
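The matching rule itself can be sketched independently of the IMDb lookups (which we performed with cinemagoer’s search facilities); here `candidates` stands in for the pre-fetched top search results, each paired with its ordered cast ID list.

```python
# Sketch of the cast-overlap matching rule: accept a candidate IMDb entry
# if any of its top-5 billed cast IDs appears among the X-Ray cast IDs.
def match_imdb_id(xray_cast_ids, candidates):
    """`candidates` is a list of (imdb_movie_id, ordered_cast_ids) pairs
    for the top search results. Returns the first matching movie ID,
    or None if no candidate's top-5 cast overlaps the X-Ray cast."""
    xray = set(xray_cast_ids)
    for movie_id, cast_ids in candidates[:5]:   # top 5 search results
        if xray & set(cast_ids[:5]):            # top 5 billed cast members
            return movie_id
    return None
```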

We manually mapped the 441 unmatched movies to their correct IMDb IDs, and the procedure is detailed in the Technical Validation section. To verify the overall accuracy of our initial, automated IMDb ID matching process, we randomly sampled 120 movies from the 3,129 matches and found only one matching error, resulting in an error rate of 0.83%. The code for this automated matching process is available on GitHub at https://github.com/safal312/xray-collector. The final dataset, including all the manually matched entries, comprises 3,265 movies and represents a thoroughly validated and cleaned set of matches. All identified sources of error were corrected, resulting in a dataset that is ready for use35 (Fig. 2).

Data Records

The dataset is available on Zenodo at https://doi.org/10.5281/zenodo.1765973435. The processed and cross-referenced X-Ray data files described in the Methods section are outlined in Fig. 2. The parsed X-Ray data, metadata, and cast data files are provided in .csv or .txt format and are detailed in Tables 1, 2, and 3. A Jupyter notebook containing examples of how to query the dataset is included in the data repository.

Table 2 Fields in X-Ray movie metadata file metadata_with_subtitles_tmdb.csv, one of the top-level files in the schema shown in Fig. 2.
Table 3 Fields in cast metadata file final_all_cast_with_duplicates.csv, one of the top-level files in the schema shown in Fig. 2.

Technical Validation

Validation and Resolution of Errors

We conducted technical validation for two sets of movies: (1) those successfully matched with their corresponding IMDb IDs and (2) those that initially remained unmatched.

For the first set of 3,129 movies successfully matched to IMDb IDs, we assessed the accuracy of the matching process as described in the “Mapping IMDb IDs” section. We randomly sampled 120 movies from this set and found only one matching error, resulting in an error rate of 0.83%, which suggests high accuracy and reliability for our automated matching process.

For the 441 unmatched movies, we manually mapped each movie to the correct IMDb ID, aiming to maximize accuracy. In this process, we also identified the reasons for the initial matching failures. First, using the movie title and additional metadata (e.g., description, release year, and cast), we retrieved IMDb IDs for each entry, regardless of whether they were classified strictly as movies. During this manual verification, we identified several non-movie entries (e.g., stand-up comedy specials and anthologies) mistakenly included due to Amazon Prime Video’s classification errors. These entries (N = 20), which do not follow traditional narrative movie structures, were excluded from the final dataset.

Further investigation of the 441 unmatched entries revealed that some PlaybackResource and X-Ray files did not correspond to the listed movie. In a subset of cases (28 PlaybackResource files and 94 X-Ray files), the Amazon US movie pages returned erroneous files that were duplicates of other movie listings. To identify these cases, we compared the unique movie ID generated at the start of our data pipeline with the ID generated from the retrieved PlaybackResource and X-Ray files. For the movies with erroneous PlaybackResource files, IMDb matching was impossible because the title used in the search did not align with the actual cast list, so we manually corrected the metadata for these entries. For the erroneous X-Ray files, IMDb matching was not feasible due to the lack of reliable cast data, so we removed these entries entirely.
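This consistency check can be illustrated as a comparison of title-derived identifiers after stripping the batch-index prefix; the snippet below is a simplified sketch of the comparison described above, reusing the identifier format from the Methods section.

```python
# Simplified sketch: flag retrieved files that belong to a different listing.
def same_listing(listing_id, file_id):
    """True if two identifiers (e.g. '1265_12_Days_with_God') refer to the
    same title once the numeric batch-index prefix is removed; a mismatch
    indicates an erroneously duplicated PlaybackResource or X-Ray file."""
    strip_prefix = lambda s: s.split("_", 1)[1] if "_" in s else s
    return strip_prefix(listing_id) == strip_prefix(file_id)
```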

Finally, we applied these insights from the unmatched cohort to the larger, automatically matched set to identify any remaining discrepancies in X-Ray and PlaybackResource files. After either correcting or removing problematic entries, we compiled a clean and complete final dataset of 3,265 movies.

Coverage Assessment

The applicability of this dataset depends on its coverage and representativeness of the full range of films produced throughout history. We evaluate this coverage in two ways: by comparing our dataset against IMDb lists of the 100 most popular movies per decade and against lists of Academy Award-winning films by year, sourced from the Academy Awards official website (Table 4). Coverage is sparse in the earlier decades, with only select years, such as 1931, 1932, 1936, and 1939, represented in the 1930s. However, from the 1950s onward, our dataset includes movies from each year, showing progressive improvement in coverage over the decades.

Table 4 Decadal coverage of the Augmented Amazon X-Ray Dataset for the top-100 most popular movies and Academy Awards.

Although our dataset's coverage is moderate relative to these benchmarks, this presents a valuable analytical opportunity: the ability to study films that may not be well-remembered or acclaimed as the best of their time. By sampling movies based on production rather than popularity, our dataset mitigates survival bias, providing a more representative selection of films. This broad coverage, combined with the dataset's unique scene-level breakdown, is a resource not previously available in film studies.

To further enhance coverage, especially for recent decades, we plan to implement periodic updates. These updates will involve collecting additional data as it becomes available and refining our collection methods to capture more recent releases. The code for this process is available for others to use as well. Additionally, exploring partnerships with movie databases and production companies could provide better access to recent, high-quality metadata. This proactive approach will help ensure that our dataset remains a dynamic and valuable resource for cultural analysis and film studies.