Table 1 Parsed X-Ray data files.

From: Scene-level movie data from Amazon X-Ray in the US market combined with IMDb

people.csv: Contains a list of actors, corresponding characters, and IMDb name IDs for actors.

| Field | Description |
| --- | --- |
| name_id | Name ID of the person from IMDb. The URL corresponding to a name_id is https://www.imdb.com/name/<name_id>. Example: https://www.imdb.com/name/nm0451321 for nm0451321. |
| person | Name of the person. |
| character | Name of the character in the movie. |

scenes.csv: Contains a list of scenes, along with the start and end timestamps of each scene.

| Field | Description |
| --- | --- |
| scene | Scene number. |
| start | Scene start timestamp in milliseconds. |
| end | Scene end timestamp in milliseconds. |

people_in_scenes.csv: Contains a list of scenes with IMDb IDs of people appearing in the scene, along with start and end timestamps.

| Field | Description |
| --- | --- |
| scene | Scene number. |
| start | Scene start timestamp in milliseconds. |
| end | Scene end timestamp in milliseconds. |
| name_id | Name ID of person from IMDb. |
| timestamp | Timestamp of the character’s first appearance in the scene, in milliseconds. |

  1. These files are provided for each film under the xrays directory, following the schema shown in Fig. 2. Note that the scene timestamps in these files are not known to be aligned to the subtitle timestamps contained in the .ttml2 files (Fig. 2). However, the example Jupyter notebook demonstrates how subtitles can be assigned to scenes based on temporal overlap. The majority of subtitle segments fall fully within scene boundaries, indicating a strong degree of alignment between the two timestamp sources. Perfect correspondence is not expected, as spoken dialogue may cross visual scene boundaries.
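
The overlap-based assignment described in the note can be illustrated with a minimal Python sketch. This is not the code from the example Jupyter notebook. It assumes that scene rows (scene, start, end, as in scenes.csv) and subtitle segments have already been loaded and that subtitle times have been converted to milliseconds. The names Scene, Subtitle, overlap_ms, and assign_subtitle are hypothetical, and the rule of picking the scene with the largest overlap is one plausible choice rather than the authors' documented method.

```python
from typing import List, NamedTuple, Optional, Tuple


class Scene(NamedTuple):
    scene: int  # scene number, as in scenes.csv
    start: int  # scene start, milliseconds
    end: int    # scene end, milliseconds


class Subtitle(NamedTuple):
    start: int  # subtitle start, milliseconds (assumed already converted)
    end: int    # subtitle end, milliseconds
    text: str


def overlap_ms(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the intersection of two time intervals, in milliseconds."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def assign_subtitle(sub: Subtitle, scenes: List[Scene]) -> Tuple[Optional[int], bool]:
    """Return (scene number with the largest overlap, fully-inside flag).

    Returns (None, False) when the subtitle overlaps no scene at all.
    """
    best: Optional[Scene] = None
    best_overlap = 0
    for sc in scenes:
        ov = overlap_ms(sub.start, sub.end, sc.start, sc.end)
        if ov > best_overlap:
            best, best_overlap = sc, ov
    if best is None:
        return None, False
    fully_inside = best.start <= sub.start and sub.end <= best.end
    return best.scene, fully_inside


# Illustrative values only; these are not taken from the dataset.
scenes = [Scene(1, 0, 60_000), Scene(2, 60_000, 150_000)]
subtitles = [
    Subtitle(5_000, 8_000, "Inside scene 1."),
    Subtitle(58_000, 63_000, "Crosses the boundary between scenes 1 and 2."),
    Subtitle(200_000, 203_000, "After the last scene."),
]

assignments = [assign_subtitle(s, scenes) for s in subtitles]
for sub, (scene_no, inside) in zip(subtitles, assignments):
    print(sub.text, "->", scene_no, "(fully inside)" if inside else "(partial or none)")
```

The fully-inside flag makes it possible to count how many subtitle segments fall entirely within a scene, which is the kind of check the note refers to. Segments that cross a boundary are still assigned to the scene they overlap most.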