Background & Summary

Social media play a significant role in the distribution of information1,2,3, serving as essential sources of information for millions of users, especially in critical contexts such as elections4,5 and public health crises6,7,8,9. The content shared on social media originates not only from the users themselves but also from a wide array of external sources. In particular, posting and re-sharing links to external websites (URLs) are key mechanisms for disseminating web content on social media10,11,12,13. For some websites, such links are a crucial source of traffic and a crucial way for users to access their content14,15,16,17. For instance, by analyzing web traffic data, Chen et al. demonstrate that social media are responsible for a substantial portion of referred visits to thegatewaypundit.com, a popular far-right news website18.

The significance of social media in information distribution has attracted considerable research attention in recent years. However, our understanding of the role demographic factors play in this process remains limited19,20, even though their importance in online discourse has been discussed in prior work21,22,23,24. Typically, the data provided by platforms to researchers lack user-level demographic information25,26,27. When researchers rely on user donations or surveys to collect demographic data, the sample size is often insufficient to provide meaningful aggregate insights about content-sharing patterns, particularly at the domain level. These challenges in data collection have created a significant gap in our ability to comprehensively analyze the interplay between demographic factors and online political discourse.

To bridge this gap, we introduce a novel dataset, DomainDemo, which quantifies domain-sharing activities across diverse demographic groups on Twitter. Although Twitter was rebranded as X in 2023, we will refer to the platform as Twitter in this article since our dataset predates the rebranding. Our data encompasses demographic details such as age, gender, race, political affiliation, and geolocation (U.S. state). These domain-sharing events are derived from a comprehensive dataset of over 1.5 million U.S.-based Twitter users matched with their voter registration records. Spanning 132 months (11 years) from May 2011 to April 2022, the dataset is organized in monthly intervals, enabling the analysis of temporal trends. Our released datasets allow researchers to investigate the association between demographic characteristics and the sharing patterns of diverse information sources, ranging from mainstream news websites to potential sources of misinformation.

We release two versions of the domain-sharing statistics dataset: DomainDemo-multivariate and DomainDemo-univariate. DomainDemo-multivariate splits the statistics into buckets defined by age, gender, race, political affiliation, state, and time altogether. In each bucket, we provide the number of shares, the number of unique users sharing the domain, and the Gini index28 of the sharing count among users. DomainDemo-univariate includes five universes: state, race, gender, age, and political affiliation. Within each universe, we provide the number of shares, the number of unique users sharing the domains, and the Gini index of the sharing count among users for each category (e.g., age groups in the age universe). In addition to the domain-level statistics, we also provide the distribution of the population in different demographic buckets for both DomainDemo-multivariate and DomainDemo-univariate. These population-level distributions can serve as baselines.

DomainDemo-multivariate is similar in format to the Facebook URL dataset shared through Social Science One29,30, which includes the number of views, shares, and reactions to various URLs across different demographic groups. However, there are some key differences. DomainDemo-multivariate focuses solely on a sample of Twitter users who are registered voters in the U.S., while the Facebook URL dataset includes data from all eligible users on the platform. Additionally, DomainDemo-multivariate only includes the number of shares and unique users at the domain level and does not incorporate any noise. However, our dataset provides more detailed demographic information compared to the Facebook URL dataset. Specifically, while the Facebook URL dataset only includes country-level geographic data, our dataset contains state-level geolocation information and includes race information.

In addition to the count statistics, we introduce five derived metrics that quantify how the demographics (age, gender, race, political affiliation, and geolocation) of users sharing a particular domain differ from those of the baseline. These metrics have specific interpretations that are useful for various research questions. For instance, the geolocation of the users allows us to measure the domains’ localness, i.e., the extent to which the sharing of a particular domain is geographically localized. Together with the user-sharing behavior data, the localness metric enables researchers to quantify the changing landscape of the local news industry. As a fundamental component of the U.S. democratic process, local news is uniquely positioned to report on local affairs and elections31,32, but it has been declining for years33,34. Similarly, the user party affiliation in DomainDemo allows us to measure the audience partisanship of different domains. This metric can serve as a proxy for the political leaning of the domains, which is crucial for understanding online political discourse35,36. Our derived metrics demonstrate strong alignment with established measures of localness and political leaning while significantly expanding coverage to over 129,000 domains, more than ten times the number of domains in existing datasets. The metrics also uncover subtle variations in sharing patterns that previous binary or one-dimensional categorization schemes could not capture.

Due to the difficulty of obtaining data from social media, especially Twitter, in the post-API era37, replicating our efforts is challenging. Even if access to Twitter data were to become available in the future, the platform itself has undergone significant changes. These factors make our dataset a unique and valuable contribution to the research community, as it provides a comprehensive view of domain-sharing behaviors across an 11-year period.

Methods

In this section, we describe how we create our dataset and provide case studies to interpret the data.

Twitter Panel

Our dataset is based on a panel of over 1.5 million registered U.S. voters on Twitter, created by our team in previous work. A pilot version of the panel was first used by Grinberg et al.38, then the panel was expanded considerably by Shugars et al.20 and validated by Hughes et al.39 To create the panel, we start with the Twitter Decahose, a 10% random sample of all tweets, and identify 290 million accounts that post content between January 2014 and March 2017. We extract the names of the users, either from the Twitter handles or display names, and their location from the account profiles. This information is then matched against voter data provided by TargetSmart in October 2017, covering all 50 U.S. states and the District of Columbia. We compare the full name of each person in the voter file with the names of the Twitter accounts. If the full name has fewer than 10 exact matches, we then examine the location of the Twitter accounts. A Twitter account and voter record pair is accepted only if the person is the only one with that name in the specified city or state-level geographic area in both datasets. This reliance on full names and disclosed locations helps to eliminate many fake, automated (bot), and organizational accounts.

The data collection and matching of the Twitter panel were approved by the Northeastern University Institutional Review Board (protocol number: 17-12-13). Following the best practices outlined by Hemphill et al.40, we employ data aggregation, anonymization, and access control measures to protect user privacy and minimize the risk of re-identification in our Twitter panel.

Matching to voter file records provides access to the geolocation, year of birth, gender, race/ethnicity, and partisanship of the users in the panel. We use state-level geolocation data from the voter files as our geographic unit of analysis. While we have access to more detailed location information, such as county-level data, releasing this information would risk re-identifying the users due to the low population density of many U.S. counties. State-level granularity offers a good compromise between the usefulness of the data and the privacy of the users. Using the year of birth, we determine the age of users at the time of sharing events and categorize them into the following age groups: “<18,” “18-29,” “30-49,” “50-64,” and “65+.” The category for users younger than 18 years old is included because some states allow 17-year-olds to pre-register to vote and some users might be younger than 18 at the time of sharing events. Gender is a binary measure provided by TargetSmart, which does not capture gender identities beyond the binary framework41. Race/ethnicity information is inferred by TargetSmart for most states and is categorized as “African-American,” “Asian,” “Hispanic,” and “Caucasian.” Other race categories with limited representation in the dataset are aggregated into a single “Other” category to minimize re-identification risks.
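The age bucketing described above can be sketched as follows. This is a minimal illustration, not the released code: the function name `age_group` is hypothetical, and computing age as a simple difference of calendar years is an assumption, since the voter file provides only the year of birth.

```python
from typing import Optional


def age_group(birth_year: Optional[int], share_year: int) -> str:
    """Map a user's approximate age at sharing time to an age bucket."""
    if birth_year is None:
        # Missing values in all dimensions are coded as "Unknown".
        return "Unknown"
    age = share_year - birth_year  # assumption: year-level resolution only
    if age < 18:
        return "<18"
    if age <= 29:
        return "18-29"
    if age <= 49:
        return "30-49"
    if age <= 64:
        return "50-64"
    return "65+"
```

Because age is evaluated at the time of each sharing event, the same user can fall into different buckets in different months of the 11-year window.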

TargetSmart provides two measures of partisanship: party registration and inferred partisanship. Party registration information in voter files is self-reported and aligns well with survey self-reporting42. However, this information is unavailable for 20 states (AL, AR, GA, HI, IL, IN, MI, MN, MO, MS, MT, ND, OH, SC, TN, TX, VA, VT, WA, and WI) in the TargetSmart data, which account for 42.7% of the Twitter users in our panel. When categorizing party registration information, we treat values for users in the 20 aforementioned states as missing. For the other 30 states and the District of Columbia, users registered as “Democrat” and “Republican” are coded accordingly. Due to variations in the classification of independent registered voters by state, we group individuals listed as “Independent,” “No party,” or “Unaffiliated” into a single “Independent” category. Members of minor parties, such as the Green Party and the Libertarian Party, are categorized as “Other.”
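A minimal sketch of this coding scheme is given below. The function name and the exact raw label strings are assumptions made for illustration; the actual labels in the voter file may differ.

```python
# The 20 states for which party registration is unavailable in the voter data.
NO_PARTYREG_STATES = frozenset({
    "AL", "AR", "GA", "HI", "IL", "IN", "MI", "MN", "MO", "MS",
    "MT", "ND", "OH", "SC", "TN", "TX", "VA", "VT", "WA", "WI",
})

# Labels grouped into a single "Independent" category (assumed spellings).
INDEPENDENT_LABELS = frozenset({"Independent", "No party", "Unaffiliated"})


def code_party_registration(state, raw_label):
    """Return the coded party registration for one voter record."""
    if state in NO_PARTYREG_STATES or not raw_label:
        return "Unknown"  # treated as missing
    if raw_label in ("Democrat", "Republican"):
        return raw_label
    if raw_label in INDEPENDENT_LABELS:
        return "Independent"
    return "Other"  # minor parties such as Green or Libertarian
```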

Based on party registration and other indicators, TargetSmart infers the probability of all individuals in all 50 states and the District of Columbia voting Democrat. We categorize individuals as Republican (0-0.35), Independent (0.35-0.65), and Democrat (0.65-1) using TargetSmart’s recommended thresholds to generate the inferred partisanship. For our data release and analysis, we use inferred partisanship as the primary measure since it covers all users (referred to as “party” hereafter). Additionally, we provide party registration information as a secondary measure (referred to as “party registration” or “partyreg” hereafter), as it conveys a slightly different signal and offers useful insights for certain analyses.
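The thresholding can be illustrated with the short sketch below. The function name is hypothetical, and the treatment of scores that fall exactly on 0.35 or 0.65 is an assumption, since the text does not specify which side of each boundary is inclusive.

```python
def inferred_party(p_democrat):
    """Categorize a probability of voting Democrat into an inferred party."""
    if p_democrat is None:
        return "Unknown"
    if p_democrat < 0.35:
        return "Republican"
    if p_democrat <= 0.65:  # assumption: both boundaries count as Independent
        return "Independent"
    return "Democrat"
```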

Missing values in all dimensions are coded as “Unknown.”

Domain-sharing Statistics

We collect posts from users in the panel spanning from May 2011 to April 2022. We extract the links shared by these users, expand the shortened links when possible, and identify the corresponding domains (e.g., nytimes.com for The New York Times). This process allows us to determine which user shares what domains and when. Sharing events, as defined in our study, include posting links in original tweets and retweeting or quoting tweets containing links. To reduce noise and the risks of re-identification, we include only domains that are shared by at least 50 unique users throughout the entire period.
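The domain identification and the 50-user threshold can be sketched as follows. This is a simplified illustration under stated assumptions: links are assumed to be already expanded, and the hypothetical helper `extract_domain` only strips a leading "www." rather than reducing subdomains to a registered domain, which the actual pipeline may handle differently.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def extract_domain(url):
    """Return the lowercased host of an expanded URL without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def filter_domains(events, min_users=50):
    """Keep domains shared by at least `min_users` unique users.

    `events` is an iterable of (user_id, url) pairs.
    """
    users_by_domain = defaultdict(set)
    for user_id, url in events:
        domain = extract_domain(url)
        if domain:
            users_by_domain[domain].add(user_id)
    return {d for d, users in users_by_domain.items() if len(users) >= min_users}
```

For example, `extract_domain("https://www.nytimes.com/section/politics")` yields `nytimes.com`.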

We integrate the demographic information of users with their domain-sharing records to construct a comprehensive table. This domain-sharing event table includes the following columns: user_id, domain, age, gender, race, party, party registration, state, and year-month. Each row corresponds to a single sharing event, with users who share the same domains multiple times contributing multiple rows. Due to the presence of user identifiers, we cannot release this detailed table. Instead, we provide aggregate statistics derived from this table including DomainDemo-multivariate and DomainDemo-univariate.

DomainDemo-multivariate describes the domain sharing behavior of users across different demographic dimensions simultaneously. It includes several variants designed to facilitate different types of research analyses. The most granular variant is the monthly distribution data at the domain level, which is produced by grouping the domain-sharing event table by domain, age, gender, race, party (party registration is excluded here), state, and year-month. In each bucket, we calculate the following statistics: the number of shares, the number of unique users who share the domain, and the Gini index of the sharing count across users. Formally, the Gini index G for a domain is calculated as:

$$G=\frac{1}{2N^{2}\bar{x}}\sum_{i=1}^{N}\sum_{j=1}^{N}|x_{i}-x_{j}|, \tag{1}$$

where N is the number of users who share the domain, \({x}_{i}\) is the number of shares by the i-th user, and \(\bar{x}={\sum }_{i=1}^{N}{x}_{i}/N\) is the average number of shares per user. G ranges from 0 to 1, with 0 indicating equal sharing and values close to 1 indicating that a few users share the domain disproportionately. Note that we set the Gini index to 1 for domains shared by only one user in the bucket. We include the Gini index to help researchers understand the inequality of sharing events across users without releasing detailed information about these users due to privacy concerns.
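As an illustration, Eq. (1) can be evaluated directly from the per-user share counts in a bucket. The sketch below is a plain quadratic-time implementation with a hypothetical function name; it applies the single-user convention stated above, since the formula alone would yield 0 when N = 1.

```python
def gini(share_counts):
    """Gini index of per-user share counts, following Eq. (1)."""
    n = len(share_counts)
    if n == 0:
        raise ValueError("at least one user is required")
    if n == 1:
        return 1.0  # convention from the text for single-user buckets
    mean = sum(share_counts) / n
    total_abs_diff = sum(abs(xi - xj) for xi in share_counts for xj in share_counts)
    return total_abs_diff / (2 * n * n * mean)
```

For four users who each share a domain five times, the index is 0. For counts of 1, 1, 1, and 7, the double sum is 36 and the denominator is 2 × 16 × 2.5 = 80, giving 0.45.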

In addition to the demographic distribution of users sharing each domain, it is useful to understand the distribution of the whole population in many cases. Therefore, we also include a “baseline” variant that groups the domain-sharing event table by age, gender, race, party, state, and year-month. In each bucket, we calculate the same statistics as the monthly distribution data. We also provide the average number of unique domains shared by users in each demographic bucket and the corresponding standard deviation. On top of the monthly data, we further provide the distribution and baseline data covering the whole time period. In total, DomainDemo-multivariate includes four variants.

DomainDemo-univariate is generated by aggregating the sharing events across all demographic dimensions except for the one of interest. For example, the state univariate data (referred to as the “state universe”) is produced by aggregating the sharing events across all age, gender, race, and party dimensions, resulting in the statistics of sharing events within different states. In each bucket of DomainDemo-univariate, we provide three statistics: the number of shares, the number of unique users who share the domain, and the Gini index of the sharing count across users. And similar to DomainDemo-multivariate, DomainDemo-univariate includes four variants: monthly distribution, monthly baseline, all-time distribution, and all-time baseline.
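The grouping logic behind both versions can be sketched with a single helper. This is an illustration only: the function name `aggregate` and the dictionary-based event representation are assumptions, and the Gini index is omitted for brevity. Passing all demographic columns as `dimensions` corresponds to the multivariate version, passing a single column corresponds to one universe, and dropping `"domain"` from the key yields a baseline.

```python
from collections import Counter, defaultdict


def aggregate(events, dimensions):
    """Count shares and unique users per bucket defined by `dimensions`.

    `events` is an iterable of dicts, one per sharing event, each holding
    a 'user_id' plus the columns named in `dimensions`.
    """
    per_bucket = defaultdict(Counter)
    for event in events:
        key = tuple(event[d] for d in dimensions)
        per_bucket[key][event["user_id"]] += 1  # shares per user in the bucket
    return {
        key: {"shares": sum(counts.values()), "users": len(counts)}
        for key, counts in per_bucket.items()
    }
```

For instance, `aggregate(events, ("domain", "state"))` gives all-time distribution counts for the state universe, while `aggregate(events, ("state",))` gives the matching baseline counts.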

The detailed data schema of the released versions of DomainDemo-multivariate and DomainDemo-univariate can be found in the Data Records section.

Derived Metrics for Domains

Based on the sharing behavior of users in different demographic groups, we derive additional metrics that quantify different aspects of the domains.

Domain Localness Metric

Local news is a fundamental component of the U.S. democratic process. It is uniquely positioned to report on local affairs and elections, enabling citizens to engage in local political activities and hold their elected officials accountable31,32. However, the landscape of local journalism is undergoing significant changes, marked by a notable decline in local agencies, often referred to as the emergence of the “news desert”33,34. This trend threatens the vibrancy of local political participation and raises concerns about the overall health of democracy43,44.

To empirically understand the dynamics of news consumption and related phenomena, it is essential to reliably categorize news outlets as either local or national. Despite extensive research efforts, a universally accepted definition of local news organizations remains elusive45. Many studies on local news often fail to provide a clear definition or specific criteria for classification46, complicating efforts to expand the scope of classification and hindering the replication of analyses.

Here, we leverage the state universe data from DomainDemo-univariate to derive a data-driven metric that quantifies the “localness” of news domains. This is achieved by calculating the deviation of the user distribution of each domain in different states from the baseline distribution, both of which are provided in DomainDemo-univariate.

For a formal definition, let \(C_{s}\) represent the number of unique users in state s across all domains, and \(F_{s}\) represent the corresponding frequency, where \(F_{s}=C_{s}/\sum_{s}C_{s}\). \(F_{s}\) characterizes the baseline distribution of the whole population. For a domain δ, we calculate the user frequency in state s, \(F_{\delta,s}=C_{\delta,s}/\sum_{s}C_{\delta,s}\), where \(C_{\delta,s}\) represents the number of unique users in state s who share the domain δ. For domains shared by diverse users, the observed distribution \(F_{\delta,s}\) should closely align with the baseline distribution \(F_{s}\) across different states. However, deviations from the baseline distribution are expected for domains with a more concentrated audience.

Following this intuition, we quantify the deviation of a domain, denoted by \({{\mathcal{L}}}_{\delta }\), using the Kullback-Leibler (KL) divergence between \(F_{\delta,s}\) and \(F_{s}\):

$$\mathcal{L}_{\delta}=D^{(KL)}(F_{\delta,s}\,\|\,F_{s})=\sum_{s}F_{\delta,s}\log_{2}\frac{F_{\delta,s}}{F_{s}}, \tag{2}$$

where \({F}_{\delta ,s}{\log }_{2}({F}_{\delta ,s}/{F}_{s})\) measures the discrepancies between the observed sharing patterns and the baseline distribution of domain δ in state s. \({{\mathcal{L}}}_{\delta }\) is a non-negative value that is minimized at zero when \(F_{\delta,s}\) and \(F_{s}\) are identical. In other words, national news domains should have \({{\mathcal{L}}}_{\delta }\) close to zero, while local news domains should have larger \({{\mathcal{L}}}_{\delta }\) values. In the Technical Validation section, we show that \({{\mathcal{L}}}_{\delta }\) is a good proxy for the localness of news domains.

A limitation of \({{\mathcal{L}}}_{\delta }\) is that it can only indicate the deviation of a domain’s sharing pattern from the baseline distribution. To reveal which states are over-represented or under-represented, one needs to further inspect the values of \(F_{\delta,s}\) and \(F_{s}\).
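A minimal sketch of the localness computation in Eq. (2) is shown below. The function name and the dictionary inputs (unique-user counts keyed by state) are assumptions for illustration. The function also returns the per-state terms of the sum, whose signs indicate which states are over-represented (positive) or under-represented (negative) relative to the baseline.

```python
import math


def localness(domain_users, baseline_users):
    """KL divergence (base 2) between a domain's state distribution and the baseline.

    Both arguments map a state code to a count of unique users. Every state
    present in `domain_users` must also be present in `baseline_users`.
    """
    total_domain = sum(domain_users.values())
    total_baseline = sum(baseline_users.values())
    contributions = {}
    for state, count in domain_users.items():
        if count == 0:
            continue  # a zero-frequency state contributes nothing to the sum
        f_domain = count / total_domain
        f_baseline = baseline_users[state] / total_baseline
        contributions[state] = f_domain * math.log2(f_domain / f_baseline)
    return sum(contributions.values()), contributions
```

As a sanity check, a domain whose users are all in one of two equally populated states has a frequency of 1 against a baseline of 0.5, so its score is log2(2) = 1, while a domain that mirrors the baseline scores 0.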

Domain Audience Partisanship Metric

A healthy democratic society requires the public to receive accurate and unbiased news and civic information, especially during election seasons47. However, the presence of partisan online news and phenomena such as echo chambers and filter bubbles remain concerns35,48. To address these issues, researchers have investigated the political biases embedded in online platforms, including search engines like Google49 and social media platforms like Facebook35, Twitter50, and YouTube51. Other relevant research has focused on how users interact with different information sources and their consumption patterns36,52,53. Such analyses generally involve assessing the political leanings of numerous domains, but such datasets have been rare and often lack comprehensive coverage (see discussion in the Technical Validation section).

Here, we employ the party (and party registration) universe data from DomainDemo-univariate to create data-driven metrics that assess the audience partisanship of domains. We focus on Democrat and Republican and exclude Independent users. The number of users from each party allows us to quantify the partisanship of the audience for each domain. It is important to note that our audience-based metrics do not evaluate the content characteristics of these domains. However, previous research indicates that audience characteristics are closely associated with the leanings of these domains54,55.

Formally, the audience partisanship score \({{\mathcal{P}}}_{\delta }\) of a domain δ is calculated as follows:

$$\mathcal{P}_{\delta}=\frac{\frac{C_{\delta,r}}{C_{r}}-\frac{C_{\delta,d}}{C_{d}}}{\frac{C_{\delta,r}}{C_{r}}+\frac{C_{\delta,d}}{C_{d}}}, \tag{3}$$

where \(C_{\delta,r}\) and \(C_{\delta,d}\) (available in the distribution variant of DomainDemo-univariate) represent the number of unique users from the Republican and Democrat parties who share the domain δ, respectively. \(C_{r}\) and \(C_{d}\) (available in the baseline variant of DomainDemo-univariate) represent the total number of unique users in the Republican and Democrat parties who share any domain, respectively. Since a user can share multiple domains, we have \(C_{\delta,r}\le C_{r}\le \sum_{i}C_{i,r}\) and \(C_{\delta,d}\le C_{d}\le \sum_{i}C_{i,d}\). \({{\mathcal{P}}}_{\delta }\) is a continuous value between −1 and +1, where −1 means the domain δ is exclusively shared by Democratic users and +1 means δ is exclusively shared by Republican users.
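Eq. (3) translates into a few lines of code. The sketch below uses a hypothetical function name and argument names that mirror the symbols in the text; it is an illustration, not the released implementation.

```python
def audience_partisanship(c_domain_rep, c_domain_dem, c_rep, c_dem):
    """Audience partisanship score of a domain, following Eq. (3).

    c_domain_rep, c_domain_dem: unique Republican / Democrat users sharing the domain.
    c_rep, c_dem: unique Republican / Democrat users sharing any domain (baseline).
    """
    rep_share = c_domain_rep / c_rep
    dem_share = c_domain_dem / c_dem
    if rep_share + dem_share == 0:
        raise ValueError("domain has no Republican or Democrat users")
    return (rep_share - dem_share) / (rep_share + dem_share)
```

Normalizing by the baseline counts means that a domain shared by 10 of 100 Republican users and 20 of 200 Democrat users scores 0, even though twice as many Democrat users shared it.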

Other Metrics

In addition to the localness and audience partisanship metrics, we release three more audience-based metrics: age deviation, race deviation, and gender leaning, to help researchers understand the sharing patterns conditioned on these demographic variables. The age and race deviation metrics are calculated using Eq. (2), where the state categories are replaced with the age or race categories. These metrics quantify how concentrated the audience is in certain age or race groups. The gender leaning metric is calculated using Eq. (3), where the party categories are replaced with the gender categories. Similar to the audience partisanship metric, the gender leaning metric is also a continuous value between −1 and +1, where −1 means the domain is exclusively shared by male users and +1 means the domain is exclusively shared by female users.

The calculation of these metrics is very flexible. While we primarily use the unique number of users in both Eqs. (2) and (3), our experiments demonstrate that using the number of shares produces highly correlated results. The metrics can also be calculated over different time periods. In this paper, we present results for the entire time period in the released version, case studies, and validation. To facilitate reproducibility and customization, we provide the code for calculating these metrics, allowing readers to modify the formulas according to their specific needs.

Our formulas in Eqs. (2) and (3) have a limitation: they rely solely on user distribution without accounting for variations in sharing patterns across demographic groups. For example, our analysis reveals that Democratic users share more diverse domains than Republican users, averaging 74.9 unique domains compared to 54.5 across the whole period. To enable researchers to develop more sophisticated metrics that incorporate these behavioral differences, we provide the mean number of unique domains shared by users in each demographic category and the corresponding standard deviations in the baseline variants of our datasets.

Case Studies

To help the readers interpret the derived metrics, we present the distributions for all domains in the dataset and provide case studies for three example domains in Fig. 1.

Fig. 1

Distributions of the derived metrics for all domains, along with detailed information for three domains: cnn.com, news9.com, and wickedlocal.com. The left column presents the joint distributions of our derived metrics and the unique number of users who share each domain across the entire dataset. The color coding represents the number of domains within each grid cell. The symbols indicate the locations of the three domains. The three columns on the right provide detailed distributions of users in various demographic dimensions for the three domains, respectively. Sub-figures (b–d) highlight the discrepancies between the observed user distribution and the baseline distribution across U.S. states for the three domains. The color coding indicates the \({F}_{\delta ,s}{\log }_{2}({F}_{\delta ,s}/{F}_{s})\) value in each state. The bar plots display both the baseline distribution and the distribution of users in each demographic category for the domain of interest. The baseline distribution represents the patterns observed across all domains in the dataset.

First, cnn.com, a national news outlet, has a user base closely aligned with the baseline. Consequently, its localness (\({{\mathcal{L}}}_{\delta }=0.013\)), race deviation (\({{\mathcal{L}}}_{\delta }=0.002\)), age deviation (\({{\mathcal{L}}}_{\delta }=0.049\)), and gender leaning (\({{\mathcal{P}}}_{\delta }=-0.033\)) scores are near zero. cnn.com is shared more often by Democratic users and less often by Republican users than the baseline, resulting in an audience partisanship score of −0.132.

The second example, news9.com, is a local news outlet in Oklahoma City, Oklahoma. It is shared by fewer users than cnn.com and has a localness score of 2.072, indicating a localized audience. Figure 1(c) shows that news9.com is over-represented in Oklahoma, confirming its local nature. Additionally, news9.com is shared more often by Republican users and less often by Democratic users compared to the baseline, leading to an audience partisanship score of 0.297. Its user base has race (\({{\mathcal{L}}}_{\delta }=0.051\)) and gender (\({{\mathcal{P}}}_{\delta }=0.026\)) profiles similar to the baseline, but the domain is shared more often by older users (\({{\mathcal{L}}}_{\delta }=0.086\)).

The third example is wickedlocal.com, a local news source in Boston, Massachusetts. Figure 1(d) indicates that it is over-represented in Massachusetts, consistent with its localness score of 2.221. Unlike news9.com, wickedlocal.com is shared more often by Democratic users (\({{\mathcal{P}}}_{\delta }=-0.387\)) and even more often by older users (\({{\mathcal{L}}}_{\delta }=0.246\)). Otherwise, the user base of wickedlocal.com has a similar profile in terms of race (\({{\mathcal{L}}}_{\delta }=0.071\)) and gender (\({{\mathcal{P}}}_{\delta }=0.060\)) to that of news9.com.

Due to space constraints, we can only provide three case studies here. We have released an interactive app to allow readers to explore the patterns of other domains in our dataset at domaindemo.info.

Data Records

Data Access

Our dataset is available on Zenodo (https://doi.org/10.5281/zenodo.15151613)56. Given the sensitive nature of the information about Twitter users in our datasets, we have implemented layered access controls. Since DomainDemo-multivariate and DomainDemo-univariate can potentially reveal the identities of Twitter users in the dataset when combined with other datasets, restrictions are imposed on the access to them. Specifically, researchers must complete an application process and sign a data use agreement that prohibits the identification of individual Twitter users and re-distribution of the data. Those interested in accessing these datasets can follow the instructions on the Zenodo page. The derived metrics of the domains, such as localness and audience partisanship scores, are made publicly available.

Data Format

Figure 2 illustrates the folder structure of the DomainDemo dataset. Due to file count limitations on the data hosting platform, each root folder is distributed as a compressed archive. After downloading and extracting these archives, users will find the subfolders and files organized according to the structure depicted in Fig. 2. All data files are provided in CSV format and compressed using the Gzip algorithm for efficient storage and transmission. Users with access can load and analyze them using their preferred programming language, such as Python or R. In the corresponding code repository (see details in the Code Availability section), we provide example scripts to work with the data. Below, we provide the schema of the tables in DomainDemo.
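As an illustration of working with the Gzip-compressed CSV files, the following sketch reads one file with the Python standard library only. The helper name `read_rows` is hypothetical, and any file path passed to it is a placeholder, since the actual file names follow the patterns shown in Fig. 2. The example scripts in the code repository may take a different approach.

```python
import csv
import gzip


def read_rows(source):
    """Yield each row of a Gzip-compressed CSV file as a dict of strings.

    `source` may be a file path or a binary file object.
    """
    with gzip.open(source, "rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)
```

All values are returned as strings, so count columns such as the number of shares or users need to be converted with `int` or `float` before analysis.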

Fig. 2

Folder structure of the DomainDemo dataset. We release the multivariate and univariate versions of the domain-sharing statistics. We provide both monthly and all-time variants of the data. Each variant contains distribution and baseline subdirectories. For clarity and due to space limitations, we use wildcards to represent patterns in file names rather than listing each file individually. Specifically, {demo} stands for different demographic factors (i.e., age, gender, race, state, party, and partyreg, six universes in total), and {YYYY-MM} denotes year-month combinations from May 2011 to April 2022 (132 months in total). We also release the derived metrics for domains based on the whole time period. The number of files represented by each pattern is indicated with comments on the right.

Domain-sharing Statistics

The table schema for DomainDemo-multivariate is provided in Table 1. Note that different variants of the dataset have slightly different columns and party registration is not included. Considering the monthly distribution variant, a row with the following values: domain=example.com, state=CA, race=Asian, gender=Female, age=30-49, party=Democrat, year_month=2018-12, shares=50, users=10, gini=0.1 indicates that the domain example.com was shared 50 times in December 2018 by 10 users who live in California, are Asian females aged between 30 and 49, and identify as Democrats. The Gini index of 0.1 indicates that the sharing count is almost evenly distributed across these users. The baseline data, on the other hand, do not have the domain column. A row with the following values: state=CA, race=Asian, gender=Female, age=30-49, party=Democrat, year_month=2018-12, shares=350, users=120, gini=0.5, domains_count_mean=2.2, domains_count_std=3.5 indicates that there were 350 sharing events in December 2018 by 120 users who live in California, are Asian females aged between 30 and 49, and identify as Democrats. These users shared an average of 2.2 unique domains with a standard deviation of 3.5.

Table 1 Schema of the DomainDemo-multivariate tables.

DomainDemo-univariate details the statistics of sharing events across various categories within individual demographic variables. It includes separate sets of statistics (universes) for state, race, gender, age, party, and party registration. The table schema for these statistics is provided in Table 2. A row in the state universe monthly distribution variant with the values: domain=example.com, year_month=2018-04, shares=3,000, users=250, gini=0.8, and state=NY indicates that the domain example.com was shared 3,000 times in April 2018 by 250 users in New York. The Gini index of 0.8 suggests that the sharing count is highly concentrated among a few users. A row in the baseline variant with the values: state=NY, year_month=2018-04, shares=451,000, users=18,250, gini=0.5, domains_count_mean=10.9, domains_count_std=10.2 indicates that there were 451,000 sharing events of any domains in April 2018 by 18,250 users in New York. These users shared an average of 10.9 unique domains with a standard deviation of 10.2.

Table 2 Schema of the DomainDemo-univariate tables.

Derived Metrics for Domains

In addition to the domain-sharing statistics, we also release the derived metrics for domains based on the data from the whole time period. Each of our released files contains two columns: domain and the corresponding metric value. Details of these metrics are provided in Table 3. Note that we offer two versions of the audience partisanship scores: one based on inferred user partisanship and one based on party registration. The audience partisanship scores based on party registration cover fewer domains than other metrics due to the missing values of the party registration information.

Table 3 Derived metrics for domains.

Technical Validation

In this section, we discuss the robustness of the demographic variables in our dataset. We then compare the localness and audience partisanship scores of news domains against existing classifications.

Demographic Variables

The demographic variables of Twitter panel users are the foundation of DomainDemo. While other information is self-reported, the partisanship score and race are inferred by TargetSmart. Although the inference algorithms remain proprietary, multiple lines of evidence support their reliability. For the partisanship score, we find that it highly correlates with the party registration information for individuals registered as Democrats or Republicans in the 30 states plus the District of Columbia where party registration information is available, with a 94% agreement rate. An independent evaluation from the Pew Research Center also suggests this inferred partisanship is reasonably accurate57. Moreover, our previous research validates these scores through their strong alignment with county-level election results20. Similarly, TargetSmart’s race estimates show consistency with different reference points, including self-reported race data from a Pew Research Center survey and results from a statistical inference method20.

The representativeness of the Twitter panel is another important aspect of our dataset. Our previous research has compared the panel with a representative sample of registered voters on Twitter created by the Pew Research Center39. The study shows substantial agreement between the two samples in general, despite some noteworthy differences. In particular, the panel exhibits an overrepresentation of Caucasian users while underrepresenting other racial groups, particularly Hispanic and Asian populations. Additionally, the panel contains a slightly higher share of female users and younger individuals compared to the survey samples.

Here, we further compare the demographic composition of the Twitter panel with all registered voters in the TargetSmart voter file, as illustrated in Fig. 3. Notably, Twitter panel users tend to be younger than registered voters. For other aspects, the panel generally reflects the composition of registered voters with some minor differences. In particular, the panel contains a higher proportion of Caucasian users while underrepresenting other racial groups. Additionally, male users and Republican users are slightly underrepresented in the Twitter panel.

Fig. 3

Comparison of the demographic composition of the Twitter panel with that of all registered voters. The age is calculated using 2017 as the reference year.

These comparative analyses offer valuable insights into the representativeness of our Twitter panel. The panel demonstrates reasonable alignment with the broader population of registered voters and voters on Twitter. However, researchers should exercise caution when interpreting results, particularly regarding potential biases in age distribution, gender representation, and especially racial composition.

Domain Localness Metric

In Fig. 1, we present the discrepancies between the observed user distribution and the baseline distribution, i.e., \({F}_{\delta ,s}{\log }_{2}({F}_{\delta ,s}/{F}_{s})\), across different states for three domains. The results indicate that our localness metric can effectively capture and quantify the audience patterns of these domains. To systematically validate our localness score, we focus on news media and compile five existing classifications of local and national news outlets. Table 4 provides a summary of the statistics and information for these datasets. While these lists primarily classify news outlets based on coverage and production perspectives, our approach emphasizes the audience perspective.
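The per-state terms plotted in Fig. 1 can be sketched in Python as follows. The sketch assumes that \({F}_{\delta ,s}\) is the fraction of the users sharing domain \(\delta\) who live in state \(s\) and that \({F}_{s}\) is the corresponding fraction in the baseline; the function name and the toy counts are hypothetical.

```python
import math


def state_contributions(domain_users, baseline_users):
    """Per-state terms F_ds * log2(F_ds / F_s).

    domain_users:   dict mapping state -> number of users sharing the domain
    baseline_users: dict mapping state -> number of users in the baseline
                    (assumed to cover every state in domain_users with a
                    non-zero count)
    """
    domain_total = sum(domain_users.values())
    baseline_total = sum(baseline_users.values())
    terms = {}
    for state, count in domain_users.items():
        f_ds = count / domain_total
        f_s = baseline_users[state] / baseline_total
        # By convention, a state with no observed users contributes 0.
        terms[state] = f_ds * math.log2(f_ds / f_s) if f_ds > 0 else 0.0
    return terms


# Toy example: a domain whose audience is concentrated in New York.
terms = state_contributions({"NY": 80, "CA": 20}, {"NY": 50, "CA": 50})
print(terms)
```

Positive terms mark states that are overrepresented in the domain's audience relative to the baseline, and negative terms mark underrepresented states. Summing the terms over all states gives the Kullback-Leibler divergence (in bits) between the two distributions.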

Table 4 Summary of existing classifications of local and national news outlets.

We merge all these datasets into a single dataset called meta-ln, which contains 12,905 unique domains. Domains are labeled as local or national when there is a consensus among the original sources. Only 40 domains (0.31%) have inconsistent classifications across different datasets. These inconsistencies mainly arise from the varying definitions adopted by different authors for some borderline outlets. For instance, abc7.com is labeled as national by Cronin et al. but as local by other datasets. We exclude these domains from our analysis and only keep the 4,853 news domains that are present in our dataset for further comparison.
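The consensus rule described above can be illustrated with a short Python sketch. The input format (one dictionary per source list, mapping a domain to "local" or "national") and the toy entries are assumptions made for illustration; they do not reproduce the actual source files.

```python
def merge_labels(sources):
    """Merge several domain -> label mappings under a consensus rule.

    Returns (consensus, conflicts): consensus maps each domain on which
    all sources that list it agree to that label; conflicts is the sorted
    list of domains with inconsistent labels, which are excluded.
    """
    seen = {}
    for source in sources:
        for domain, label in source.items():
            seen.setdefault(domain, set()).add(label)
    consensus = {d: next(iter(ls)) for d, ls in seen.items() if len(ls) == 1}
    conflicts = sorted(d for d, ls in seen.items() if len(ls) > 1)
    return consensus, conflicts


# Toy lists; abc7.com mirrors the kind of disagreement described in the text.
list_a = {"abc7.com": "national", "national.example": "national"}
list_b = {"abc7.com": "local", "local.example": "local"}
consensus, conflicts = merge_labels([list_a, list_b])
print(consensus, conflicts)
```

In this toy case, abc7.com ends up in the conflict list and is dropped, while the two remaining domains keep their labels.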

We utilize the Area Under the Receiver Operating Characteristic Curve (AUC) score to assess the alignment between \({{\mathcal{L}}}_{\delta }\) and the existing labels. The AUC score essentially measures the probability that our metric assigns higher \({{\mathcal{L}}}_{\delta }\) values to local domains compared to national domains (as identified by meta-ln). An AUC score of 0.5 indicates random classification, while a score of 1.0 signifies perfect ranking by \({{\mathcal{L}}}_{\delta }\). In our analysis, \({{\mathcal{L}}}_{\delta }\) achieves an AUC score of 0.983, indicating minimal discrepancies between meta-ln and \({{\mathcal{L}}}_{\delta }\). Although our metric captures different signals than meta-ln, the high agreement level validates the accuracy of our localness metric and the robustness of DomainDemo.
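The probabilistic reading of the AUC score can be made explicit with a small Python sketch that compares every local domain with every national domain. The function name and toy inputs are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than efficiency.

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen local domain
    (label 1) has a higher score than a randomly chosen national
    domain (label 0); ties count as one half."""
    local = [s for s, y in zip(scores, labels) if y == 1]
    national = [s for s, y in zip(scores, labels) if y == 0]
    if not local or not national:
        raise ValueError("both classes must be present")
    wins = 0.0
    for a in local:
        for b in national:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(local) * len(national))


# Toy localness scores: two local domains followed by two national ones.
print(auc([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
```

When every local domain outscores every national domain, the function returns 1.0; when the two groups are interleaved at random, it approaches 0.5.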

In addition to meta-ln, we compare our localness metric with that of Le Quéré et al.58. Le Quéré et al. also adopt a data-driven approach to quantify the localness of news domains from the audience perspective. Specifically, they quantify the “population reach” of news domains by measuring the distance between the locations of the outlets and the users following the outlets on Twitter while accounting for the population density. Since the population reach metric is continuous, we directly calculate its correlation with \({{\mathcal{L}}}_{\delta }\). The intersection between the list shared by Le Quéré et al. and ours has 1,342 domains and yields a Spearman correlation coefficient of 0.441 (p < 0.001), suggesting a moderate agreement between the two metrics.
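A comparison of this kind can be sketched in plain Python as follows: restrict both score lists to their common domains, then compute the Spearman coefficient as the Pearson correlation of average ranks. The dictionary inputs and function names are illustrative assumptions, and the sketch omits the p-value.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def correlate_common(ours, theirs):
    """Number of shared domains and their Spearman correlation."""
    common = sorted(set(ours) & set(theirs))
    rho = spearman([ours[d] for d in common], [theirs[d] for d in common])
    return len(common), rho
```

Because only ranks enter the calculation, the coefficient is unaffected by the different scales of the two metrics.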

Although \({{\mathcal{L}}}_{\delta }\) as a continuous value can capture nuanced differences between domains, dichotomizing the value can be beneficial in certain contexts. For news domains, we can use meta-ln to establish a reasonable threshold. By adjusting the threshold value for \({{\mathcal{L}}}_{\delta }\), we can compute the corresponding F1 score, which quantifies the agreement between \({{\mathcal{L}}}_{\delta }\) and the labels in meta-ln, and identify the optimal threshold that minimizes false positives and false negatives. Our calculations indicate that a threshold of 0.243 yields the highest F1 score of 0.978. When dealing with domains outside of meta-ln, researchers can first annotate a set of domains as local or national using meta-ln, and then use these labels to determine the optimal threshold of \({{\mathcal{L}}}_{\delta }\) for their specific study.
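The threshold search can be sketched in Python as below. The sketch assumes that local is the positive class and that a domain is predicted local when its score is greater than or equal to the threshold; neither convention is specified above, and the function names and toy values are hypothetical.

```python
def f1_at(scores, labels, threshold):
    """F1 score when domains with score >= threshold are called local (1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(scores, labels):
    """Try every observed score as a threshold; return (threshold, F1).

    Ties in F1 are broken in favor of the larger threshold.
    """
    best_f1, best_t = max((f1_at(scores, labels, t), t) for t in set(scores))
    return best_t, best_f1


# Toy annotated domains: two national (0) and two local (1).
print(best_threshold([0.1, 0.2, 0.6, 0.7], [0, 0, 1, 1]))
```

Only observed score values need to be tried as candidates, since the predicted labels, and hence the F1 score, cannot change between two adjacent observed values.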

Domain Audience Partisanship Metric

In a manner similar to the localness metric, we validate the audience partisanship metric by comparing it with the existing classification of domain political leaning, as detailed in Table 5. The table presents the number of domains common to both the reference dataset and our dataset, along with the Spearman correlation coefficients, all of which are positive and statistically significant at the 0.001 level. Our ratings show a high correlation with those of Bakshy et al.35, Eady et al.59, and Buntain et al.55, which are all audience-based scores derived from social media data. The correlation between our metric and other existing political leaning scores that focus on the sources themselves, such as Allsides and MBFC scores, is lower, suggesting that our metric captures different signals. Nonetheless, these findings demonstrate that our metric effectively captures the audience partisanship of various domains.

Table 5 Summary of the existing domain political leaning scores and their correlations with our audience partisanship scores.

As detailed in Table 3, we also offer a version of the audience partisanship metric derived from party registration information. This metric shows a strong correlation with the audience partisanship scores based on inferred partisanship, exhibiting a Spearman correlation coefficient of 0.917 (p < 0.001) across 129,041 overlapping domains. In Table 5, we further compare this metric with existing political leaning scores. As expected, it produces similar results to the audience partisanship scores based on inferred partisanship.

It is important to recognize that obtaining political leaning ratings for extensive sets of domains is challenging. Most existing datasets listed in Table 5 cover only a few hundred domains, with a couple of them covering over 2,000 domains, primarily focusing on news outlets. In contrast, our dataset includes scores for over 129,000 domains, encompassing a diverse array of websites beyond news sources.

By definition, our audience partisanship metric only encodes relative differences, meaning that the zero point does not necessarily indicate political neutrality. Users seeking a binary classification could consider identifying the least biased domains in their contexts and using them to calibrate our metric50.
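One possible calibration, shown here only as an assumption-laden Python sketch, shifts all scores so that the mean score of a user-chosen set of least biased reference domains becomes the new zero point. The procedure, function name, and toy scores are hypothetical and are not prescribed by the dataset.

```python
from statistics import mean


def calibrate(scores, reference_domains):
    """Shift scores so that the reference domains average to zero.

    scores:            dict mapping domain -> audience partisanship score
    reference_domains: domains the user regards as least biased
    """
    anchors = [scores[d] for d in reference_domains if d in scores]
    if not anchors:
        raise ValueError("no reference domain found in scores")
    zero_point = mean(anchors)
    return {d: s - zero_point for d, s in scores.items()}


# Hypothetical scores with one reference domain.
raw = {"a.example": 0.5, "b.example": 0.1, "c.example": -0.3}
calibrated = calibrate(raw, ["b.example"])
print(calibrated)
```

After the shift, the sign of a calibrated score indicates on which side of the chosen reference point a domain falls, which can serve as a binary label.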

Usage Notes

Our dataset offers a comprehensive view of domain sharing patterns on Twitter, capturing variations across demographic groups throughout an extended period. The demographic characteristics of the audiences also reveal distinctive patterns that illuminate the nature of the shared domains.

A key application of our dataset lies in examining the U.S. news media landscape. We refer researchers interested in this area to a curated list of news domains60, which can be integrated with our dataset to identify and analyze the news domains within our collection.
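A minimal Python sketch of such an integration is given below. It assumes that the dataset rows have been loaded as dictionaries with a domain key, as in the schema examples, and that the news list is an iterable of domain strings; the function name and the in-memory toy rows are illustrative only.

```python
def filter_news_rows(rows, news_domains):
    """Keep only the rows whose domain appears in the news-domain list.

    Matching is case-insensitive and ignores surrounding whitespace.
    """
    wanted = {d.strip().lower() for d in news_domains}
    return [r for r in rows if r["domain"].strip().lower() in wanted]


# Hypothetical rows and news list.
rows = [
    {"domain": "news.example", "shares": 120, "users": 40},
    {"domain": "shop.example", "shares": 75, "users": 30},
]
news_rows = filter_news_rows(rows, ["News.example "])
print(news_rows)
```

Normalizing both sides before matching guards against capitalization or whitespace differences between the two sources.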

Beyond news media, our dataset encompasses a diverse range of web domains. This includes news-like websites without established editorial standards, such as misinformation sites and “pink slime” websites61. The dataset also extends to various non-news domains, spanning government websites, organizational platforms, entertainment sites, and e-commerce portals.