DomainDemo: a dataset of domain-sharing activities among different demographic groups on Twitter

Yang, Kai-Cheng; Goel, Pranav; Quintana-Mathé, Alexi; Horgan, Luke; McCabe, Stefan D.; Grinberg, Nir; Joseph, Kenneth; Lazer, David

doi:10.1038/s41597-025-05604-6

Data Descriptor
Open access
Published: 16 July 2025

Scientific Data volume 12, Article number: 1251 (2025)

Abstract

Social media play a pivotal role in disseminating web content, particularly during elections, yet our understanding of the association between demographic factors and information sharing online remains limited. Here, we introduce a unique dataset, DomainDemo, linking domains shared on Twitter (X) with the demographic characteristics of associated users, including age, gender, race, political affiliation, and geolocation, from 2011 to 2022. This new resource was derived from a panel of over 1.5 million Twitter users matched against their U.S. voter registration records, facilitating a better understanding of a decade of information flows on one of the most prominent social media platforms and trends in political and public discourse among registered U.S. voters from different sociodemographic groups. By aggregating user demographic information onto the domains, we derive five metrics that provide critical insights into over 129,000 websites. In particular, the localness and partisan audience metrics quantify the domains’ geographical reach and ideological orientation, respectively. These metrics show substantial agreement with existing classifications, suggesting the effectiveness and reliability of DomainDemo’s approach.

Background & Summary

Social media play a significant role in the distribution of information^1,2,3, serving as essential sources of information for millions of users, especially in critical contexts such as elections^4,5 and public health crises^6,7,8,9. The content shared on social media originates not only from the users themselves but also from a wide array of sources. In particular, posting and re-sharing links to external websites (URLs) are key mechanisms for disseminating web content on social media^10,11,12,13. For some websites, this is a crucial way of getting traffic and for users to access their content^14,15,16,17. For instance, Chen et al. demonstrate that social media are responsible for a substantial portion of referred visits to thegatewaypundit.com, a popular far-right news website, by analyzing its web traffic data¹⁸.

The significance of social media in information distribution has attracted considerable research attention in recent years. However, our understanding of the role demographic factors play in this process remains limited^19,20, even though their importance in online discourse has been discussed in prior work^21,22,23,24. Typically, the data provided by platforms to researchers lack user-level demographic information^25,26,27. When researchers rely on user donations or surveys to collect demographic data, the sample size is often insufficient to provide meaningful aggregate insights about content-sharing patterns, particularly at the domain level. These challenges in data collection have created a significant gap in our ability to comprehensively analyze the interplay between demographic factors and online political discourse.

To bridge this gap, we introduce a novel dataset, DomainDemo, which quantifies domain-sharing activities across diverse demographic groups on Twitter. Although Twitter was rebranded as X in 2023, we will refer to the platform as Twitter in this article since our dataset predates the rebranding. Our data encompasses demographic details such as age, gender, race, political affiliation, and geolocation (U.S. state). These domain-sharing events are derived from a comprehensive dataset of over 1.5 million U.S.-based Twitter users matched with their voter registration records. Spanning 132 months (11 years) from May 2011 to April 2022, the dataset is organized in monthly intervals, enabling the analysis of temporal trends. Our released datasets allow researchers to investigate the association between demographic characteristics and the sharing patterns of diverse information sources, ranging from mainstream news websites to potential sources of misinformation.

We release two versions of the domain-sharing statistics dataset: DomainDemo-multivariate and DomainDemo-univariate. DomainDemo-multivariate splits the statistics into buckets defined by age, gender, race, political affiliation, state, and time altogether. In each bucket, we provide the number of shares, the number of unique users sharing the domain, and the Gini index²⁸ of the sharing count among users. DomainDemo-univariate includes five universes: state, race, gender, age, and political affiliation. Within each universe, we provide the number of shares, the number of unique users sharing the domains, and the Gini index of the sharing count among users for each category (e.g., age groups in the age universe). In addition to the domain-level statistics, we also provide the distribution of the population in different demographic buckets for both DomainDemo-multivariate and DomainDemo-univariate. These population-level distributions can serve as baselines.

DomainDemo-multivariate is similar in format to the Facebook URL dataset shared through Social Science One^29,30, which includes the number of views, shares, and reactions to various URLs across different demographic groups. However, there are some key differences. DomainDemo-multivariate focuses solely on a sample of Twitter users who are registered voters in the U.S., while the Facebook URL dataset includes data from all eligible users on the platform. Additionally, DomainDemo-multivariate only includes the number of shares and unique users at the domain level and does not incorporate any noise. However, our dataset provides more detailed demographic information compared to the Facebook URL dataset. Specifically, while the Facebook URL dataset only includes country-level geographic data, our dataset contains state-level geolocation information and includes race information.

In addition to the count statistics, we introduce five derived metrics that quantify how the demographics (age, gender, race, political affiliation, and geolocation) of users sharing a particular domain differ from those of the baseline. These metrics have specific interpretations that are useful for various research questions. For instance, the geolocation of the users allows us to measure the domains’ localness, i.e., the extent to which the sharing of a particular domain is geographically localized. Together with the user-sharing behavior data, the localness metric enables researchers to quantify the changing landscape of the local news industry. As a fundamental component of the U.S. democratic process, local news is uniquely positioned to report on local affairs and elections^31,32, but faces a declining trend over the years^33,34. Similarly, the user party affiliation in DomainDemo allows us to measure the audience partisanship of different domains. This metric can serve as a proxy for the political leaning of the domains, crucial for understanding online political discourse^35,36. Our derived metrics demonstrate strong alignment with established measures of localness and political leaning while significantly expanding coverage to over 129,000 domains, over ten times the number of domains in existing datasets. The metrics also uncover subtle variations in sharing patterns that previous binary or one-dimensional categorization schemes could not capture.

Due to the difficulty of obtaining data from social media, especially Twitter, in the post-API era³⁷, replicating our efforts is challenging. Even if access to Twitter data were to become available in the future, the platform itself has undergone significant changes. These factors make our dataset a unique and valuable contribution to the research community, as it provides a comprehensive view of domain-sharing behaviors across an 11-year period.

Methods

In this section, we describe how we create our dataset and provide case studies to interpret the data.

Twitter Panel

Our dataset is based on a panel of over 1.5 million registered U.S. voters on Twitter, created by our team in previous work. A pilot version of the panel was first used by Grinberg et al.³⁸, then the panel was expanded considerably by Shugars et al.²⁰ and validated by Hughes et al.³⁹ To create the panel, we start with the Twitter Decahose, a 10% random sample of all tweets, and identify 290 million accounts that post content between January 2014 and March 2017. We extract the names of the users, either from the Twitter handles or display names, and their location from the account profiles. This information is then matched against voter data provided by TargetSmart in October 2017, covering all 50 U.S. states and the District of Columbia. We compare the full name of each person in the voter file with the names of the Twitter accounts. If the full name has fewer than 10 exact matches, we then examine the location of the Twitter accounts. A Twitter account and voter record pair is accepted only if that is the only person in the specified city or state-level geographic area in both datasets. This reliance on full names and disclosed locations helps to eliminate many fake, automated (bot), and organizational accounts.

The data collection and matching of the Twitter panel were approved by the Northeastern University Institutional Review Board (protocol number: 17-12-13). Following the best practices outlined by Hemphill et al.⁴⁰, we employ data aggregation, anonymization, and access control measures to protect user privacy and minimize the risk of re-identification in our Twitter panel.

Matching to voter file records provides access to the geolocation, year of birth, gender, race/ethnicity, and partisanship of the users in the panel. We use state-level geolocation data from the voter files as our geographic unit of analysis. While we have access to more detailed location information, such as county-level data, releasing this information would risk re-identifying the users due to the low population density of many U.S. counties. State-level granularity offers a good compromise between the usefulness of the data and the privacy of the users. Using the year of birth, we determine the age of users at the time of sharing events and categorize them into the following age groups: “<18,” “18-29,” “30-49,” “50-64,” and “65+.” The category for users younger than 18 years old is included because some states allow 17-year-olds to pre-register to vote and some users might be younger than 18 at the time of sharing events. Gender is a binary measure provided by TargetSmart, which does not capture gender identities beyond the binary framework⁴¹. Race/ethnicity information is inferred by TargetSmart for most states and is categorized as “African-American,” “Asian,” “Hispanic,” and “Caucasian.” Other race categories with limited representation in the dataset are aggregated into a single “Other” category to minimize re-identification risks.
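The age grouping described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `age_group` is hypothetical, and the age is approximated as the difference between the event year and the year of birth, since only the birth year is mentioned in the text.

```python
from typing import Optional


def age_group(birth_year: Optional[int], event_year: int) -> str:
    """Assign a sharing event to one of the age groups named in the text."""
    if birth_year is None:
        return "Unknown"  # missing values are coded as "Unknown"
    age = event_year - birth_year  # assumption: approximate age from years only
    if age < 18:
        return "<18"
    if age <= 29:
        return "18-29"
    if age <= 49:
        return "30-49"
    if age <= 64:
        return "50-64"
    return "65+"
```

Because the age is computed at the time of each sharing event, the same user can fall into different age groups in different months of the dataset.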

TargetSmart provides two measures of partisanship: party registration and inferred partisanship. Party registration information in voter files is self-reported and aligns well with survey self-reporting⁴². However, this information is unavailable for 20 states (AL, AR, GA, HI, IL, IN, MI, MN, MO, MS, MT, ND, OH, SC, TN, TX, VA, VT, WA, and WI) in the TargetSmart data, which account for 42.7% of the Twitter users in our panel. When categorizing party registration information, we treat values for users in the 20 aforementioned states as missing. For the other 30 states and the District of Columbia, users registered as “Democrat” and “Republican” are coded accordingly. Due to variations in the classification of independent registered voters by state, we group individuals listed as “Independent,” “No party,” or “Unaffiliated” into a single “Independent” category. Members of minor parties, such as the Green Party and the Libertarian Party, are categorized as “Other.”

Based on party registration and other indicators, TargetSmart infers the probability of all individuals in all 50 states and the District of Columbia voting Democrat. We categorize individuals as Republican (0-0.35), Independent (0.35-0.65), and Democrat (0.65-1) using TargetSmart’s recommended thresholds to generate the inferred partisanship. For our data release and analysis, we use inferred partisanship as the primary measure since it covers all users (referred to as “party” hereafter). Additionally, we provide party registration information as a secondary measure (referred to as “party registration” or “partyreg” hereafter), as it conveys a slightly different signal and offers useful insights for certain analyses.
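The two partisanship codings can be illustrated with a short sketch. The function names, the handling of scores that fall exactly on 0.35 or 0.65, and the treatment of any unlisted registration label as "Other" are assumptions made for illustration; the text does not specify them.

```python
from typing import Optional

# States where party registration is unavailable in the TargetSmart data (from the text).
NO_PARTYREG_STATES = {
    "AL", "AR", "GA", "HI", "IL", "IN", "MI", "MN", "MO", "MS",
    "MT", "ND", "OH", "SC", "TN", "TX", "VA", "VT", "WA", "WI",
}
INDEPENDENT_LABELS = {"Independent", "No party", "Unaffiliated"}


def inferred_party(p_democrat: Optional[float]) -> str:
    """Bucket the inferred probability of voting Democrat into three groups."""
    if p_democrat is None:
        return "Unknown"
    if p_democrat < 0.35:  # assumption: boundary values go to the higher bucket
        return "Republican"
    if p_democrat < 0.65:
        return "Independent"
    return "Democrat"


def party_registration(raw_label: Optional[str], state: str) -> str:
    """Code a raw registration label, treating the 20 listed states as missing."""
    if raw_label is None or state in NO_PARTYREG_STATES:
        return "Unknown"
    if raw_label in ("Democrat", "Republican"):
        return raw_label
    if raw_label in INDEPENDENT_LABELS:
        return "Independent"
    return "Other"  # assumption: every remaining label is a minor party
```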

Missing values in all dimensions are coded as “Unknown.”

Domain-sharing Statistics

We collect posts from users in the panel spanning from May 2011 to April 2022. We extract the links shared by these users, expand the shortened links when possible, and identify the corresponding domains (e.g., nytimes.com for The New York Times). This process allows us to determine which user shares what domains and when. Sharing events, as defined in our study, include posting links in original tweets and retweeting or quoting tweets containing links. To reduce noise and the risks of re-identification, we include only domains that are shared by at least 50 unique users throughout the entire period.
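A rough sketch of the domain extraction and the 50-user threshold is given below. It is a simplification under stated assumptions: the helper names and the column names `user_id` and `domain` are hypothetical, link expansion is assumed to have happened already, and the host name is used as the domain after stripping a leading `www.`, whereas a real pipeline would likely need a public-suffix list to handle subdomains.

```python
from urllib.parse import urlsplit

import pandas as pd


def url_to_domain(url: str) -> str:
    """Return the lower-cased host of an expanded URL without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def filter_domains(events: pd.DataFrame, min_users: int = 50) -> pd.DataFrame:
    """Keep only sharing events for domains shared by at least `min_users` unique users."""
    users_per_domain = events.groupby("domain")["user_id"].nunique()
    keep = users_per_domain[users_per_domain >= min_users].index
    return events[events["domain"].isin(keep)]
```

For example, `url_to_domain("https://www.nytimes.com/section/politics")` yields `nytimes.com`, matching the example in the text.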

We integrate the demographic information of users with their domain-sharing records to construct a comprehensive table. This domain-sharing event table includes the following columns: user_id, domain, age, gender, race, party, party registration, state, and year-month. Each row corresponds to a single sharing event, with users who share the same domains multiple times contributing multiple rows. Due to the presence of user identifiers, we cannot release this detailed table. Instead, we provide aggregate statistics derived from this table including DomainDemo-multivariate and DomainDemo-univariate.

DomainDemo-multivariate describes the domain sharing behavior of users across different demographic dimensions simultaneously. It includes several variants designed to facilitate different types of research analyses. The most granular variant is the monthly distribution data at the domain level, which is produced by grouping the domain-sharing event table by domain, age, gender, race, party (party registration is excluded here), state, and year-month. In each bucket, we calculate the following statistics: the number of shares, the number of unique users who share the domain, and the Gini index of the sharing count across users. Formally, the Gini index G for a domain is calculated as:

$$G=\frac{1}{2N^{2}\bar{x}}\sum_{i=1}^{N}\sum_{j=1}^{N}|x_{i}-x_{j}|, \tag{1}$$

where N is the number of users who share the domain, x_i is the number of shares by the i-th user, and $\bar{x}=\sum_{i=1}^{N}x_{i}/N$ is the average number of shares per user. G ranges from 0 to 1, with 0 indicating equal sharing and values close to 1 indicating that a few users share the domain disproportionately. Note that we set the Gini index to 1 for domains shared by only one user in the bucket. We include the Gini index to help researchers understand the inequality of sharing events across users without releasing detailed information about these users due to privacy concerns.
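As an illustration of Eq. (1), the following sketch computes the Gini index from a list of per-user share counts. The function name `gini` is hypothetical, and the pairwise-difference matrix is a direct but memory-hungry (quadratic in N) transcription of the formula rather than an optimized implementation.

```python
import numpy as np


def gini(shares_per_user) -> float:
    """Gini index of per-user share counts within one bucket, following Eq. (1)."""
    x = np.asarray(shares_per_user, dtype=float)
    n = x.size
    if n == 0:
        raise ValueError("at least one user is required")
    if n == 1:
        return 1.0  # convention stated in the text for single-user buckets
    pairwise = np.abs(x[:, None] - x[None, :]).sum()  # sum over all (i, j) pairs
    return float(pairwise / (2 * n**2 * x.mean()))
```

For two users with 1 and 3 shares, the pairwise sum is 4, N = 2, and the mean is 2, so G = 4 / (2 · 4 · 2) = 0.25.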

In addition to the demographic distribution of users sharing each domain, it is useful to understand the distribution of the whole population in many cases. Therefore, we also include a “baseline” variant that groups the domain-sharing event table by age, gender, race, party, state, and year-month. In each bucket, we calculate the same statistics as the monthly distribution data. We also provide the average number of unique domains shared by users in each demographic bucket and the corresponding standard deviation. On top of the monthly data, we further provide the distribution and baseline data covering the whole time period. In total, DomainDemo-multivariate includes four variants.

DomainDemo-univariate is generated by aggregating the sharing events across all demographic dimensions except for the one of interest. For example, the state univariate data (referred to as the “state universe”) is produced by aggregating the sharing events across all age, gender, race, and party dimensions, resulting in the statistics of sharing events within different states. In each bucket of DomainDemo-univariate, we provide three statistics: the number of shares, the number of unique users who share the domain, and the Gini index of the sharing count across users. And similar to DomainDemo-multivariate, DomainDemo-univariate includes four variants: monthly distribution, monthly baseline, all-time distribution, and all-time baseline.
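The aggregation behind the state universe can be sketched with a pandas group-by over a toy event table. The column names (`user_id`, `domain`, `state`, `year_month`) and function names are hypothetical stand-ins, and the Gini index is omitted here to keep the example short.

```python
import pandas as pd

# Toy stand-in for the (unreleased) domain-sharing event table.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "domain": ["a.com", "a.com", "a.com", "a.com", "b.com", "b.com"],
    "state": ["MA", "MA", "MA", "OK", "OK", "OK"],
    "year_month": ["2020-01"] * 6,
})


def state_distribution(events: pd.DataFrame, monthly: bool = True) -> pd.DataFrame:
    """Shares and unique users per domain and state (optionally per month)."""
    keys = ["domain", "state"] + (["year_month"] if monthly else [])
    return (events.groupby(keys)
            .agg(shares=("user_id", "size"), users=("user_id", "nunique"))
            .reset_index())


def state_baseline(events: pd.DataFrame, monthly: bool = True) -> pd.DataFrame:
    """Same statistics across all domains, giving the population baseline."""
    keys = ["state"] + (["year_month"] if monthly else [])
    return (events.groupby(keys)
            .agg(shares=("user_id", "size"), users=("user_id", "nunique"))
            .reset_index())
```

Toggling `monthly` between `True` and `False` in the two functions corresponds to the four variants named above: monthly distribution, monthly baseline, all-time distribution, and all-time baseline.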

The detailed data schema of the released versions of DomainDemo-multivariate and DomainDemo-univariate can be found in the Data Records section.

Derived Metrics for Domains

Based on the sharing behavior of users in different demographic groups, we derive additional metrics that quantify different aspects of the domains.

Domain Localness Metric

Local news is a fundamental component of the U.S. democratic process. It is uniquely positioned to report on local affairs and elections, enabling citizens to engage in local political activities and hold their elected officials accountable^31,32. However, the landscape of local journalism is undergoing significant changes, marked by a notable decline in local agencies, often referred to as the emergence of the “news desert”^33,34. This trend threatens the vibrancy of local political participation and raises concerns about the overall health of democracy^43,44.

To empirically understand the dynamics of news consumption and related phenomena, it is essential to reliably categorize news outlets as either local or national. Despite extensive research efforts, a universally accepted definition of local news organizations remains elusive⁴⁵. Many studies on local news often fail to provide a clear definition or specific criteria for classification⁴⁶, complicating efforts to expand the scope of classification and hindering the replication of analyses.

Here, we leverage the state universe data from DomainDemo-univariate to derive a data-driven metric that quantifies the “localness” of news domains. This is achieved by calculating the deviation of the user distribution of each domain in different states from the baseline distribution, both of which are provided in DomainDemo-univariate.

For a formal definition, let C_s represent the number of unique users in state s across all domains, and F_s represent the corresponding frequency, where F_s = C_s/∑_sC_s. F_s characterizes the baseline distribution of the whole population. For a domain δ, we calculate the user frequency in state s, F_δ,s = C_δ,s/∑_sC_δ,s, where C_δ,s represents the number of unique users in state s who share the domain δ. For domains shared by diverse users, the observed distribution F_δ,s should closely align with the baseline distribution F_s across different states. However, deviations from the baseline distribution are expected for domains with a more concentrated audience.

Following this intuition, we quantify the deviation of a domain, denoted by ${{\mathcal{L}}}_{\delta }$, using the Kullback-Leibler (KL) divergence between F_δ,s and F_s:

$$\mathcal{L}_{\delta}=D^{(KL)}(F_{\delta,s}\,\|\,F_{s})=\sum_{s}F_{\delta,s}\log_{2}\frac{F_{\delta,s}}{F_{s}}, \tag{2}$$

where $F_{\delta,s}\log_{2}(F_{\delta,s}/F_{s})$ measures the discrepancy in state s between the observed sharing pattern of domain δ and the baseline distribution. $\mathcal{L}_{\delta}$ is a non-negative value that is minimized at zero when F_δ,s and F_s are identical. In other words, national news domains should have $\mathcal{L}_{\delta}$ close to zero, while local news domains should have larger $\mathcal{L}_{\delta}$ values. In the Technical Validation section, we show that $\mathcal{L}_{\delta}$ is a good proxy for the localness of news domains.

A limitation of ${{\mathcal{L}}}_{\delta }$ is that it can only indicate the deviation of a domain’s sharing pattern from the baseline distribution. To reveal which states are over-represented or under-represented, one needs to further inspect the values of F_δ,s and F_s.
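A minimal sketch of the localness calculation in Eq. (2) follows. The function name `kl_deviation` and the dictionary inputs (state to unique-user count) are hypothetical, and states in which a domain has no users are assumed to contribute zero to the sum.

```python
import numpy as np


def kl_deviation(domain_counts: dict, baseline_counts: dict) -> float:
    """KL divergence (base 2) between a domain's user distribution and the baseline."""
    categories = list(baseline_counts)
    c_dom = np.array([domain_counts.get(c, 0) for c in categories], dtype=float)
    c_base = np.array([baseline_counts[c] for c in categories], dtype=float)
    f_dom = c_dom / c_dom.sum()     # F_{delta,s}
    f_base = c_base / c_base.sum()  # F_s
    mask = f_dom > 0                # assumption: 0 * log(0) is treated as 0
    return float(np.sum(f_dom[mask] * np.log2(f_dom[mask] / f_base[mask])))
```

With two equally sized baseline states, a domain shared only in one of them gets a score of log2(2) = 1, while a domain whose users mirror the baseline gets 0. Inspecting `f_dom / f_base` per state is one way to see which states are over- or under-represented.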

Domain Audience Partisanship Metric

A healthy democratic society requires the public to receive accurate and unbiased news and civic information, especially during election seasons⁴⁷. However, the presence of partisan online news and phenomena such as echo chambers and filter bubbles remain concerns^35,48. To address these issues, researchers have investigated the political biases embedded in online platforms, including search engines like Google⁴⁹ and social media platforms like Facebook³⁵, Twitter⁵⁰, and YouTube⁵¹. Other relevant research has focused on how users interact with different information sources and their consumption patterns^36,52,53. Such analyses generally involve assessing the political leanings of numerous domains, but such datasets have been rare and often lack comprehensive coverage (see discussion in the Technical Validation section).

Here, we employ the party (and party registration) universe data from DomainDemo-univariate to create data-driven metrics that assess the audience partisanship of domains. We focus on Democrat and Republican and exclude Independent users. The number of users from each party allows us to quantify the partisanship of the audience for each domain. It is important to note that our audience-based metrics do not evaluate the content characteristics of these domains. However, previous research indicates that audience characteristics are closely associated with the leanings of these domains^54,55.

Formally, the audience partisanship score ${{\mathcal{P}}}_{\delta }$ of a domain δ is calculated as follows:

$$\mathcal{P}_{\delta}=\frac{\frac{C_{\delta,r}}{C_{r}}-\frac{C_{\delta,d}}{C_{d}}}{\frac{C_{\delta,r}}{C_{r}}+\frac{C_{\delta,d}}{C_{d}}}, \tag{3}$$

where C_δ,r and C_δ,d (available in the distribution variant of DomainDemo-univariate) represent the number of unique users from the Republican and Democrat parties who share the domain δ, respectively. C_r and C_d (available in the baseline variant of DomainDemo-univariate) represent the total number of unique users in the Republican and Democrat parties who share any domain, respectively. Since a user can share multiple domains, we have C_δ,r ≤ C_r ≤ ∑_iC_i,r and C_δ,d≤ C_d ≤ ∑_iC_i,d. ${{\mathcal{P}}}_{\delta }$ is a continuous value between −1 and +1, where −1 means the domain δ is exclusively shared by Democratic users and +1 means δ is exclusively shared by Republican users.
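Eq. (3) can be written as a small function. The name `audience_partisanship` and its argument names are hypothetical; the inputs are the four unique-user counts defined above.

```python
def audience_partisanship(c_dom_r: int, c_dom_d: int, c_r: int, c_d: int) -> float:
    """Audience partisanship score of Eq. (3), ranging from -1 (Democrat) to +1 (Republican)."""
    rate_r = c_dom_r / c_r  # share of Republican users who shared the domain
    rate_d = c_dom_d / c_d  # share of Democratic users who shared the domain
    if rate_r + rate_d == 0:
        raise ValueError("domain has no Republican or Democratic users")
    return (rate_r - rate_d) / (rate_r + rate_d)
```

Normalizing by C_r and C_d means that a domain shared by 10 of 100 Republican users and 20 of 200 Democratic users scores 0, even though more Democratic users shared it in absolute terms.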

Other Metrics

In addition to the localness and audience partisanship metrics, we release three more audience-based metrics: age deviation, race deviation, and gender leaning, to help researchers understand the sharing patterns conditioned on these demographic variables. The age and race deviation metrics are calculated using Eq. (2), where the state categories are replaced with the age or race categories. These metrics quantify how concentrated the audience is in certain age or race groups. The gender leaning metric is calculated using Eq. (3), where the party categories are replaced with the gender categories. Similar to the audience partisanship metric, the gender leaning metric is also a continuous value between −1 and +1, where −1 means the domain is exclusively shared by male users and +1 means the domain is exclusively shared by female users.

The calculation of these metrics is very flexible. While we primarily use the unique number of users in both Eqs. (2) and (3), our experiments demonstrate that using the number of shares produces highly correlated results. The metrics can also be calculated over different time periods. In this paper, we present results for the entire time period in the released version, case studies, and validation. To facilitate reproducibility and customization, we provide the code for calculating these metrics, allowing readers to modify the formulas according to their specific needs.

Our formulas in Eqs. (2) and (3) have a limitation: they rely solely on user distribution without accounting for variations in sharing patterns across demographic groups. For example, our analysis reveals that Democratic users share more diverse domains than Republican users, averaging 74.9 unique domains compared to 54.5 across the whole period. To enable researchers to develop more sophisticated metrics that incorporate these behavioral differences, we provide the mean number of unique domains shared by users in each demographic category and the corresponding standard deviations in the baseline variants of our datasets.

Case Studies

To help the readers interpret the derived metrics, we present the distributions for all domains in the dataset and provide case studies for three example domains in Fig. 1.

Firstly, cnn.com, a national news outlet, has a user base closely aligned with the baseline. Consequently, its localness (${{\mathcal{L}}}_{\delta }=0.013$), race deviation (${{\mathcal{L}}}_{\delta }=0.002$), age deviation (${{\mathcal{L}}}_{\delta }=0.049$), and gender leaning (${{\mathcal{P}}}_{\delta }=-\,0.033$) scores are near zero. cnn.com is shared more often by Democratic users and less often by Republican users than the baseline, resulting in an audience partisanship score of −0.132.

The second example, news9.com, is a local news outlet in Oklahoma City, Oklahoma. It is shared by fewer users than cnn.com and has a localness score of 2.072, indicating a localized audience. Figure 1(c) shows that news9.com is over-represented in Oklahoma, confirming its local nature. Additionally, news9.com is shared more often by Republican users and less often by Democratic users compared to the baseline, leading to an audience partisanship score of 0.297. Its user base has race ($\mathcal{L}_{\delta}=0.051$) and gender ($\mathcal{P}_{\delta}=0.026$) profiles similar to the baseline, but the domain is shared more often by older users ($\mathcal{L}_{\delta}=0.086$).

The third example is wickedlocal.com, a local news source in Boston, Massachusetts. Figure 1(d) indicates that it is over-represented in Massachusetts, consistent with its localness score of 2.221. Unlike news9.com, wickedlocal.com is shared more often by Democratic users (${{\mathcal{P}}}_{\delta }=-0.387$) and even more often by older users (${{\mathcal{L}}}_{\delta }=0.246$). Otherwise, the user base of wickedlocal.com has a similar profile in terms of race (${{\mathcal{L}}}_{\delta }=0.071$) and gender (${{\mathcal{P}}}_{\delta }=0.060$) to that of news9.com.

Due to space constraints, we can only provide three case studies here. We have released an interactive app to allow readers to explore the patterns of other domains in our dataset at domaindemo.info.

Data Records

Data Access

Our dataset is available on Zenodo (https://doi.org/10.5281/zenodo.15151613)⁵⁶. Given the sensitive nature of the information about Twitter users in our datasets, we have implemented layered access controls. Since DomainDemo-multivariate and DomainDemo-univariate can potentially reveal the identities of Twitter users in the dataset when combined with other datasets, restrictions are imposed on the access to them. Specifically, researchers must complete an application process and sign a data use agreement that prohibits the identification of individual Twitter users and re-distribution of the data. Those interested in accessing these datasets can follow the instructions on the Zenodo page. The derived metrics of the domains, such as localness and audience partisanship scores, are made publicly available.

Data Format

Figure 2 illustrates the folder structure of the DomainDemo dataset. Due to file count limitations on the data hosting platform, each root folder is distributed as a compressed archive. After downloading and extracting these archives, users will find the subfolders and files organized according to the structure depicted in Fig. 2. All data files are provided in CSV format and compressed using the Gzip algorithm for efficient storage and transmission. Users with access can load and analyze them using their preferred programming languages, such as Python and R. In the corresponding code repository (see details in the Code Availability section), we provide example scripts to work with the data. In the following, we provide the schema of the tables in DomainDemo.