Background & Summary

Scientific disruptions, characterized by their capacity to challenge existing paradigms and redefine research trajectories, serve as critical catalysts for technological breakthroughs and societal transformation1,2,3. Traditional bibliometric indicators (e.g., citation counts) remain limited in capturing the innovative and revolutionary nature of such work, as they predominantly reflect incremental contributions that reinforce existing knowledge rather than displacing it3,4. This gap motivated the development of the Disruption Index (DI), a metric quantifying the extent to which research eclipses foundational references and replaces established practices5.

The DI operates on a tripartite citation framework (Fig. 1), analyzing relationships among target articles (current generation, square), their cited references (past generation, diamond), and subsequent citing works (future generation, circle). Bibliographic coupling distinguishes two contribution types: consolidating contributions (D = −1), where subsequent works co-cite both the target article and its referenced prior works, signaling incremental advancement; and disruptive contributions (D = 1), where subsequent works cite only the target article, indicating divergence from prior knowledge trajectories6. Valid application requires sufficient citation context3,7, with empirical studies recommending thresholds of ≥5 references and ≥5 citations2,8.

Fig. 1
figure 1

Simplified illustration of the Disruption Index. Note: Schematic representation of disruption index (DI) values in a citation network with a target article (■), its references (♦), and the resulting citing articles (●).

Recent methodological refinements have produced specialized DI variants that address limitations of the original Disruption Index9. While the earliest alternative, \({{\rm{m}}{DI}}_{1}\)5, remains relatively understudied, subsequent studies focus on distinct challenges. Variants such as \({{DI}}_{1n}\)10 and \({{DI}}_{{\rm{X}} \% }\)11 specifically mitigate noise from highly cited references. Conversely, alternative approaches, including \({{DI}}^{{no}R}\)12, DEP13,14, and \({O{rig}}_{{base}}\)15, eliminate dependency on citation thresholds by exclusively analyzing subsequent works that directly cite the focal paper. Moving beyond the conventional binary framework that treats disruption and consolidation as opposing constructs, Chen et al.16 conceptualized them as distinct dimensions and developed corresponding indices (D for Destabilization and C for Consolidation), thereby introducing the consolidation–disruption disentanglement indexes.

Despite these methodological improvements, existing DI indexes remain susceptible to several types of bias9. First, the use of NR in the denominator introduces inconsistency: NR can exert different effects depending on whether the numerator is positive or negative, which contradicts its theoretical interpretation that consolidating qualities should consistently yield lower or negative scores. Second, the time-dependency of disruption scores remains unresolved, although Bornmann and Tekles17 have recommended a minimum citation window of three years. Third, empirical evidence suggests that the relationship between the number of cited references and disruption scores is non-linear and modulated by discipline, publication age, and the length of the citation window. Fourth, biases may also arise from incomplete coverage within bibliometric databases.

To address these methodological gaps, this study establishes a comprehensive benchmark dataset integrating the Web of Science (WoS), Dimensions, and OpenCitations, hereafter referred to as Cross-source Disruption Indexes (CrossDI)18. The dataset encompasses both established fields (e.g., Synthetic Biology, Astronomy & Astrophysics) and emerging domains (e.g., Blockchain-based Information Systems, Socio-Economic Impacts of Biological Invasions). Within this framework, we systematically compute the full suite of disruption indexes discussed above, including DI, \({{\rm{m}}{DI}}_{1}\), \({{DI}}_{1n}\), \({{DI}}_{{\rm{X}} \% }\), \({{DI}}^{{no}R}\), DEP, \({O{rig}}_{{base}}\), as well as the Destabilization (D) and Consolidation (C) indices, for each article in the dataset, tracking the annual disruption indexes for every year following publication up to 2023. In addition, we retain and report key intermediate results and parameters underlying these calculations.

By integrating citation data across multiple databases and providing standardized intermediate variables relevant to different disruption indexes, our dataset enables comprehensive analysis of several types of bias, including database coverage, time window selection, and discipline-specific effects. Ultimately, this resource supports deeper investigation into the intrinsic characteristics and methodological sensitivities of disruption indexes, facilitating more robust and comparative bibliometric studies.

It should be noted that our dataset is supported by our previous works:

  1. (1)

    Xu et al.19 developed a regular expression-based method to systematically identify and automatically correct various typical DOI errors in cited references from the WoS database, thereby significantly improving the quality of citation data.

  2. (2)

    Xu et al.20 compared the Disruption Index across the WoS, Dimensions, and OpenCitations, finding that the Dimensions is a more reliable open alternative to the WoS than the OpenCitations.

Methods

The CrossDI dataset18 is constructed through a systematic workflow that integrates bibliographic metadata from multiple sources. As illustrated in Fig. 2, the generation process comprises three primary phases: (1) Multi-source metadata collection, involving the retrieval of seed articles and their complete citation networks from the WoS, Dimensions, and OpenCitations; (2) Data preprocessing, where key metadata fields, primarily DOIs and publication years, are cleaned and harmonized to ensure cross-database consistency; and (3) Computation of the disruption indexes, which calculates a family of disruption measures from annual citation network snapshots for temporal and cross-source analysis.

Fig. 2
figure 2

A framework for cross-database computation of disruption indexes.

Multi-source metadata collection

The CrossDI dataset is constructed through a systematic integration of bibliographic metadata from three major sources: Web of Science Core Collection (https://www.webofscience.com), Dimensions (https://www.dimensions.ai/), and OpenCitations (https://search.opencitations.net/). The seed articles for the dataset are derived from four distinct research areas. For established fields, the SynBio dataset (2003–2012) comprises 2,584 article records obtained by executing the reproducible search strategy provided by Porter et al.21 and Xu et al.22, while the Astro dataset (2003–2010) utilizes the curated bibliographic data assembled by Gläser et al.23 and Xu et al.24. For emerging domains, the Blockchain-based Information System Management dataset (2019–2022) was compiled from the reference list of Lei and Ngai25, and the Socio-Economic Impacts of Biological Invasions dataset (2019–2022) was built by applying the search strategy from Diagne et al.26. A stratified random sample of 10 articles per publication year is drawn from each of these source collections to form the final set of target articles. The complete list of DOIs for the final set of 260 target articles is provided as part of the CrossDI dataset18.

For each seed article, the complete citation network, including metadata, cited references, and citing articles, is retrieved from all three sources. Metadata from the WoS is manually exported, while data from the Dimensions is collected programmatically via its API (https://app.dimensions.ai/api/dsl/v2), and data from the OpenCitations is collected via its API. The integration of these heterogeneous data streams is governed by a consistent, DOI-centric methodology. Citation relationships are systematically constructed through DOI linkages across three dimensions: (1) from target articles to their cited references, (2) from target articles to their citing articles, and (3) between the cited references and the citing articles. Citing-article coverage is truncated at the end of 2023; only citations appearing through 2023 are included.

DOI extraction requires distinct approaches across sources: the WoS and Dimensions necessitate metadata parsing for DOI retrieval, whereas the OpenCitations provides direct citation linkages through its dual-dataset architecture. The OpenCitations infrastructure maintains two principal resources: the OpenCitations Index (documenting citing-cited entity pairs with temporal metadata; https://api.opencitations.net/index/v2) and OpenCitations Meta (containing rich bibliographic records; https://api.opencitations.net/meta/v1), as detailed by Heibi et al.27. This structural distinction enables more efficient citation network reconstruction from the OpenCitations compared to other sources requiring DOI extraction from heterogeneous metadata fields.
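As an illustration of this retrieval step, the following minimal Python sketch queries the OpenCitations Index API mentioned above for works citing a given DOI. The exact route (`/citations/doi:{doi}`) and the response field (`citing`) are assumptions about the public API and may need adjusting to the deployed schema.

```python
import requests

OC_INDEX_API = "https://api.opencitations.net/index/v2"  # endpoint named above

def fetch_citing_dois(doi):
    """Return DOIs of works citing `doi` according to the OpenCitations Index.

    The route and field names below are assumptions about the public API;
    adapt them to the schema actually exposed by the service.
    """
    resp = requests.get(f"{OC_INDEX_API}/citations/doi:{doi}", timeout=30)
    resp.raise_for_status()
    citing = []
    for record in resp.json():
        # The `citing` field is assumed to hold space-separated identifiers
        # such as "omid:br/... doi:10.xxxx/yyyy"; keep only the DOI part.
        for token in record.get("citing", "").split():
            if token.startswith("doi:"):
                citing.append(token[len("doi:"):])
    return citing

# Example usage (any valid DOI):
# print(fetch_citing_dois("10.1128/jb.186.13.4276-4284.2004")[:5])
```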

To mitigate errors from ‘non-linked’ records, i.e., publications that lack standard identifiers (e.g., DOIs) in sources such as the WoS, we enforce a deterministic inclusion rule: only references and citing items with valid, matchable DOIs are admitted to the integrated network. This approach ensures a consistent and reproducible matching logic across all sources, minimizes the risk of false-positive citation links, and guarantees a fair comparison among the three sources. Furthermore, a data quality filter is applied at the article level, retaining only those seed articles that possess at least five cited references and five citing articles2 identifiable via DOIs, thereby guaranteeing a meaningful citation structure for subsequent use.

Data preprocessing

Persistent DOI inconsistencies in the WoS induce systemic citation linkage errors, compromising analytical validity19,28. While automated cleaning19 rectifies detectable errors, residual ambiguity persists when multiple DOIs exhibit equivalent plausibility. Our cross-database processing procedure reveals six multi-DOI phenomena20: (1) Multi-publisher attribution; (2) Derivative material linkage; (3) Multilingual versioning; (4) Component disaggregation; (5) Serialized articles; and (6) Erroneous assignment. As detailed in Algorithm 1, for categories 1–5 (non-erroneous multi-DOIs), the DOI ranked first in alphabetical order is selected as the preferred one. In addition, publication years are frequently missing or even conflicting across databases; in such cases the most recent year is kept, since this simple rule corrects the vast majority of missing or conflicting publication years (see below).
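The two reconciliation rules can be expressed compactly. The sketch below is a minimal illustration of the alphabetical DOI selection and max-year reconciliation described above, not a reproduction of Algorithm 1 itself; function names are illustrative.

```python
def canonical_doi(doi_group):
    """Select the preferred DOI for a multi-DOI group: rank 1 in alphabetical order."""
    return sorted(d.strip().lower() for d in doi_group)[0]

def reconcile_year(years):
    """Resolve missing or conflicting publication years by keeping the most recent one."""
    known = [y for y in years if y is not None]
    return max(known) if known else None

# Two DOI strings resolving to the same article, and years reported by three sources.
print(canonical_doi(["10.1128/mmbr.40.3.722-756.1976", "10.1128/br.40.3.722-756.1976"]))
# -> 10.1128/br.40.3.722-756.1976
print(reconcile_year([2004, None, 2005]))  # -> 2005
```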

This reproducible decision framework is crucial for the large-scale harmonization of bibliographic records. Notably, it directly addresses citation redundancy on platforms such as Dimensions, where multiple DOIs may be assigned to an identical work, resulting in artificial inflation of citation counts. By consolidating DOIs from multiple sources, our approach mitigates such database-induced inflation and thereby enhances the computational validity of disruption metrics. This normalization procedure not only reconciles metadata heterogeneity across the WoS, Dimensions, and OpenCitations but also establishes an interoperable foundation, ensuring methodological consistency and facilitating cross-platform comparability of disruption indexes.

Notwithstanding these strengths, we acknowledge the inherent arbitrariness of our preprocessing rules. The selection of the alphanumerically first DOI as the canonical identifier, while systematic, may not always represent the authoritative version of record. Similarly, the preference for the most recent publication year, though effective in resolving conflicts, may in some cases differ from the date of initial online availability. We recognize these as intentional methodological trade-offs, where we have prioritized scalability and reproducibility for cross-database analysis over context-specific precision in every instance.

Calculation of the disruption indexes

The final dataset assembly involves constructing local citation networks for each target article using the DOI linkages. To support temporal analysis, annual citation network snapshots are built from each target article's publication year through 2023. Within each snapshot, nine disruption index variants are pre-calculated, grouped into four methodological families as detailed below. All measures are computed independently for the WoS, Dimensions, and OpenCitations databases.
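Each annual snapshot simply restricts the citing side of the network to works published up to a cutoff year. A minimal sketch follows, with illustrative field names rather than the dataset's actual internal representation.

```python
from dataclasses import dataclass

@dataclass
class CitationNetwork:
    references: set        # DOIs cited by the target article
    citing_year: dict      # citing DOI -> publication year
    citing_refs: dict      # citing DOI -> set of DOIs it cites

def snapshot(net, cutoff_year):
    """Keep only citing articles published in or before `cutoff_year`."""
    keep = {d for d, y in net.citing_year.items() if y <= cutoff_year}
    return CitationNetwork(
        references=set(net.references),
        citing_year={d: net.citing_year[d] for d in keep},
        citing_refs={d: net.citing_refs.get(d, set()) for d in keep},
    )

# Yearly snapshots from the publication year through 2023:
# snapshots = {y: snapshot(net, y) for y in range(pub_year, 2024)}
```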

Core and impact-weighted indexes

Funk and Owen-Smith5 introduced the DI1 index, with its calculation formula illustrated in Fig. 3. Consider a focal article (represented by a black square) and its references (depicted as diamonds). The citing articles can be categorized into three groups: those citing only the focal article (single-ring circles formed by dotted lines, counted as NF), those citing only the focal article's references (single-ring circles formed by solid lines, counted as NR), and those citing both the focal article and its references (double-ring circles, counted as NB).

$${{DI}}_{1}=\frac{{N}_{F}-{N}_{B}}{{N}_{F}+{N}_{B}+{N}_{R}}$$
(1)
Fig. 3
figure 3

Graphical representation of calculating the \({{DI}}_{1}\) index.

The original index DI1 ranges within [−1, 1] and reflects the direction (disruptive vs. consolidating) but not the magnitude of impact.
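For concreteness, a minimal sketch of Eq. (1): given the set of works citing the focal article and the set of works citing at least one of its references, the three counts and DI1 follow directly. The set representation is an illustrative choice, not the dataset's internal format.

```python
def di1(citing_focal, citing_refs):
    """DI1 = (NF - NB) / (NF + NB + NR), per Eq. (1).

    citing_focal: DOIs of works citing the focal article.
    citing_refs:  DOIs of works citing at least one of the focal article's references.
    """
    n_b = len(citing_focal & citing_refs)   # cite both the focal article and its references
    n_f = len(citing_focal - citing_refs)   # cite only the focal article
    n_r = len(citing_refs - citing_focal)   # cite only the references
    denom = n_f + n_b + n_r
    return (n_f - n_b) / denom if denom else None

# Toy example: 3 works cite only the focal article, 1 cites both, 2 cite only its references.
print(di1({"c1", "c2", "c3", "c4"}, {"c4", "c5", "c6"}))  # (3 - 1) / 6 = 0.333...
```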

Funk and Owen-Smith5 also proposed the impact-weighted CD index (\({{\rm{m}}{DI}}_{1}\)), with the following formula:

$${{\rm{m}}{DI}}_{1}={m}_{t}\ast \frac{{N}_{F}-{N}_{B}}{{N}_{F}+{N}_{B}+{N}_{R}}$$
(2)

Algorithm 1

Data preprocessing (Reuse prior DOI cleaning + Alphabetical DOI selection + Max-year reconciliation).

Here, mt counts only citations of the target article. This variant mixes direction with magnitude; its scale depends on mt and is therefore not bounded in [−1,1].
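Under the reading that mt equals the target article's citation count (NF + NB), Eq. (2) differs from Eq. (1) only by this multiplicative factor; a minimal sketch under that assumption:

```python
def mdi1(citing_focal, citing_refs):
    """mDI1 = m_t * DI1 (Eq. 2), reading m_t as the target's citation count N_F + N_B."""
    n_b = len(citing_focal & citing_refs)
    n_f = len(citing_focal - citing_refs)
    n_r = len(citing_refs - citing_focal)
    denom = n_f + n_b + n_r
    return (n_f + n_b) * (n_f - n_b) / denom if denom else None
```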

Threshold-based variants

A recognized challenge is the potential bias introduced by highly-cited references, which can inflate the NR and NB values. The following variants implement thresholds to mitigate this effect and reduce noise in the citation network10,11. Bornmann et al.10 developed \({{DI}}_{5}\), which, when calculating the index, counts a citing article toward \({N}_{B}^{5}\) only if it references at least five of the target article's cited references.

$${{DI}}_{5}=\frac{{N}_{F}-{N}_{B}^{5}}{{N}_{F}+{N}_{B}^{5}+{N}_{R}}$$
(3)

Additionally, Deng and Zeng11 suggested another disruption index, \({{DI}}_{{\rm{X}} \% }\), which excludes references that fall within the top X% of most-cited articles.

$${{DI}}_{{\rm{X}} \% }=\frac{{\bar{N}}_{F}-{\bar{N}}_{B}}{{\bar{N}}_{F}+{\bar{N}}_{B}+{N}_{R}}$$
(4)

Both DI5 and \({{DI}}_{{\rm{X}} \% }\) utilize thresholds to eliminate the noise caused by highly-cited references. However, these thresholds can be arbitrarily set, potentially introducing biases or subjectivity into the calculation. We acknowledge the inherent arbitrariness in selecting specific threshold values (e.g., 5 references/citations, top X%). Our choices are guided by established conventions in the literature to ensure comparability and robustness1,8,10,29,30. These thresholds represent a pragmatic compromise between mitigating noise from sparse data and maintaining a sufficiently large sample for analysis.
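The sketches below illustrate one reading of the two threshold variants: for DI5, a citer counts toward \({N}_{B}^{5}\) only if it co-cites at least five of the focal article's references and is otherwise treated as citing only the focal article; for \({{DI}}_{{\rm{X}} \% }\), the top X% most-cited references (here ranked by their citation counts within the local network) are dropped before recomputing the barred counts, while NR is left unbarred as in Eq. (4). The thresholds, ranking basis, and tie handling are illustrative assumptions.

```python
import math

def di_threshold(citing_focal, ref_citers, n_min=5):
    """DI_5-style variant (Eq. 3); ref_citers maps each focal reference -> set of its citers."""
    co_cited = {c: sum(c in citers for citers in ref_citers.values()) for c in citing_focal}
    n_b5 = sum(1 for k in co_cited.values() if k >= n_min)
    n_f = len(citing_focal) - n_b5
    all_ref_citers = set().union(*ref_citers.values()) if ref_citers else set()
    n_r = len(all_ref_citers - citing_focal)
    denom = n_f + n_b5 + n_r
    return (n_f - n_b5) / denom if denom else None

def di_x_percent(citing_focal, ref_citers, x=3.0):
    """DI_X% (Eq. 4): exclude the top X% most-cited references before recounting."""
    ranked = sorted(ref_citers, key=lambda r: len(ref_citers[r]), reverse=True)
    n_drop = math.ceil(len(ranked) * x / 100)
    kept_citers = set().union(*[ref_citers[r] for r in ranked[n_drop:]])
    n_b_bar = len(citing_focal & kept_citers)
    n_f_bar = len(citing_focal - kept_citers)
    all_ref_citers = set().union(*ref_citers.values()) if ref_citers else set()
    n_r = len(all_ref_citers - citing_focal)            # N_R stays unbarred, per Eq. (4)
    denom = n_f_bar + n_b_bar + n_r
    return (n_f_bar - n_b_bar) / denom if denom else None
```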

Reference-omitted measures

To avoid threshold arbitrariness and to clarify the role of NR, the following measures omit it entirely. Wu and Yan12 proposed the disruption index \({{DI}}^{{no}R}\), defined as:

$${{DI}}^{{no}R}=\frac{{N}_{F}-{N}_{B}}{{N}_{F}+{N}_{B}}$$
(5)

Bu et al.13 introduced \({MR}[{cited\_pub}]\), later termed the Dependency Index (DEP) by subsequent work14, defined as:

$${DEP}=\frac{{T}_{R}}{{N}_{F}+{N}_{B}}$$
(6)

Here, TR is the total number of shared references between the focal paper and its citing papers. A higher dependency index indicates a lower level of disruption. To align its interpretive direction with that of the disruption indexes, we adopt the inverse DEP (invDEP) in this study, following the approach of Bittmann et al.14. In more detail, invDEP is calculated by subtracting each DEP value from the sample maximum plus one.
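The reference-omitted measures only need the focal article's citers and, for DEP, the reference lists of those citers. A minimal sketch of Eqs. (5) and (6) and of the invDEP transformation described above; the data structures are illustrative.

```python
def di_no_r(citing_focal, citing_refs):
    """DI^{noR} (Eq. 5): NR is dropped from the denominator."""
    n_b = len(citing_focal & citing_refs)
    n_f = len(citing_focal - citing_refs)
    denom = n_f + n_b
    return (n_f - n_b) / denom if denom else None

def dep(citer_refs, focal_refs):
    """DEP (Eq. 6): T_R / (N_F + N_B), where T_R is the total number of references
    shared between the focal article and its citing articles.

    citer_refs: dict mapping each article citing the focal paper -> set of DOIs it cites.
    focal_refs: set of DOIs cited by the focal paper.
    """
    t_r = sum(len(refs & focal_refs) for refs in citer_refs.values())
    denom = len(citer_refs)  # every key cites the focal paper, so N_F + N_B = len(citer_refs)
    return t_r / denom if denom else None

def inv_dep(dep_values):
    """invDEP: subtract each DEP value from the sample maximum plus one."""
    cap = max(dep_values) + 1
    return [cap - v for v in dep_values]
```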

Shibayama and Wang15 introduced a refinement to disruption measurement by shifting the unit of analysis from article-level counts (e.g., NR) to link-level counts within the citing-reference bipartite network. Their base originality index, \({O{rig}}_{{base}}\), is defined as the proportion of non-links in this network, as shown in Eq. (7).

$$Ori{g}_{base}=1-\frac{1}{CR}\mathop{\sum }\limits_{c=1}^{C}\mathop{\sum }\limits_{r=1}^{R}{x}_{cr},\,{\rm{with}}\,{x}_{cr}=\left\{\begin{array}{ll}1 & {\rm{if}}\,c\,{\rm{cites}}\,r\\ 0 & {\rm{otherwise}}\end{array}\right.$$
(7)

Here, c denotes the citing articles that reference the focal paper, while r represents the references contained in the focal paper. The total counts of citing articles and references are denoted by capital letters C and R, respectively.
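Because \({O{rig}}_{{base}}\) is defined at the link level, it can be computed directly from the citer–reference bipartite network. A minimal sketch of Eq. (7) with illustrative inputs:

```python
def orig_base(citer_refs, focal_refs):
    """Orig_base (Eq. 7): proportion of non-links in the citer x reference bipartite network.

    citer_refs: dict mapping each article citing the focal paper -> set of DOIs it cites.
    focal_refs: set of DOIs cited by the focal paper.
    """
    c, r = len(citer_refs), len(focal_refs)
    if c == 0 or r == 0:
        return None
    links = sum(len(refs & focal_refs) for refs in citer_refs.values())
    return 1 - links / (c * r)

# Toy example: 2 citers, 3 focal references, 2 links in the bipartite network.
print(orig_base({"c1": {"r1"}, "c2": {"r1", "x"}}, {"r1", "r2", "r3"}))  # 1 - 2/6 = 0.667
```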

Moving beyond the unidimensional trade-off of \({{DI}}_{1}\), Chen et al.16 re-conceptualized technological evolution within a dual-view framework. A single index that forces a choice between disruption and consolidation cannot capture dual technologies that destabilize some predecessors while consolidating others. To operationalize this, they structurally adapted the tripartite network and defined two indices: D (Destabilization) and C (Consolidation).

$$D=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\frac{{N}_{F}^{i}}{{N}_{F}^{i}+{N}_{B}^{i}+{N}_{R}^{i}}$$
(8)
$$C=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\frac{{N}_{B}^{i}}{{N}_{F}^{i}+{N}_{B}^{i}+{N}_{R}^{i}}$$
(9)

Here, i denotes an arbitrary reference cited by the target article. \({N}_{F}^{i}\) measures the number of articles citing the target article but not reference i, \({N}_{B}^{i}\) measures the number of articles citing both the target article and reference i, \({N}_{R}^{i}\) measures the number of articles citing reference i but not the target article, and n is the total number of references in the target article.
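Both indices average a per-reference share over all n references of the target article; a minimal sketch of Eqs. (8)–(9):

```python
def destabilization_consolidation(citing_focal, ref_citers):
    """D and C (Eqs. 8-9), averaged over each reference i of the target article.

    citing_focal: DOIs of works citing the target article.
    ref_citers: dict mapping each reference i -> set of DOIs citing i.
    """
    n = len(ref_citers)
    if n == 0:
        return None, None
    d_total = c_total = 0.0
    for citers_i in ref_citers.values():
        n_b_i = len(citing_focal & citers_i)
        n_f_i = len(citing_focal - citers_i)
        n_r_i = len(citers_i - citing_focal)
        denom = n_f_i + n_b_i + n_r_i
        if denom:
            d_total += n_f_i / denom
            c_total += n_b_i / denom
    return d_total / n, c_total / n
```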

Data overview

Citation data across four research domains, namely Synthetic Biology, Astronomy and Astrophysics, Blockchain-based Information System Management, and the Socio-Economic Impacts of Biological Invasions, are compiled through the integration of three major bibliographic databases: the WoS, Dimensions, and OpenCitations. Table 1 summarizes, for each field–database combination, the number of target articles and the field-level averages of (i) references and (ii) citations.

Table 1 The average number of references and citations per article in four fields.

Data Records

The dataset is openly available on Figshare18 and is organized as a data lake encompassing four research fields: Synthetic Biology (ID = 1), Astronomy & Astrophysics (ID = 2), Blockchain-based Information System Management (ID = 3), and Socio-Economic Impacts of Biological Invasions (ID = 4). A consistent folder and file structure is used for each field, where the placeholder {ID} corresponds to the field number and {SOURCE} denotes the citation data source (viz., WoS, Dimensions, OpenCitations). The data is structured into five core components for each field (a minimal loading sketch follows the list):

  1. (1)

    Article list (doi/dois-{ID}.csv): This file contains two columns: doi and year. It lists all unique articles that constitute the citation network. A key methodological point is that only the publication years of citing articles are required for the disruption metric calculations. Consequently, the year field is guaranteed to be complete for all such articles but is intentionally left blank for cited references, as this information was not required for the analysis.

  2. (2)

    Citation edges (citations/citations-{ID}-{SOURCE}.csv): This tab-delimited file defines the directed citation relationships with two columns: cited_doi and citing_doi.

  3. (3)

    Target Articles (target/target-{ID}.csv): This single-column file (header: doi) specifies the focal articles for which disruption metrics are computed.

  4. (4)

    Results (result/results-{ID}-{SOURCE}.xlsx): This spreadsheet compiles the disruption indexes and metadata for each target article. It includes: metadata (doi, Publication Year, Y (years since publication), Source), Citation counts (NF, NB, NR, \({N}_{B}^{5}\), \({{N}_{F}}_{{new}}\), alias of \({\bar{N}}_{F}\), and \({{N}_{B}}_{{new}}\), alias of \({\bar{N}}_{B}\)), disruption indexes (\({{DI}}_{1}\), \({{\rm{m}}{DI}}_{1}\), \({{DI}}_{5}\), \({{DI}}^{{no}R}\), \({{DI}}_{3 \% }\), DEP, \({\rm{invDEP}},\) \({O{rig}}_{{base}}\), \({Destabilization}(D)\), \(C{onsolidation}(C)\)).

  5. (5)

    Multi-DOI consolidation (doi/dois-multi-{ID}.csv): No header; each line lists a group of normalized DOIs determined to refer to the same work; the first DOI is taken as the canonical identifier (subsequent DOIs are aliases).
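To illustrate how these components fit together, the following minimal Python sketch loads one field–source combination with pandas. Paths and column names follow the descriptions above, but the exact headers in the results spreadsheet may differ and should be checked against the files.

```python
import pandas as pd

FIELD_ID, SOURCE = 1, "WoS"  # Synthetic Biology, Web of Science

# (1) Article list: DOIs and (for citing articles) publication years.
articles = pd.read_csv(f"doi/dois-{FIELD_ID}.csv")

# (2) Citation edges: tab-delimited cited_doi / citing_doi pairs.
edges = pd.read_csv(f"citations/citations-{FIELD_ID}-{SOURCE}.csv", sep="\t")

# (3) Target articles and (4) their precomputed disruption indexes.
targets = pd.read_csv(f"target/target-{FIELD_ID}.csv")
results = pd.read_excel(f"result/results-{FIELD_ID}-{SOURCE}.xlsx")

# Example: references and citers of the first target article, from the edge list.
doi = targets["doi"].iloc[0]
refs = set(edges.loc[edges["citing_doi"] == doi, "cited_doi"])
citers = set(edges.loc[edges["cited_doi"] == doi, "citing_doi"])
print(doi, len(refs), len(citers))
```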

Technical Validation

To quality-control the dataset, we perform three validations: (i) multi-DOI consolidation accuracy (manual stratified audit), (ii) publication-year reconciliation accuracy, and (iii) reference-count consistency versus PDF ground truth; we additionally benchmark coverage overlap against Crossref to surface potential database-induced biases.

Validation of multiple DOI

Different databases frequently assign multiple DOIs to a single article, with Fig. 4 confirming that dual-DOI cases predominate. Manual verification of 100 stratified-sampled merged groups demonstrates 95% deduplication accuracy (cf. Table S1 in the Supplementary Information): 92 groups are correctly consolidated (including identical articles, same-series entries [IDs: 96, 98, 99], or article components [IDs: 91, 95]), while 5 groups are incorrectly merged (IDs: 20, 44, 60, 78, 94; primarily in the OpenCitations). Three groups (IDs: 66, 68, 71) contain unresolvable secondary DOIs but are validated as correct merges via metadata.

Fig. 4
figure 4

Number of articles with multiple DOIs.

Validation of publication year

When reconciling publication years across databases for identical articles, we observe significant year discrepancies. As shown in Fig. 5, the OpenCitations exhibits the highest rate of missing years (9,192 articles), followed by the WoS (97) and Dimensions (4). Figure 6 further reveals temporal patterns in non-missing but conflicting years, indicating that inter-database inconsistency increases over time. To validate our discrepancy-resolution strategy (adopting the maximum year), we perform stratified random sampling by year and manually verify 100 articles. Table S2 in the Supplementary Information shows 12 erroneous assignments within the sample, yielding 88% accuracy. Notably, the maximum-year approach introduces temporal bias, particularly for earlier articles.

Fig. 5
figure 5

Distribution of missing publication years across databases.

Fig. 6
figure 6

Number of articles with conflicting year information per year across the three databases.

Validation of references

Database-reported reference counts are benchmarked against ground truth extracted from the corresponding full-text PDFs (Figs. 7–10). Across all examined fields (Astronomy & Astrophysics, Blockchain-based Information System Management, Socio-Economic Impacts of Biological Invasions), database counts are consistently less than or equal to the actual reference counts. We argue that a primary reason is that many referenced articles are not assigned any DOI at all. Notably, the Dimensions demonstrates significantly higher reference coverage in emerging fields compared to other databases. Critically, in the field of Synthetic Biology, the reference counts reported by the OpenCitations or Dimensions substantially exceed the actual number of references (e.g., article IDs: 6, 22, 25, 26, 49, 64, 69, 76, 87). In our opinion, this over-counting primarily stems from two mechanisms:

  1. (1)

    DOI Redundancy: the OpenCitations (and the Dimensions, e.g., ID = 87) treat distinct DOI strings referencing the same underlying article as separate works. Take the article with DOI = https://doi.org/10.1128/jb.186.13.4276-4284.2004 as an example: its references https://doi.org/10.1128/mmbr.40.3.722-756.1976 and https://doi.org/10.1128/br.40.3.722-756.1976 (resolving to identical content) are recorded as two distinct citation relationships (IDs: 6, 49, 64, 69, 76, 87).

  2. (2)

    Extraneous References: Both the OpenCitations and Dimensions include references demonstrably absent from the source article's reference list and unrelated to its content (e.g., IDs: 22, 25, 26). A representative example is the inclusion of the reference with DOI = https://doi.org/10.1152/ajpcell.1997.273.1.c7 for the article with DOI = https://doi.org/10.1128/jb.186.13.4276-4284.2004, although no actual citation relationship exists between them.

Fig. 7
figure 7

Validation of database-derived reference counts in Synthetic Biology field.

Fig. 8
figure 8

Validation of database-derived reference counts in Astronomy & Astrophysics field.

Fig. 9
figure 9

Validation of database-derived reference counts in Blockchain-based Information System Management field.

Fig. 10
figure 10

Validation of database-derived reference counts in Socio-Economic Impacts of Biological Invasions field.

Crossref benchmark comparison

The Crossref is chosen as the external validation reference, reflecting its role as the primary DOI registration authority31. Using the 2025 Crossref Public Data File32 (with citations up to 2023), we validate the coverage and completeness of our collected data. Our comparative analysis finds that the Dimensions provides the most comprehensive citation coverage, followed by the WoS, with the OpenCitations exhibiting the lowest coverage. However, surprisingly, in the Synthetic Biology field the Crossref's coverage is lower than that of the WoS, primarily due to systematic omissions of references in certain key articles.

Figures 11–14 further illustrate the distribution and overlap of records across the WoS, Dimensions, OpenCitations and Crossref in four fields. Each figure utilizes an UpSet plot to display both the total number of citations indexed by each database (left bars) and the size of intersections among them (top bars). For instance, in Fig. 14, the Dimensions contains 2,260,464 citations and the WoS includes 2,025,014 citations, with 165,306 citations shared exclusively between the WoS and Dimensions, absent from the OpenCitations and Crossref. Notably, a substantial number of records are shared among all four databases, indicating considerable overlap in core articles. These findings highlight the necessity of integrating multiple data sources for comprehensive bibliometric analysis and emphasize the risk of coverage bias when relying on a single database.
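For readers who wish to reproduce this style of comparison on the released edge lists, a minimal sketch using the third-party upsetplot and matplotlib packages is given below; the toy edge sets stand in for the per-database citation edges and are purely illustrative.

```python
import matplotlib.pyplot as plt
from upsetplot import UpSet, from_contents

# Hypothetical per-database sets of citation edges ("cited -> citing" strings);
# replace with the edge lists from citations-{ID}-{SOURCE}.csv and the Crossref dump.
contents = {
    "WoS":           {"a->x", "a->y", "b->x"},
    "Dimensions":    {"a->x", "a->y", "b->x", "c->z"},
    "OpenCitations": {"a->x"},
    "Crossref":      {"a->x", "b->x", "c->z"},
}

membership = from_contents(contents)           # one row per unique edge, indexed by membership
UpSet(membership, subset_size="count").plot()  # left bars: totals; top bars: intersection sizes
plt.show()
```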

Fig. 11
figure 11

Comparative analysis of citation overlaps among major bibliographic databases in Synthetic Biology field. Left bars show total citation counts per database. Filled circles in the matrix indicate database combinations, and top bars represent intersection sizes.

Fig. 12
figure 12

Comparative analysis of citation overlaps among major bibliographic databases in Astronomy & Astrophysics field. Left bars show total citation counts per database. Filled circles in the matrix indicate database combinations, and top bars represent intersection sizes.

Fig. 13
figure 13

Comparative analysis of citation overlaps among major bibliographic databases in Blockchain-based Information System Management field. Left bars show total citation counts per database. Filled circles in the matrix indicate database combinations, and top bars represent intersection sizes.

Fig. 14
figure 14

Comparative analysis of citation overlaps among major bibliographic databases in Socio-Economic Impacts of Biological Invasions field. Left bars show total citation counts per database. Filled circles in the matrix indicate database combinations, and top bars represent intersection sizes.

Usage Notes

Our CrossDI dataset supports a variety of research applications. First, it enables systematic analysis of the properties and dynamic evolution of disruption indexes over time. Annual DI values across multiple variants allow researchers to examine trends and patterns in scientific disruption. Second, the dataset covers both established and emerging fields, providing a basis for investigating disciplinary differences in terms of disruption indexes. Third, by combining citation data from multiple sources (WoS, Dimensions, and OpenCitations), the dataset allows users to assess how database characteristics and data coverage affect the calculation and interpretation of disruption indexes. Furthermore, standardized intermediate variables are included to help identify sources of methodological bias, such as time window selection or discipline-specific citation practices. This facilitates sensitivity analyses and supports the development of more robust and comparable DI measures. Moving beyond these core applications, the dataset uniquely empowers several advanced research avenues: the structured citation networks facilitate sophisticated citation network mining to model the propagation of disruptive ideas; the comprehensive suite of indicators and their temporal evolution provides a rich feature set for predictive modeling using machine learning; and these standardized metrics enable systematic anomaly detection by revealing scholarly outliers with atypical disruption trajectories. Overall, the dataset offers a harmonized and flexible benchmark for exploring the disruptive nature of scholarly works across disciplines and data infrastructures.