Abstract
We present a multidimensional dataset describing the research productivity of 21st-century Nobel Laureates in Physiology or Medicine from 2000 to 2023, focusing on their publications, patents, retractions, and funding before receiving the Nobel Prize. Examining the research outputs of eminent scientists offers a valuable resource for understanding patterns of productivity and collaboration that may have contributed to impactful scientific advancements. This dataset was created by conducting automated and manual searches of the internet using a variety of publicly available sources, including but not limited to the nobelprize.org website, PubMed, university web and profile pages, the United States Patent and Trademark Office, the NIH RePORTER database, Retraction Watch, and Clarivate. Each entry was meticulously matched to the laureate by cross-checking the above sources, collaborators, content, and production dates. Our unique dataset comprises 12,943 publications, 940 US patents, 17 retractions, and 2,093 peer-reviewed NIH awards produced before winning the Nobel Prize. The data provide multiple descriptors for practical purposes such as research productivity comparisons, national grant program design, or research policy development.
Background & Summary
An in-depth understanding of best practices is essential for advancing biomedical research. Nobel Laureates are notable due to their esteemed reputation and influential role in research, significantly advancing scientific knowledge as a whole1. While the number of Nobel laureates is small, studies have observed that some received training under prior laureates, indicating possible mentorship patterns or institutional support systems. However, the transferability of research practices remains an open question2.
Comprehensive and consistent data on transferable skills and optimal research methods used by Nobel laureates are scarce. Several studies have used citation methods to analyze Nobel laureates’ publications by citation counts3,4,5,6,7,8,9,10,11,12,13,14,15,16. However, only a small number of laureates are among the most frequently cited researchers3,8,10. Li et al. determined that Nobel laureates share similarities with other researchers, except for their award-winning contributions14,17. Several often-referenced studies on Nobel laureates are outdated, no longer relevant, and, in certain instances, outright inaccurate (for example, regarding the correlation between citation numbers and Nobel Prize-winning research). Most research studies on Nobel laureates have relied on publications as the measure of research productivity, covering output both before and after receipt of the Nobel Prize17. Publications alone are also unlikely to illuminate the wide-ranging creativity needed for productive research, particularly for Nobel Prize-winning discoveries. A comprehensive examination of Nobel laureates’ research productivity before winning the Prize can provide insights into the breadth and nature of their scientific contributions, offering a resource for further study of excellence in research.
Data-driven studies should provide the most meaningful interpretations of Nobel Prize-winning research. However, there is a scarcity of thorough assessments that are both data-driven and focused on productivity and collaboration. Multiple studies have produced datasets that contain information about Nobel laureates. Li et al. produced an open-access comprehensive dataset of publications by laureates in physics, chemistry, and physiology or medicine from 1900–201617. Although that dataset was created using Microsoft Academic Graph (MAG), which has since been discontinued, it remains accessible and could potentially be updated using alternative public sources such as the Nobel Prize website, institutional homepages, and Wikipedia. Amin and Wani created a dataset of diverse details of publications by Nobel Laureates in Physiology or Medicine from 2005 to 2008 using the Web of Science database3. However, it covers only 4 years of publication productivity. Liang et al. produced a publication impact dataset from the Web of Science for laureates in Physiology or Medicine from 1901–201718. However, they collected only publications and focused mainly on “prize-winning papers.” Jones et al. created a dataset on age dynamics19, and Fortunato analyzed birth dates20. In an analysis of collaboration patterns, Chan et al. developed a dataset of 34,448 Scopus-registered publications by laureates in Chemistry, Physics, and Physiology or Medicine from 1970 to 200021,22. Nevertheless, the records were drawn only from Scopus, which requires a subscription, and only publications were collected.
Presently, there is an absence of a comprehensive dataset containing academic publications, patents, and funding focused on analyzing 21st-century Nobel laureates’ productivity and collaborative efforts. Using novel data sources makes it feasible to gather information regarding the professional trajectories of Nobel laureates, enabling the development of an all-encompassing dataset that uncovers patterns of productivity and collaboration.
There is a particular shortage of studies that examine multiple Nobel laureates and their non-publication outputs before winning the Prize. A patenting study by Azoulay et al. determined that in medical school faculty members, patenting leads to an increase in the rate of scientific production without compromising its quality23. Another study revealed that laureates are more likely to patent their work than the general scientific population24. However, few data-driven studies have analyzed Nobel laureates and specified their patenting activities. Regarding NIH funding and the receipt of grants, one study showed that the average age at which researchers received their first National Institutes of Health (NIH) funding was higher than when most Nobel laureates completed their groundbreaking work25. Another study determined the degree distribution and NIH grant funding, but neither looked at Nobel laureates nor their NIH funding or grant attainment26. Nonetheless, the factors of patenting and funding grants are often overlooked when considering research productivity. Gaining insights from established research methods can benefit project design and research implementation.
This present dataset responds to the need for more pragmatic and multidimensional descriptors. Specifically, pragmatic measures are those that (i) can be employed as performance evaluation metrics in research and (ii) are mainly under the directional control of every individual researcher.
Methods
Selection of researchers
To understand the research productivity patterns of elite 21st-century scientists, we created a dataset of publications, patents, and funding awards for Nobel Laureates in Physiology or Medicine from 2000 to 2023. Only research output produced by the Nobel laureates prior to receiving the Prize was included in the dataset. The primary objective was to capture a definitive snapshot of the productivity and scholarly contributions that led to Nobel-worthy discoveries. While post-award research activity is undoubtedly interesting to many scholars, several factors informed our decision to limit the dataset to pre-award outputs. First, the Nobel Prize serves as a natural and finite endpoint, ensuring the dataset remains complete and stable over time. Including post-award outputs would necessitate ongoing updates, diminishing the dataset’s utility as a fixed and replicable resource. Second, the current dataset underwent an extensive and time-intensive validation process to ensure accuracy; replicating this process for post-award work would require considerable additional effort. Finally, research conducted after receiving the Nobel Prize is often shaped by external factors27 such as increased visibility28, resources, and collaborations—effects that can confound efforts to understand the original patterns of productivity and collaboration that led to the award itself. For these reasons, we focused exclusively on pre-Nobel data to preserve the dataset’s analytical clarity and historical relevance.
Data sources
Publications
Publication lists for each laureate were compiled and curated via multiple publicly available sources. Initial sources included the Nobel Prize website29, PubMed30, Google Scholar31, affiliation and profile pages, personal web pages, bio sketches, the Wellcome Library Cold Spring Harbor Laboratory Archives32, and other original sources (e.g., news articles). Individual publications and affiliations were verified using personal and institutional web pages, many of which are source URLs in the dataset’s metadata sheets.
Patents
Patent lists for each laureate were compiled from the United States Patent and Trademark Office (USPTO)33 and the World Intellectual Property Organization (WIPO)34.
Funding awards
The list of funding awards for each laureate was compiled from the NIH RePORTER database35 and the European Research Council (ERC) database36.
Journal impact factor
The journal impact factors were collected from a paywalled source, Journal Citation Reports, Clarivate Analytics37.
Retractions
Publication retractions were collected from the Retraction Watch Database38. See Fig. 1 for an overview of the multidimensional dataset creation process, including the source databases (e.g., PubMed, USPTO, NIH RePORTER, ERC, Retraction Watch, and nobelprize.org) and the cross-verification methods used to curate publications, patents, and funding records.
Fig. 1 Overview of the multidimensional dataset creation process, including source databases and methods used for cross-verification and curation of publications, patents, and award funding records.
Data collection and curation
Publications
Initially, PubMed30 was searched by each laureate’s last name and first initial to build the initial publication list. This list was then curated against the alternative sources above to verify its completeness and assemble a curated dataset of publications for each laureate. The validity of each laureate’s publications was determined using multiple sources, including the researchers’ résumés (which are not always up to date), profile pages (which may list only a few publications), PubMed30 links from the researcher’s page or profile, Google Scholar31 (although not every researcher has a verified profile), and laboratory websites. We cross-checked the publications from individual sources against the PubMed list, removed duplicates, and verified the remaining publications. During the data curation process, personal and institutional web pages were used extensively to verify authorship, affiliations, publication lists, and related information for each laureate. These sources were not treated as formal datasets but served as supplementary verification tools. Because of their individualized and often transient nature, these web pages are not included in the formal reference list; instead, their URLs are recorded in the relevant fields of the dataset’s metadata to ensure transparency and traceability.
When verifying each publication, the author’s name and spelling were confirmed to match their Nobel Biography29, and the title was compared to their résumé or published list. If no résumé or publication list was available, the publications were verified against the laureate’s personal webpage, institutional affiliation, content, and collaborators. Each publication was verified similarly: confirmation of the author’s name and spelling, matching the title to their résumé or published list, confirming that the date of publication fell between the year of their graduate degree and the year of receipt of the Nobel Prize, and confirmation of matching affiliation and subject matter. If, at this point, the authorship of the publication was still ambiguous, the coauthor list was analyzed to check whether a coauthor was listed on another verified publication or within the Nobel Biography.
Additionally, publications cited in the Nobel lecture, the coauthor names in the Nobel interview and biography sections of the nobelprize.org website29, and Wellcome Library archives in the Cold Spring Harbor Laboratory Archives32 were also used to help verify if a particular laureate wrote a publication. Individual PDFs available through Google Scholar, PubMed, and the Internet were also searched to verify authors’ names and institutions. If these factors matched, it was deemed a verified publication; if the factors did not match, the publication was discarded.
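The verification rules described above can be sketched in code. This is a minimal illustration only, not the authors' actual tooling, which the text describes as largely manual: the record types, field names, and the inclusive treatment of the year boundaries are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Laureate:
    # Hypothetical record; field names are illustrative, not dataset headers.
    surname: str
    first_initial: str
    terminal_degree_year: int
    nobel_year: int
    known_coauthors: frozenset = frozenset()


@dataclass(frozen=True)
class Candidate:
    # Hypothetical candidate publication returned by a name search.
    author_surname: str
    author_first_initial: str
    year: int
    on_cv_or_publication_list: bool
    coauthors: frozenset = frozenset()


def in_academic_window(pub: Candidate, laureate: Laureate) -> bool:
    """True if the publication year lies between the terminal degree year
    and the Nobel Prize year (inclusive bounds are an assumption)."""
    return laureate.terminal_degree_year <= pub.year <= laureate.nobel_year


def is_verified(pub: Candidate, laureate: Laureate) -> bool:
    """Apply the name, date-window, and list or coauthor checks in order."""
    name_ok = (
        pub.author_surname.lower() == laureate.surname.lower()
        and pub.author_first_initial.upper() == laureate.first_initial.upper()
    )
    if not name_ok or not in_academic_window(pub, laureate):
        return False
    if pub.on_cv_or_publication_list:
        return True
    # Fallback for ambiguous cases: a coauthor already seen on verified work.
    return bool(pub.coauthors & laureate.known_coauthors)
```

Affiliation and subject-matter checks are omitted from the sketch because they require human judgment rather than a simple comparison.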
Furthermore, publications were analyzed to determine the type of article. They were retained if they were an actual journal article; however, if the publication was an editorial, a published erratum, a comment, an interview, or another non-research article, they were not included in the scientific works. A total of 43,412 publication entries were retrieved from PubMed, and 30,469 (70.2%) were excluded during the curation process due to issues such as incorrect authorship, insufficient verification across independent sources, or classification as non-research content (e.g., editorials, interviews). The exclusion rate was even higher for Google Scholar, where 70,624 entries were obtained, and 57,681 entries—representing 81.7% of the initial results—were excluded based on the same verification and content criteria. In addition to the Title, Date of publication, Journal, and Journal Impact Factor, we included the coauthors for each publication.
The median number of publications for 21st-century Nobel Laureates in Physiology or Medicine (2000–2023) was 182 (min 14, max 1066).
Patents
Two databases were searched for the laureates’ pre-Nobel patents: the USPTO database33, which contains searchable patents from 1976 to 2023, and the European Patent Office (EPO) database, accessed via the World Intellectual Property Organization (WIPO)34. Both were manually searched for each laureate by last name and first name or first initial. We manually curated the patents from those results by verifying the first name, affiliation, location, and subject matter in the abstract. We produced a complete list of patents, including patent titles, numbers, filing dates, and co-inventors. Additional curation ensured the invention date fell after the laureate’s terminal degree but before receipt of the Nobel Prize. A total of 5,457 patent entries were retrieved from the USPTO. Of these, 4,517 entries (82.8%) were excluded during the curation process due to incorrect inventor attribution or because they fell outside the academic time frame defined for each laureate. Similarly, 2,449 patent entries were retrieved from the EPO via the WIPO. After verification, 1,542 entries (63.0%) were excluded based on the same criteria. The final dataset includes a curated and validated list of patents attributed to the laureates before they received the Nobel Prize. Among 21st-century Nobel Laureates in Physiology or Medicine (2000–2023), the median number of validated patents was 6 (min 0, max 161).
NIH Awards
The NIH RePORTER website35, which contains award funding data from 1985 onward, was manually searched to determine the number and value of NIH awards for each laureate. The database was searched by last name and first name to create a complete list of NIH awards, including title, amount, fiscal year, and co-investigators. This list was then curated based on first name, middle initial, PI number, project date, subject matter, co-investigators, and the institution where the project occurred. If these factors matched, the project was determined to be the laureate’s. If not, it was discarded. Not all Nobel laureates received funding through NIH, so only those who did are listed in the NIH Award Excel datasheet. Information regarding the awards retrieved from the NIH RePORTER35 included the primary investigator’s (PI) name, PI number, project titles, project numbers, financial data, and co-investigators. Some cost fields in the dataset are blank because not all funding agencies provide complete funding data. According to the FAQs from NIH RePORT39, “Costs are only available for projects funded by NIH, CDC, FDA, and ACF.”
Of the 12,400 NIH award entries retrieved from the NIH RePORTER database, 10,307 entries (83.1%) were excluded during the curation process due to incorrect authorship or because the project dates fell outside the laureates’ eligible academic timeframes. The final curated dataset includes 2,093 verified NIH awards attributed to the laureates before they received the Nobel Prize. Among 21st-century Nobel Laureates in Physiology or Medicine (2000–2023), the median number of NIH awards was 18.5 (min 0, max 190).
We also searched the European Research Council (ERC) database36 for comparable data. The ERC search yielded 16 entries for five laureates. Following the same rigorous curation protocol, four entries (25%) were excluded due to incorrect attribution or ineligibility based on academic age.
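As a consistency check, the retrieval and exclusion counts reported in this section can be recomputed directly. The short sketch below uses only figures quoted in the text; the dictionary layout and function name are illustrative choices.

```python
# (entries retrieved, entries excluded) as reported in the Methods text.
FUNNEL = {
    "PubMed": (43_412, 30_469),
    "Google Scholar": (70_624, 57_681),
    "USPTO": (5_457, 4_517),
    "EPO via WIPO": (2_449, 1_542),
    "NIH RePORTER": (12_400, 10_307),
    "ERC": (16, 4),
}


def summarize(retrieved: int, excluded: int) -> tuple[int, float]:
    """Return (entries retained, percent excluded rounded to one decimal)."""
    retained = retrieved - excluded
    pct_excluded = round(100 * excluded / retrieved, 1)
    return retained, pct_excluded


for source, (retrieved, excluded) in FUNNEL.items():
    retained, pct = summarize(retrieved, excluded)
    print(f"{source}: {retained:,} retained, {pct}% excluded")
```

Both publication searches reduce to the same 12,943 curated publications, and the patent and award searches reduce to 940 USPTO patents, 907 WIPO patents, 2,093 NIH awards, and 12 ERC awards.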
Journal impact factor
Based on the curated list of publications, we compiled a list of journals and their 2023 (or closest available) Journal Impact Factors (JIF) for each laureate. Using a single reference year supports comparisons across laureates, much like converting all funding amounts into 2023 U.S. dollars40. JIFs for journals not included in the InCites report37 were determined by web search, and journals that were too new or had no impact factor were assigned a JIF of zero. JIFs were incorporated into the publication data sheets. It is important to note that journals have changed considerably over the years as publishers merged, journals split into more specific topics, and titles were launched or discontinued altogether. An excellent example of this merging and splitting is the journal Physical Review, whose genealogy can be seen on the Physical Review Wikipedia page41. Physical Review originated in 1893, was renamed in 1913, and split into 14 journals between 1970 and the present. Assigning a single impact factor to a journal with such a varied lineage is next to impossible. In such cases, we assigned the publication to the journal in that lineage closest to the publication date and used that journal’s impact factor. The median JIF for 21st-century Nobel Laureates in Physiology or Medicine (2000–2023) was 12.83 (min 3.17, max 39.46).
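The zero-default rule for journals without an impact factor, and the roll-up from publications to a per-laureate average, can be illustrated as follows. This is a sketch under stated assumptions: the journal names and JIF values are invented placeholders, and treating the per-laureate "Average Journal Impact Factor" as an arithmetic mean is an assumption not confirmed by the text.

```python
from statistics import mean, median


def jif_for(journal: str, jif_table: dict[str, float]) -> float:
    """Look up a journal's JIF; journals with no listed JIF get zero."""
    return jif_table.get(journal, 0.0)


def average_jif(journals: list[str], jif_table: dict[str, float]) -> float:
    """Mean JIF over one laureate's publications (mean is an assumption)."""
    return mean(jif_for(j, jif_table) for j in journals)


# Placeholder values for illustration only; not taken from the dataset.
example_table = {"Journal A": 10.0, "Journal B": 4.0}
laureate_1 = average_jif(["Journal A", "Journal B", "New Journal"], example_table)
laureate_2 = average_jif(["Journal A", "Journal A"], example_table)

# Cohort-level summary reported in the text is a median across laureates.
cohort_median = median([laureate_1, laureate_2])
```

Because zero-JIF journals enter the average, a laureate who published often in very new journals will show a lower average than the listed journals alone would suggest.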
Retractions
The Retraction Watch Database38 was systematically searched for each Nobel laureate by first and last name. Retractions in the database were verified against each laureate’s list of publications. If the laureate was found to have a retraction, they were included in the Retracted Papers sheet. See Fig. 2 for a visual representation of the workflow used for data collection, exclusion, verification, and curation of publication, patent and award records, including assignment of Journal Impact Factors.
Fig. 2 Workflow illustrating the process of data collection for all research products, as well as exclusion, validation, curation, matching, and designation of Journal Impact Factors across curated publication records.
Data Records
The dataset is available at Augusta University Scholarly Commons42, https://doi.org/10675.2/625578. We created an open-access dataset based on the productivity of Nobel Laureates in Physiology or Medicine from 2000 to 2023. The dataset contains 11 Excel Worksheets in a workbook named Nobel Laureates 2000–2023 Multidimensional Research Productivity Dataset. All column headers in the dataset have been designed to be self-explanatory. However, a few terms may require clarification. For instance, Total Academic Age refers to the number of years of active research between the recipient’s terminal degree and the year they were awarded the Nobel Prize. Additional column headers that may not be immediately intuitive are defined in the “Read Me First” sheet for the data set and below in the description of each sheet. The workbook starts with a Read Me First sheet that includes a description of each workbook sheet as follows:
Demographics
This sheet contains the demographics of the Nobel Laureates in Physiology or Medicine from 2000 to 2023. This sheet includes the columns: Nobel Laureate, Gender, Country, Year of Birth, Year of Terminal Degree, Year of Nobel Prize, Age at Nobel Prize, Terminal Degree-Single Medical, Terminal Degree-Single Scientific, Terminal Degree-Dual Degree, Terminal Degree-Other (e.g. masters degree), Academic Age at Nobel Prize (The number of years of active research between the year of earning their terminal degree and the year of winning the Nobel Prize.), and Reason for Nobel Prize. All demographics were sourced from the official Nobel Prize website29 or provided links.
All productivity
This sheet contains the multidimensional pre-award productivity of the Nobel Laureates in Physiology or Medicine from 2000 to 2023. This sheet includes columns for Nobel Laureate, Gender, Country, Year of Nobel Prize, Year of Last Terminal Degree, Total Academic Age (The number of years of active research between the year of earning their terminal degree and the year of winning the Nobel Prize.), Number of Publications, Average Journal Impact Factor, Number of U.S. Patents, Number of WIPO Patents, Number of NIH Awards, Dollar Value of NIH awards, Number of European Research Council (ERC) Awards, and the ERC Contribution in Euros. The results were sourced from within the dataset itself.
Publications
This sheet lists all the research publications produced by the 21st-century Nobel Laureates in Physiology or Medicine before they won the Nobel Prize. This sheet includes the columns: Nobel Laureate, Year of Terminal Degree, Year of Nobel Prize, Total Academic Age, Academic Age at Publication, Publication Year, Title, Journal, Journal Impact Factor, Number of Coauthors, 1st Author, 2nd Author, 3rd Author,… 260th Author. The source of the publications is from the nobelprize.org website29, resumes, PubMed30, Google Scholar31, affiliation websites and profile pages, personal web pages, Biosketch, the Wellcome Library Cold Spring Harbor Laboratory Archives32, and other official sources (e.g., news articles).
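The age-related columns follow from simple year arithmetic. The sketch below restates the definition of Total Academic Age given in the text; the formula for Academic Age at Publication is an assumption made by analogy, since the text does not define that column explicitly, and the function names are illustrative.

```python
def total_academic_age(terminal_degree_year: int, nobel_year: int) -> int:
    """Years between the terminal degree and the Nobel Prize."""
    return nobel_year - terminal_degree_year


def academic_age_at_publication(terminal_degree_year: int, publication_year: int) -> int:
    """Years between the terminal degree and a publication (assumed definition)."""
    return publication_year - terminal_degree_year


# Hypothetical laureate: terminal degree in 1975, Nobel Prize in 2010.
career_length = total_academic_age(1975, 2010)
age_at_paper = academic_age_at_publication(1975, 1990)
```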
USPTO patents
This sheet lists all the research patents the 21st-century Nobel Laureates in Physiology or Medicine produced before they won the Nobel Prize. This sheet includes columns with Nobel Laureate, Year of Terminal Degree, Year of Nobel Prize, Date Patent Filed, Year Patent Filed, Academic Age at Patent filing, Years from Nobel Prize Date, Patent Number, Patent Title, Number of Co-inventors, Inventor, Co-Inventor 1, Co-Inventor 2,… Co-Inventor 21. The patent source for each laureate was the United States Patent and Trademark Office (USPTO)33.
WIPO patents
This sheet lists the intellectual property produced by the Nobel Laureates in Physiology or Medicine from 2000 to 2023 before they won the Nobel Prize, sourced through the World Intellectual Property Organization (WIPO)34. This sheet includes columns with Nobel Laureate, Country, Year of Terminal Degree, Year of Nobel Prize, Title, Inventors, Applicants, Publication Number, Earliest Priority, IPC, CPC, Publication Date, Earliest Publication, and Family Number. For a glossary of terms, see https://worldwide.espacenet.com/patent/help/espacenet-glossary.
NIH awards
This sheet lists all the NIH Awards granted to the Nobel Laureates in Physiology or Medicine from 2000 to 2023 before they won the Nobel Prize. This sheet includes columns with Nobel Laureate, Gender, Country, Year of Birth, Year of Terminal Degree, Year of Nobel prize, Age at Nobel Prize, NIH Spending Categorization, Project Terms, Project Title, Public Health Relevance, Administering IC, Application ID, Award Notice Date, Opportunity Number, Project Number, Type, Activity I C, Serial Number, Support Year, Suffix, Program Official Information, Project Start Date, Project End Date, Study Section, Subproject Number, Contact PI, Person ID, Contact PI/Project Leader, Other PI or Project Leader(s), Congressional District, Department, Primary DUNS, Primary UEI, DUNS Number, UEI, FIPS, Latitude, Longitude, Organization ID (IPF), Organization Name, Organization City, Organization State, Organization Type, Organization Zip, Organization Country, ARRA Indicator, Budget Start Date, Budget End Date, CFDA Code, Funding Mechanism, Fiscal Year, Total Cost, Total Cost (Sub Projects), Total Cost Plus Sub Projects, Funding IC(s), Direct Cost IC, Indirect Cost IC, NIH COVID-19 Response, Project Abstract, Total Cost IC. The source for the NIH awards came directly from the NIH RePORTER database35. All fields were retained to ensure completeness and maintain alignment with the NIH RePORTER metadata. For a definition of terms, see https://report.nih.gov/exporter-data-dictionary.
ERC Awards
This sheet lists all the European Research Council (ERC) Awards granted to the Nobel Laureates in Physiology or Medicine from 2000 to 2023 before they won the Nobel Prize. This sheet includes columns with Nobel Laureate, Country, Year of Nobel prize, Year of Terminal Degree, Total Academic Age, Year of ERC Award, First Name, Last Name, Proposal Title, Call Number, Contribution Awarded, and Currency Code. The source for the ERC awards came directly from the ERC database36.
Retraction list
This sheet contains a list of the Nobel Laureates in Physiology or Medicine from 2000 to 2023 and whether they had a retracted publication as of November 2024. This sheet includes columns with Nobel Laureate, Number of Nobel Publications Retracted, Year of Prize, Title of Paper, Date Published, Year of Retraction, and Reasons for Retractions. The source for the retractions is the Retraction Watch Database38.
Retracted papers
This sheet lists the Nobel Laureates in Physiology or Medicine from 2000 to 2023 with a retracted paper. This sheet includes the columns: Laureate with Retractions, Year of the Prize, Title of the Paper, Date Published, Year of Retraction, Reason for Retraction 1–8, and Authors as listed 1–22. The source for the retractions comes from the Retraction Watch Database38.
Resource links
This sheet contains direct URLs used to verify the productivity data associated with each Nobel Laureate. While the majority of information was obtained from the Nobel Prize website, additional verification was conducted using publicly available institutional profiles, lab websites, personal webpages, archived CVs, and publication listings. These individualized sources were used solely for manual validation of publication authorship, affiliations, and grant attribution. Given their large number, variable format, and ephemeral nature, they are not included in the manuscript’s formal reference list. However, this sheet documents them transparently to ensure traceability and reproducibility for future researchers.
In summary, the resulting database includes 940 USPTO patents, 907 WIPO patents, 2,093 NIH awards, 12 ERC awards, 12,943 research publications, and 17 retractions produced by 21st-century Nobel Laureates in Physiology or Medicine (2000–2023) prior to winning the Nobel Prize.
Limitations
The current dataset emphasizes U.S.-based sources of research productivity, including NIH funding; this U.S. focus reflects both the availability and accessibility of reliable data and the predominance of U.S.-affiliated laureates in Physiology or Medicine during the 2000–2023 period.
We acknowledge this geographic concentration as a limitation. We are actively working toward expanding the dataset to incorporate other funding agencies. Future versions of the dataset will aim for a more comprehensive global representation while maintaining high standards for data verification and consistency.
This dataset focuses primarily on funding data from the U.S. National Institutes of Health (NIH). We acknowledge the value of other international funding agencies, including the European Research Council (ERC). Several key factors informed our decision to limit the scope to NIH and ERC funding:
1. Documented Funding Impact: Prior research has identified the NIH and NSF as the most prominent funding sources for Nobel-winning research. A study by Tatsioni, Vavva, and Ioannidis (2010) analyzing Nobel funding from 2000 to 2008 found that although 64 distinct funding sources were involved, the NIH and NSF stood out as primary contributors to prize-winning work43.
2. Scope and Availability: The NIH is the largest public funder of biomedical and behavioral research globally and provides a well-structured, electronically accessible database (NIH RePORTER) with award-level data available dating back to fiscal year 1985; the ERC dates back to 2007. Using the NIH and ERC datasets ensured a consistent, high-quality source for comprehensive data extraction and verification.
3. Relevance to Nobel laureates: Since 2000, 32 recipients of the Nobel Prize in Physiology or Medicine were NIH grantees, compared to only four grantees of the ERC. This difference underscores the NIH’s outsized impact within this scientific cohort.
4. Demographic Representation: Of the 58 laureates included in this dataset, 31 are affiliated with U.S.-based institutions, while 19 are from countries eligible for ERC support. This demographic distribution further justifies prioritizing NIH funding, given its higher coverage within the sample.
Technical Validation
Publications
For technical validation of publications and building the multi-source, comprehensive dataset, we used several sources to cross-check the publications’ authorship manually. These sources included the laureate’s affiliation profiles and lab websites, PubMed searches, ORCID records, internet searches for actual résumés, links through the Howard Hughes Medical Institute (HHMI) (https://www.hhmi.org/search), Europe PubMed Central (Europe PMC) (https://europepmc.org/), Google Scholar, and Google Scholar profiles. Table 1 provides the percentages of laureates covered by each of the above-listed sources.
It is important to note that as laureates make career moves or pass away, the information posted on their lab sites often disappears, returning a 404 error, and is no longer available. Not all publication lists were accurate. For example, one publication listed on both a laureate’s CV and his institutional publication list was dated before the laureate was born, and its subject matter was entirely different; we determined that this publication was not the laureate’s and removed it from the list. Another laureate had an institutional profile page on which several publications were not theirs, as the middle initial and content differed.
For 17% of the laureates (10 out of 58), no additional sources, such as a résumé or laboratory website, were available to verify their publication lists. We found four laureates with publicly accessible résumés and a complete list of their publications: Sidney Brenner, Edvard Moser, May-Britt Moser, and John Sulston. Brenner’s résumé was found in the archives through the Wellcome Library and is available at Cold Spring Harbor Laboratory and Archives as part of the Codebreaker: Makers of Modern Genetics collection; the Mosers’ and Sulston’s résumés were found online. We have included a sheet in the data collection that contains all the laureates with the various links we used to verify their work; it is unknown how long these links will remain available.
When comparing publication productivity for laureates from our multi-source, comprehensive dataset to Li’s MAG Dataset17, we found that our dataset typically contained more publications for each laureate. The MAG dataset was primarily based on automated Google Scholar searches, whereas ours was predominantly focused on PubMed; for medical literature, our PubMed search was superior in that we did not exclude publications as readily as the MAG dataset. In comparison to the MAG dataset, we found more legitimate publications in 78% (32/41) of the comparisons; in 12% (5/41), we found fewer publications, and only 10% of our comparisons matched (4/41). Interestingly, upon closer analysis of two matched comparisons where the total number of publications was equal, we found that the MAG dataset missed 50 publications while also attributing publication status to 46 non-publications and four incorrect author publications.
To further determine which source provided the most complete and accurate information, we examined 12 Physiology or Medicine laureates for whom multiple sources were available, including a complete resume, ORCID record, Google Scholar profile, or laboratory profile. Table 2 compares the completeness and accuracy of the publication records of these 12 laureates across the data sources.
The selected 12 laureates were analyzed for completeness of publications based on the availability of an actual resume, a complete ORCID record, a lab website or profile page, an NLM bibliography, or a scholar profile. The two sources (the MAG dataset and PubMed) were matched to the multidimensional, comprehensive dataset to determine missing or extra publications. The MAG dataset was found to have missed a considerable number of publications, while the PubMed set had numerous incorrect authors. Both sets included multiple non-articles. The extensive curation of the multi-source, comprehensive dataset appeared to be superior in that the publications are genuine to the laureate, and there are no non-articles or missing articles.
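The matching step described above can be sketched with set operations, assuming every publication carries a stable identifier shared across sources. The identifiers and the function name below are made up for illustration; the curation reported here also relied on manual checks that this sketch does not capture.

```python
def compare_to_reference(reference_ids, candidate_ids):
    """Return records missing from, extra in, and shared with a candidate source."""
    reference = set(reference_ids)
    candidate = set(candidate_ids)
    return {
        "missing": sorted(reference - candidate),  # in reference, absent from candidate
        "extra": sorted(candidate - reference),    # in candidate, absent from reference
        "shared": sorted(reference & candidate),
    }


# Hypothetical identifiers: both lists have four entries, yet only two overlap.
reference = ["rec-a", "rec-b", "rec-c", "rec-d"]
candidate = ["rec-b", "rec-c", "rec-x", "rec-y"]

result = compare_to_reference(reference, candidate)
print(result["missing"], result["extra"])
```

The example is deliberately built so that both lists have the same length, showing how equal publication counts can conceal both missing and extra records.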
To assess agreement between the two datasets, we examined the difference in publication counts between the MAG dataset and the multi-source, comprehensive dataset using a Bland-Altman plot (Fig. 3). The average difference is 35.2, and the 95% confidence interval for the average difference is [−64.10, 135.53]. Statistical analysis shows no consistent bias, but there is proportional bias with random deviations between the two.
Fig. 3 Bland-Altman plot comparing publication counts between the MAG dataset and the multidimensional comprehensive dataset. The average difference between the MAG (Li) dataset and the multi-source comprehensive dataset is 35.2, and the 95% confidence interval for the average difference is [−64.10, 135.53].
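For readers who wish to reproduce this kind of comparison on their own paired counts, the sketch below computes the quantities usually plotted in a Bland-Altman analysis: the per-pair means and differences, the mean difference (bias), and limits of agreement taken as the bias plus or minus 1.96 sample standard deviations. That convention is an assumption; the text does not state how the interval reported above was derived, and the input values are invented.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Return per-pair means and differences, the bias, and 1.96-SD limits."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long sequences with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    means = [(x + y) / 2 for x, y in zip(a, b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return {
        "means": means,   # x-axis of the plot
        "diffs": diffs,   # y-axis of the plot
        "bias": bias,
        "lower": bias - spread,
        "upper": bias + spread,
    }


# Invented publication counts for three laureates in two datasets.
ours = [10, 20, 30]
other = [8, 15, 31]
stats = bland_altman(ours, other)
print(stats["bias"], stats["lower"], stats["upper"])
```

A trend in the differences as the means grow would suggest proportional bias, which is what a scatter of `diffs` against `means` is meant to reveal.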
Additional analysis against the multidimensional comprehensive dataset showed that the PubMed data had a sensitivity (true positive rate) of 97.24% and a specificity (true negative rate) of 81.08%, whereas the MAG data had a sensitivity of 73.95% and a specificity of 2.29%.
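For reference, sensitivity and specificity follow from the four cells of a confusion table, with the curated dataset treated as the reference standard. The sketch below uses the standard definitions; the counts are invented for illustration and are not the counts behind the percentages reported above.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-table counts.

    tp: genuine publications the source included
    fn: genuine publications the source missed
    tn: non-genuine records the source correctly left out
    fp: non-genuine records the source wrongly included
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


# Invented counts, for illustration only.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=40, fp=10)
print(f"sensitivity={sens:.2%} specificity={spec:.2%}")
```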
Patents
For technical validation of the patents, we relied on the stringency of our name verification and multi-sourcing process to ensure a complete and accurate set of U.S. patents for the 58 Nobel laureates. We first searched the USPTO for last name and first name. Then, from those results, the affiliations, locations, and topics were verified.
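The two-step logic described above (a name search followed by verification of affiliation and topic) can be sketched as follows. The record fields, names, and lists are hypothetical; the sketch does not call any USPTO interface, and the verification described in this section was performed manually against multiple sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchHit:
    """One hypothetical search result; the fields are illustrative only."""
    last_name: str
    first_name: str
    affiliation: str
    topic: str


def filter_by_name(hits, last_name, first_name):
    """Step 1: keep hits whose name matches, ignoring case."""
    return [
        h for h in hits
        if h.last_name.casefold() == last_name.casefold()
        and h.first_name.casefold() == first_name.casefold()
    ]


def needs_manual_review(hit, known_affiliations, known_topics):
    """Step 2: flag hits whose affiliation or topic is not on the known lists."""
    return (hit.affiliation not in known_affiliations
            or hit.topic not in known_topics)


hits = [
    SearchHit("Doe", "Jane", "Example University", "gene regulation"),
    SearchHit("Doe", "Jane", "Unrelated Corp", "bicycle frames"),
    SearchHit("Roe", "Jane", "Example University", "gene regulation"),
]
candidates = filter_by_name(hits, "doe", "jane")
accepted = [
    h for h in candidates
    if not needs_manual_review(h, {"Example University"}, {"gene regulation"})
]
print(len(candidates), len(accepted))
```

Flagging rather than discarding the non-matching hit reflects the fact that a namesake and a laureate who changed institutions look identical at the name-search stage.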
NIH Awards
For the validation of the NIH awards, we found one dataset, produced by Lauer26, that showed the number of awards by degree type for NIH recipients. Although that dataset covers all NIH recipients rather than Nobel laureates specifically, our dataset matched it almost exactly in the Single Scientific PhD category. However, our dataset differed significantly in the Single Medicine and Dual Degree categories (Table 3). Each laureate was searched in the NIH RePORTER database first by last name and first name. Then, based on those results, affiliation and topic were verified.
Usage Notes
While automation is clearly the future of scientific performance assessment, the sourcing of information appears to be a much more critical decision than previously assumed. In the biomedical research field, the PubMed database of the NIH National Library of Medicine appears to be the most complete and highest quality source of information on research publications, including those of Nobel prize winners.
Furthermore, multi-sourcing appears to be an essential step for any research productivity data collection that places an emphasis on accuracy and completeness. Because inclusion criteria vary between sources and particular methodologies introduce their own errors, many individual sources are less than accurate.
In the assessment of research productivity, including Nobel Prize-winning research, multidimensional analysis provides a much richer picture than the usual single-source, publication-focused study. The recognition and impact of intermediate research achievements on the road to landmark discoveries cannot be seen without looking at the diversity of products and experiences.
Code availability
No custom code was used in the creation of the dataset.
References
Zuckerman, H. Scientific elite: Nobel laureates in the United States. Transaction Publishers (1977).
Chariker, J. H., Zhang, Y., Pani, J. R. & Rouchka, E. C. Identification of successful mentoring communities using network-based analysis of mentor-mentee relationships across Nobel laureates. Scientometrics 111, 1733–1749, https://doi.org/10.1007/s11192-017-2364-4 (2017).
Amin, R. & Wani, Z. A. Characterizing the publications of eminent scientists: the case of Nobel laureates in medicine. Advance and Innovative Research 149, (2021).
Ashton, S. V. & Oppenheim, C. A method of predicting Nobel prizewinners in chemistry. Social Studies of Science 8, 341–348, https://doi.org/10.1177/030631277800800306 (1978).
Antonakis, J. & Lalive, R. Quantifying scholarly impact: IQp versus the Hirsch h. Journal of the American Society for Information Science and Technology 59(6), 956–969, https://doi.org/10.1002/asi.20802 (2008).
Bhattacharya, J. et al. Resting on their laureates? Research productivity among winners of the Nobel Prize in Physiology or Medicine. National Bureau of Economic Research Working Paper No. w31352 http://www.nber.org/papers/w31352 (2023).
Chen, Y. & Ding, J. Exploitation and exploration: An analysis of the research pattern of Nobel laureates in Physics. Journal of Informetrics 17, 101428, https://doi.org/10.1016/j.joi.2023.101428 (2023).
Garfield, E. The use of journal impact factors and citation analysis for evaluation of science. In 41st Annual Meeting of the Council of Biology Editors, Salt Lake City, UT. The Use of Journal Impact Factors and Citation Analysis For Evaluation of Science - Presented in Oslo, April 17, 1998 (1998).
Garfield, E. Identifying Nobel class scientists and the uncertainties thereof. In European Conference on Scientific Publication in Medicine and Biomedicine document.pdf (2006).
Garfield, E. & Welljams-Dorof, A. Of Nobel class: A citation perspective on high impact research authors. Theor Med Bioeth. 13, 117–135, https://doi.org/10.1007/BF02163625 (1992).
Gingras, Y. & Wallace, M. Why it has become more difficult to predict Nobel Prize winners: a bibliometric analysis of nominees and winners of the chemistry and physics prizes (1901–2007). Scientometrics 82, 401–412, https://doi.org/10.1007/s11192-009-0035-9 (2010).
Hirsch, J. E. An index to quantify an individual’s scientific research output. Proc. Natl Acad. Sci. USA 102, 16569–16572, https://doi.org/10.1073/pnas.0507655102 (2005).
Ioannidis, J. P., Cristea, I. A. & Boyack, K. W. Work honored by Nobel prizes clusters heavily in a few scientific fields. PloS One 15, e0234612, https://doi.org/10.1371/journal.pone.0234612 (2020).
Li, J., Yin, Y., Fortunato, S. & Wang, D. Scientific elite revisited: Patterns of productivity, collaboration, authorship and impact. Journal of the Royal Society Interface 17, 20200135, https://doi.org/10.1098/rsif.2020.0135 (2020).
Wagner, C. S., Horlings, E., Whetsell, T. A., Mattsson, P. & Nordqvist, K. Do Nobel Laureates create prize-winning networks? An analysis of collaborative research in physiology or medicine. PloS One 10, e0134164, https://doi.org/10.1371/journal.pone.0134164 (2015).
Ye, S., Xing, R., Liu, J. & Xing, F. Bibliometric analysis of Nobelists’ awards and landmark papers in physiology or medicine during 1983–2012. Annals of Medicine 45, 532–538, https://doi.org/10.3109/07853890.2013.850838 (2013).
Li, J., Yin, Y., Fortunato, S. & Wang, D. A dataset of publication records for Nobel laureates. Scientific Data 6, 33, https://doi.org/10.1038/s41597-019-0033-6 (2019).
Liang, G. et al. Understanding Nobel prize-winning articles. Current Science 116, 379–385, https://www.jstor.org/stable/27137862 (2019).
Jones, B. F. & Weinberg, B. A. Age dynamics in scientific creativity. Proc. Natl Acad. Sci. USA 108, 18910–18914, https://doi.org/10.1073/pnas.1102895108 (2011).
Fortunato, S. Growing time lag threatens Nobels. Nature 508, 186–186, https://doi.org/10.1038/508186a (2014).
Chan, H. F., Önder, A. S. & Torgler, B. Do Nobel laureates change their patterns of collaboration following prize reception? Scientometrics 105, 2215–2235, https://doi.org/10.1007/s11192-015-1738-8 (2015).
Chan, H. F., Önder, A. S. & Torgler, B. The first cut is the deepest: repeated interactions of coauthorship and academic productivity in Nobel laureate teams. Scientometrics 106, 509–524, https://doi.org/10.1007/s11192-015-1796-y (2016).
Azoulay, P., Ding, W. & Stuart, T. The impact of academic patenting on the rate, quality and direction of (public) research output. The Journal of Industrial Economics 57, 637–676, https://doi.org/10.1111/j.1467-6451.2009.00395.x (2009).
Nordqvist, K. & Mattsson, P. Nobel Prize awarded discoveries and commercialization: The role of the laureates. In Attributing Excellence in Medicine (pp. 188–206). Brill. https://doi.org/10.1163/9789004406421_011 (2019).
Matthews, K. R., Calhoun, K. M., Lo, N. & Ho, V. The aging of biomedical research in the United States. PloS One 6, e29738, https://doi.org/10.1371/journal.pone.0029738 (2011).
Lauer, M. A look at the degree types for principal investigators designated on NIH applications and awards: FYs 2014 to 2023. NIH Extramural Nexus https://nexus.od.nih.gov/all/2024/01/30/degree-types-on-nih-applications/ (2024).
Merton, R. K. The Matthew effect in science: The reward and communication systems of science are considered. Science 159, 56–63, https://doi.org/10.1126/science.159.3810.56 (1968).
Zuckerman, H. Nobel laureates in science: Patterns of productivity, collaboration, and authorship. American Sociological Review 32, 391–403, https://doi.org/10.2307/2091086 (1967).
Nobel Prize Outreach AB. NobelPrize.org – Official Website of the Nobel Prize. https://www.nobelprize.org/prizes/ (2024).
National Library of Medicine. PubMed Database. U.S. National Institutes of Health. https://pubmed.ncbi.nlm.nih.gov/ [Accessed: Nov 18, 2024] (2024).
Google LLC. Google Scholar. https://scholar.google.com/ (2024).
Cold Spring Harbor Laboratory. Codebreaker: Makers of Modern Genetics Collection – ArchivesSpace. https://archivesspace.cshl.edu/search (2024).
United States Patent and Trademark Office (USPTO). United States Patent Full-Text and Image Database. Patent Public Search | USPTO. [Accessed: Nov 18, 2024] (2023).
European Patent Office. Espacenet – Patent Search. https://worldwide.espacenet.com/patent/ (2024).
National Institutes of Health. NIH RePORTER: Research Portfolio Online Reporting Tools. https://report.nih.gov/. [Accessed: Nov 18, 2024] (2024).
European Research Council. ERC Funded Projects Database. https://erc.europa.eu/homepage. (2024).
Clarivate Analytics. Journal Citation Reports. Clarivate - Leading Global Transformative Intelligence (2024).
The Center for Scientific Integrity. The Retraction Watch Database. https://retractiondatabase.org/ [Accessed: Nov 18, 2024] (2018).
National Institutes of Health. NIH RePORT Frequently Asked Questions (FAQs). https://report.nih.gov/faqs (2024).
Glänzel, W. & Moed, H. F. Journal impact measures in bibliometric research. Scientometrics 53, 171–193, https://doi.org/10.1023/A:1014848323806 (2002).
Wikipedia contributors. Physical Review – Wikipedia. https://en.wikipedia.org/wiki/Physical_Review. [Accessed May 15, 2025] (2024).
Burnett, W. J. & Balas, E. A. Nobel Laureates 2000–2023 Multidimensional Research Productivity Dataset. Augusta University Scholarly Commons. https://doi.org/10675.2/625578 (2024).
Tatsioni, A., Vavva, E. & Ioannidis, J. P. Sources of funding for Nobel Prize-winning work: public or private? FASEB J. 24, 1335–9, https://doi.org/10.1096/fj.09-148239 (2010).
Acknowledgements
The authors thank Zsolt Bagi, MD, PhD; Anatolij Horuzsko, MD, PhD; Vahe Heboyan, PhD; Betty Pace, MD; and Jorges Cortes, MD, for their encouragement and supportive comments through the data collection process.
Author information
Contributions
Conceptualization: W.J.B. and E.A.B. Data Curation: W.J.B. Formal Analysis: W.J.B. and E.A.B. Funding acquisition: E.A.B. Investigation: W.J.B.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Burnett, W.J., Balas, E.A. A multidimensional research productivity dataset of 21st-century Nobel Laureates in physiology or medicine. Sci Data 12, 1014 (2025). https://doi.org/10.1038/s41597-025-05278-0
DOI: https://doi.org/10.1038/s41597-025-05278-0





