Antibiotic Resistance Microbiology Dataset (ARMD): A Resource for Antimicrobial Resistance from EHRs

Nateghi Haredasht, Fateme; Amrollahi, Fatemeh; Maddali, Manoj V.; Marshall, Nicholas; Ma, Stephen P.; Cooper, Lauren N.; Johnson, Andrew O.; Wei, Ziming; Medford, Richard J.; Kanjilal, Sanjat; Banaei, Niaz; Deresinski, Stanley; Goldstein, Mary K.; Asch, Steven M.; Chang, Amy; Chen, Jonathan H.

doi:10.1038/s41597-025-05649-7

Data Descriptor
Open access
Published: 26 July 2025

Fateme Nateghi Haredasht ORCID: orcid.org/0000-0002-8874-8835¹,
Fatemeh Amrollahi¹,
Manoj V. Maddali²,
Nicholas Marshall ORCID: orcid.org/0009-0003-6051-5890³,
Stephen P. Ma ORCID: orcid.org/0000-0003-3738-9569⁴,
Lauren N. Cooper⁵,
Andrew O. Johnson⁶,
Ziming Wei⁷,
Richard J. Medford⁵,⁸,
Sanjat Kanjilal ORCID: orcid.org/0000-0002-1221-5725⁷,
Niaz Banaei⁹,¹⁰,
Stanley Deresinski⁹,
Mary K. Goldstein¹¹,
Steven M. Asch¹²,
Amy Chang⁹ &
…
Jonathan H. Chen¹,⁴,¹³

Scientific Data volume 12, Article number: 1299 (2025)


Abstract

The Antibiotic Resistance Microbiology Dataset (ARMD) is a de-identified resource derived from electronic health records (EHR) that facilitates research in antimicrobial resistance (AMR). ARMD encompasses data from adult patients collected over more than 15 years at two academic-affiliated hospitals, focusing on microbiological cultures, antibiotic susceptibilities, and associated clinical and demographic features. Key attributes include organism identification, susceptibility patterns for 55 antibiotics, implied susceptibility rules, and de-identified patient information. This dataset supports studies on antimicrobial stewardship, causal inference, and clinical decision-making. ARMD is designed to be reusable and interoperable, promoting collaboration and innovation in combating AMR. This paper describes the dataset’s acquisition, structure, and utility while detailing its de-identification process.

Background & Summary

Antimicrobial resistance (AMR) has emerged as a critical global health threat, compromising the effectiveness of antibiotics and leading to increased morbidity and mortality. In 2019, AMR was associated with nearly 5 million deaths worldwide, with at least 1.27 million directly attributable to resistant infections^1,2. In the United States alone, over 2.8 million antimicrobial-resistant infections occur annually, resulting in more than 35,000 deaths³. AMR occurs when microorganisms such as bacteria, viruses, fungi, and parasites evolve mechanisms to withstand the effects of antimicrobial agents^4,5. It also encompasses the selection and proliferation of organisms that inherently resist specific treatments, even without prior antimicrobial exposure. This problem is exacerbated by the overuse and misuse of antibiotics in clinical, agricultural, and community settings, which creates selective pressure favoring resistant strains.

Efforts to combat AMR require robust data resources to better understand resistance patterns, evaluate clinical practices, and develop evidence-based recommendations or practices for antimicrobial stewardship. However, comprehensive datasets that integrate microbiological and clinical data are rare. Moreover, the dynamic nature of resistance development necessitates datasets that capture temporal trends and patient-specific factors influencing AMR. Real-world data from electronic health records (EHR) offer a valuable opportunity to address this gap by providing granular information on microbial cultures, patient characteristics, and treatment outcomes^6,7,8,9. Yet, creating meaningful and reliable datasets from EHR data presents several challenges, including heterogeneity in data representation, the need for rigorous de-identification, and ensuring data quality and interpretability.

Several publicly available datasets facilitate the study of AMR by providing genomic, phenotypic, and epidemiological insights. Resources such as the National Database of Antibiotic Resistant Organisms (NDARO)¹⁰ and the Comprehensive Antibiotic Resistance Database (CARD)¹¹ focus on genetic determinants of resistance, while platforms like NARMS Now¹² and ResistanceMap¹³ offer population-level surveillance data. Additionally, datasets like AMR-UTI¹⁴ provide clinical insights from urinary tract infections, though they often lack comprehensive metadata linking microbiological findings to patient care and treatment outcomes. While these datasets contribute significantly to antimicrobial resistance research, they often remain siloed in their focus, emphasizing either genetic markers, population-level surveillance, or isolated clinical findings.

The Antibiotic Resistance Microbiology Dataset (ARMD) offers a uniquely integrated dataset that combines microbiological culture data, antibiotic susceptibility results, patient demographics, clinical history, and treatment exposures from a large, real-world hospital setting. By bridging laboratory and clinical data, ARMD enables deeper epidemiological analyses, supports the development of predictive models for empirical treatment strategies, and provides a foundation for studying antimicrobial stewardship in real-world healthcare environments. This dataset serves as a novel resource that facilitates both broad surveillance and detailed patient-level analyses to inform future AMR research and clinical decision-making.

The ARMD dataset further addresses key challenges in antimicrobial resistance research by providing a robust, longitudinal dataset derived from de-identified EHR data at Stanford Health Care. Spanning multiple years and including over 280,000 unique patients, ARMD captures a diverse patient population and integrates microbiological, clinical, and longitudinal patient-level data to create a comprehensive resource for studying antimicrobial resistance patterns. The microbiological data within ARMD includes detailed information on microbiology specimens, such as body site, organism identification, and antibiotic susceptibility profiles. Unlike many existing datasets, ARMD also includes records of negative cultures, which serve as valuable indicators for assessing disease severity, estimating treatment success or failure, and understanding patterns of microbial clearance over time.

ARMD’s comprehensive structure supports a wide range of research applications. It enables trend analysis, facilitating the monitoring of temporal shifts in resistance patterns across different organisms and clinical settings. The dataset is also valuable for risk factor identification, allowing researchers to assess how demographic and clinical characteristics contribute to the development of resistant infections. Additionally, ARMD serves as a foundation for predictive modeling efforts, aiding in the development of machine learning algorithms to predict resistance emergence and optimize empiric antibiotic therapy. The insights derived from ARMD can further inform policy development, guiding antimicrobial stewardship programs and public health strategies aimed at mitigating the spread of resistance^{15,16,17,18,19}. Making this dataset openly accessible encourages collaboration and innovation in AMR research, supporting global efforts to tackle this urgent public health threat.

Methods

Data acquisition

The ARMD dataset was developed using de-identified EHR from Stanford Health Care, encompassing a broad range of microbiological, clinical, and demographic data collected from 1999 to 2024. The dataset integrates microbiology laboratory results, demographic data, clinical encounters, antibiotic exposures, and socioeconomic indicators to enable comprehensive analyses of AMR patterns.

Stanford Health Care uses the Epic EHR system to manage patient records. Data from Epic’s operational database (Chronicles), which is optimized for real-time transactional processing, are regularly extracted into Clarity, Epic’s relational database designed for reporting and research purposes. At Stanford Health Care, the Clarity database is built on an Oracle-based system. For research and data analysis purposes, data from Clarity are integrated into the STAnford medicine Research data Repository (STARR)²⁰, which serves as a centralized data lake²¹. Access to STARR data was granted under Stanford IRB approval with review and oversight by the Privacy Office and Hospital Compliance Office to ensure regulatory compliance and patient privacy. Relevant data for ARMD, such as microbiological cultures, laboratory test results, vital signs, medication exposures, and patient demographics, were extracted from STARR using structured SQL queries executed on Google BigQuery, a managed, cloud-based data warehouse that enables fast and scalable querying of large datasets. Researchers accessed BigQuery through a secure Virtual Private Network (VPN) using Cisco technology, ensuring data privacy and compliance with institutional security protocols.

Organism identification was performed using Matrix-Assisted Laser Desorption Ionization Time-of-Flight (MALDI-TOF) mass spectrometry (Bruker Biotyper). Antibiotic susceptibility testing was conducted using the Vitek2 instrument (bioMérieux) for blood and urine cultures and the MicroScan WalkAway system (Beckman Coulter) for respiratory cultures. Minimum inhibitory concentrations were interpreted based on Clinical & Laboratory Standards Institute (CLSI) breakpoints.

Inclusion/exclusion criteria

The ARMD dataset includes both inpatient and outpatient cultures from adult patients (aged 18 years or older) with urine, blood, and respiratory cultures. These three culture types were selected due to their clinical significance in antimicrobial resistance research, representing common sites of bacterial infections. Fungal, viral, and parasitic cultures were not explicitly included, as ARMD primarily focuses on bacterial resistance. To enhance data relevance and minimize redundancy, repeated cultures from the same patient within a two-week period were excluded. The dataset includes both positive and negative culture results, with positivity determined by the identification of specific organisms.
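The two-week exclusion rule can be illustrated with a short sketch. This is not the authors' pipeline: the record fields (`patient_id`, `order_time`) are hypothetical, and the code assumes the window is measured from the last retained culture of the same patient, a detail the text does not specify.

```python
from datetime import datetime, timedelta

def drop_repeat_cultures(cultures, window_days=14):
    """Keep a culture only if the same patient has no retained culture
    within the preceding `window_days` days (assumed interpretation)."""
    window = timedelta(days=window_days)
    last_kept = {}  # patient_id -> order_time of the last retained culture
    kept = []
    for rec in sorted(cultures, key=lambda r: (r["patient_id"], r["order_time"])):
        previous = last_kept.get(rec["patient_id"])
        if previous is None or rec["order_time"] - previous >= window:
            kept.append(rec)
            last_kept[rec["patient_id"]] = rec["order_time"]
    return kept

cultures = [
    {"patient_id": "A", "order_time": datetime(2020, 1, 1)},
    {"patient_id": "A", "order_time": datetime(2020, 1, 10)},  # 9 days later: dropped
    {"patient_id": "A", "order_time": datetime(2020, 1, 20)},  # 19 days after the last kept: retained
    {"patient_id": "B", "order_time": datetime(2020, 1, 5)},
]
deduplicated = drop_repeat_cultures(cultures)
```

Anchoring the window on the last retained culture, rather than on the immediately preceding one, avoids discarding a long chain of closely spaced cultures in its entirety; either reading is compatible with the description above.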

Data processing & transformation

Organism and antibiotic names were standardized to resolve inconsistencies caused by varying nomenclature or formatting. When explicit susceptibility results were unavailable, intrinsic resistance was determined using Clinical and Laboratory Standards Institute (CLSI) standards, and implied susceptibility was inferred by linking susceptibility results between related antibiotics using predefined Stanford Microbiology Lab protocols, which are also based on CLSI standards^22,23,24. For example, susceptibility to an earlier-generation cephalosporin (e.g., cefazolin) implied susceptibility to a later-generation cephalosporin (e.g., ceftriaxone) based on established microbiological principles. These rules, documented in the related file, were systematically applied across all records.

All data were de-identified following the Safe Harbor method in accordance with the National Institute of Standards and Technology (NIST) guidelines. Additionally, the clinical text was anonymized using the TiDE algorithm to ensure compliance with privacy regulations²⁵. De-identification was performed in compliance with the Health Insurance Portability and Accountability Act (HIPAA) and Stanford Health Care’s privacy regulations. Specifically, demographic data were anonymized by replacing exact ages with predefined age bins (e.g., 18–24, 25–34) and grouping all patients older than 89 into a single “90+” category. Sex was anonymized as binary values (0 and 1), with no further specification of sex labels. All date and time fields, including culture order dates, laboratory test dates, and medication administration times, underwent temporal jittering. This process applies a random offset to time-related data at the patient level, obscuring exact dates while preserving the relative temporal relationships essential for longitudinal analyses. No statistical imputation was applied, so that users of the dataset can handle missing data according to their specific research methodologies.
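Patient-level temporal jittering can be sketched as follows. This is an illustration of the general technique only: the offset range (`max_shift_days`), the whole-day granularity, and the field names are assumptions, since the offsets used for ARMD are not described.

```python
import random
from datetime import datetime, timedelta

def jitter_dates(events, max_shift_days=30, seed=0):
    """Shift every event of a patient by the same random number of days.
    One offset per patient keeps within-patient intervals unchanged."""
    rng = random.Random(seed)
    offsets = {}  # patient_id -> timedelta applied to all of that patient's events
    shifted = []
    for ev in events:
        pid = ev["patient_id"]
        if pid not in offsets:
            offsets[pid] = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
        shifted.append({**ev, "time": ev["time"] + offsets[pid]})
    return shifted

events = [
    {"patient_id": "A", "time": datetime(2019, 5, 1, 9, 0)},   # culture order
    {"patient_id": "A", "time": datetime(2019, 5, 3, 14, 0)},  # later lab test
    {"patient_id": "B", "time": datetime(2019, 5, 1, 9, 0)},
]
shifted = jitter_dates(events)
```

Because each patient receives a single offset, the interval between any two events of the same patient is identical before and after shifting, while dates are no longer comparable across patients.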

Data structure & schema

The data are structured to reflect the clinical timeline relevant to microbiological culture collection, including patient-level factors, clinical data, and culture-specific results. Figure 1 illustrates the data flow and relationships among the various data elements, highlighting how patient demographics, healthcare exposures, clinical data, and microbiological findings are linked.

The Patient-Level Data layer includes patient characteristics such as demographics (age, sex), socioeconomic indicators via the Area Deprivation Index (ADI), comorbidities (derived from the Elixhauser Comorbidity Index), and nursing home visits, which are known to confer antimicrobial resistance risk. The Clinical Context layer provides details about the care environment and relevant exposures surrounding culture collection, including ward information (e.g., intensive care unit [ICU], emergency department [ED], inpatient, outpatient) and prior exposures to antibiotics, medications, or procedures that may impact infection risk or resistance patterns. The Culture Collection layer focuses on the specific microbiological culture, capturing laboratory results (e.g., white blood cell counts, lactate levels) recorded within the 14 days preceding the culture order time and vital signs (e.g., heart rate, blood pressure, temperature) recorded within the 48 hours preceding the culture order. The Culture Data layer contains the results of the microbiological analysis, including culture type and positivity, organism identification, and antibiotic susceptibility profiles. It also includes implied susceptibility, inferred using established microbiological rules^22,23,24 to facilitate analysis when direct testing was not performed.
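The two lookback windows described for the Culture Collection layer can be expressed as a small predicate. This is a sketch under stated assumptions: the function and constant names are hypothetical, and treating both window edges as inclusive is a choice made here, not something the text specifies.

```python
from datetime import datetime, timedelta

LAB_LOOKBACK = timedelta(days=14)      # labs: 14 days before the culture order
VITALS_LOOKBACK = timedelta(hours=48)  # vital signs: 48 hours before the order

def in_lookback(measured_at, ordered_at, lookback):
    """True if a measurement falls inside the window ending at the order time."""
    return ordered_at - lookback <= measured_at <= ordered_at

order_time = datetime(2021, 3, 15, 12, 0)
# A lab drawn 10 days before the order is inside the 14-day window.
lab_ok = in_lookback(datetime(2021, 3, 5, 8, 0), order_time, LAB_LOOKBACK)
# A vital sign recorded 3 days before the order is outside the 48-hour window.
vital_ok = in_lookback(datetime(2021, 3, 12, 8, 0), order_time, VITALS_LOOKBACK)
```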

Ethical considerations

While the ARMD dataset has undergone rigorous de-identification processes, ethical data use remains paramount. Researchers should apply appropriate data security measures and respect the ethical guidelines outlined in the dataset’s documentation. This study was approved by the Stanford University Institutional Review Board (IRB #70466). The IRB granted a waiver of patient consent in accordance with 45 CFR 164.512(i)(2)(ii).

Data Records

Variables and attributes

The ARMD dataset²⁶ is available at Dryad and encompasses a wide range of variables that are organized into multiple linked tables, each offering a unique perspective on a patient’s microbiological, demographic, and clinical characteristics. To facilitate downstream analyses, the dataset includes tables on implied antibiotic susceptibility relationships and rules applied for inferring susceptibility where direct testing was not available. Researchers can also leverage longitudinal data, capturing the timing of infections, prior medical procedures, and medication exposures relative to culture orders, enabling temporal analyses.

At the core of ARMD is the microbiological cultures cohort, which includes details about culture types—urine, respiratory, and blood cultures—along with the identified organisms and their antibiotic susceptibilities. Antibiotic susceptibility results were included for 55 antibiotics and categorized into five groups: susceptible, resistant, intermediate, inconclusive, and synergism. Synergism refers to cases where the interaction between two antibiotics results in an enhanced effect, meaning the combined treatment is more effective than either antibiotic alone. This category captures instances labeled as “Synergy” or “No Synergy” in the dataset. Additional features include the culture’s ordering mode (inpatient or outpatient) and the order’s timing.
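As an illustration of how the susceptibility categories might be used, the sketch below computes a resistance rate per organism and antibiotic pair. The label strings and the tuple layout are hypothetical, and two analysis choices are assumptions made here rather than part of the dataset: inconclusive and synergy results are left out of the denominator, and intermediate results are counted as not resistant.

```python
from collections import Counter

# Labels that enter the denominator (assumed spelling of the categories).
INTERPRETABLE = {"Susceptible", "Intermediate", "Resistant"}

def resistance_rates(results):
    """results: iterable of (organism, antibiotic, label) tuples.
    Returns {(organism, antibiotic): fraction of interpretable results that are resistant}."""
    totals, resistant = Counter(), Counter()
    for organism, antibiotic, label in results:
        if label not in INTERPRETABLE:
            continue  # skip inconclusive and synergy entries
        key = (organism, antibiotic)
        totals[key] += 1
        if label == "Resistant":
            resistant[key] += 1
    return {key: resistant[key] / totals[key] for key in totals}

results = [
    ("ESCHERICHIA COLI", "Ciprofloxacin", "Resistant"),
    ("ESCHERICHIA COLI", "Ciprofloxacin", "Susceptible"),
    ("ESCHERICHIA COLI", "Ciprofloxacin", "Susceptible"),
    ("ESCHERICHIA COLI", "Ciprofloxacin", "Inconclusive"),
]
rates = resistance_rates(results)
```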

The dataset situates each culture event within its clinical context. The ward information provides insights into the care environment where cultures were collected, distinguishing between inpatient wards, intensive care units (ICU), emergency departments (ED), and outpatient clinics.

To capture potential influences on culture outcomes, ARMD includes records of prior antibiotic exposures. This component details the antibiotic name, class, and subtype, enabling analyses of how previous treatments may affect organism susceptibility and resistance development. The timing of these exposures relative to culture collection is recorded, supporting studies on the impact of prior antibiotic use on resistance development. Additionally, the dataset tracks microbial resistance trends on both individual and population levels over time, recording the evolution of resistance relative to culture events for specific organisms and antibiotics. Historical infection data are captured through the inclusion of a prior infecting organism table, which documents organisms identified in previous cultures for each patient. This enables longitudinal analyses of infection recurrence and its potential influence on current antimicrobial resistance. The table records the identified organism and the timing of the prior infection relative to each collected culture.

Patient demographics offer an essential context for stratifying analyses by age (binned into predefined ranges) and sex (binary-coded). In addition, the dataset incorporates socio-environmental factors through the inclusion of ADI scores, which capture neighborhood-level socioeconomic characteristics based on patient ZIP codes from the Neighborhood Atlas²⁷. ADI scores designed for 9-digit ZIP codes account for factors such as income, education, employment, and housing quality, providing a broader context for understanding disparities in AMR risk. For records with only 5-digit ZIP codes, missing ADI scores were replaced with the average ADI score calculated from 9-digit ZIP codes sharing the same first 5 digits. For other cases with invalid or unavailable ADI scores (e.g., marked as P, U, or NA), no imputation was performed, and these entries were left as null values in the dataset.
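The ADI fallback for 5-digit ZIP codes can be sketched as below. This only illustrates the derivation described above; the function names and the in-memory lookup table are hypothetical, and the ZIP codes and scores in the example are made up.

```python
from collections import defaultdict
from statistics import mean

def build_zip5_means(adi_by_zip9):
    """Average the valid 9-digit ADI scores that share the same first 5 digits."""
    groups = defaultdict(list)
    for zip9, score in adi_by_zip9.items():
        if isinstance(score, (int, float)):  # skip codes such as "P", "U", "NA"
            groups[zip9[:5]].append(score)
    return {zip5: mean(scores) for zip5, scores in groups.items()}

def lookup_adi(zip_code, adi_by_zip9, zip5_means):
    """Return the ADI for a ZIP code, or None when no valid score exists."""
    digits = zip_code.replace("-", "")
    if len(digits) == 9:
        score = adi_by_zip9.get(digits)
        return score if isinstance(score, (int, float)) else None
    if len(digits) == 5:
        return zip5_means.get(digits)  # fallback: mean over matching 9-digit ZIPs
    return None

adi_by_zip9 = {"000010001": 10, "000010002": 20, "000010003": "P", "000020001": 50}
zip5_means = build_zip5_means(adi_by_zip9)
```

In this sketch an invalid 9-digit entry stays null instead of inheriting the 5-digit average, which mirrors the statement that invalid or unavailable scores were not imputed.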

Recognizing the role of long-term care facilities in AMR dynamics, nursing home visits are also documented, specifying the number of days between visits and culture orders, up to 90 days, to highlight potential risk factors for resistant infections.

Comprehensive laboratory data are integrated into the dataset, capturing key clinical measurements taken around the time of each culture order. Variables include white blood cell count, hemoglobin, creatinine, lactate, and procalcitonin, among other routinely collected studies. Each metric is summarized using statistical descriptors such as medians, quartiles (Q25, Q75), and first and last recorded values. Furthermore, vital sign data—including heart rate, blood pressure, temperature, and respiratory rate—provide additional clinical context, enabling analyses of physiological responses to infection.
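The per-culture summary descriptors can be reproduced from raw measurements with a few lines of standard-library Python. This is a sketch: the input layout is hypothetical, and the quartile convention (linear interpolation, `method="inclusive"`) is an assumption, since the exact definition used for ARMD is not stated.

```python
from statistics import median, quantiles

def summarize(measurements):
    """measurements: list of (time, value) pairs for one lab test and one culture.
    Returns median, quartiles, and the first and last recorded values."""
    if not measurements:
        return None
    values_by_time = [value for _, value in sorted(measurements)]
    if len(values_by_time) >= 2:
        q25, _, q75 = quantiles(values_by_time, n=4, method="inclusive")
    else:
        q25 = q75 = values_by_time[0]  # a single value is its own quartile
    return {
        "median": median(values_by_time),
        "q25": q25,
        "q75": q75,
        "first": values_by_time[0],
        "last": values_by_time[-1],
    }

# Times are given as hours since an arbitrary reference; values are made up.
summary = summarize([(1, 10.0), (3, 14.0), (2, 12.0), (4, 20.0)])
```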

Comorbid conditions are mapped using standardized indices such as the Elixhauser Comorbidity Index²⁸ and the Agency for Healthcare Research and Quality (AHRQ) Clinical Classifications Software Refined (CCSR)²⁹. Each comorbidity is timestamped relative to the culture. Notably, ongoing comorbidities are flagged using NULL values in the end date field, indicating that the condition was active at the time of culture collection. These NULL values do not represent missing data or the absence of the condition. Procedural history is also provided, with records of medical procedures (e.g., central venous catheter placements, mechanical ventilation) performed prior to culture orders, derived from Current Procedural Terminology (CPT) codes.
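Because a NULL end date means "ongoing" rather than "missing", code that filters comorbidities needs to treat it explicitly. The sketch below assumes, hypothetically, that start and end are stored as day offsets relative to the culture order (negative values before the order); the actual column names and units should be checked against the README.

```python
def active_at_culture(start_day, end_day):
    """True if a comorbidity was active when the culture was ordered.
    Offsets are days relative to the culture order; end_day=None means ongoing."""
    started = start_day <= 0
    ongoing = end_day is None or end_day >= 0
    return started and ongoing

still_active = active_at_culture(-400, None)  # began long before, no end date recorded
resolved = active_at_culture(-400, -30)       # ended 30 days before the culture
```

Dropping rows with a NULL end date as if they were incomplete would silently remove exactly the conditions that were active at culture time.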

Lastly, the implied susceptibility table infers antibiotic susceptibility for drugs that were not tested, using an extensive set of predefined rules. This table captures cases where susceptibility to one antibiotic can imply susceptibility or resistance to another, based on established microbiological and pharmacological principles. Incorporating these implied relationships enhances the interpretability of susceptibility data, which can be critical for guiding clinical decision-making and understanding resistance patterns. We also share the rules applied to derive these implied relationships, providing transparency and enabling researchers to understand and reproduce the logic behind the inferred data.
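A rule-based inference of this kind can be sketched as a lookup from a tested result to the results it implies. The rule table below contains only the cefazolin and ceftriaxone example mentioned in the Methods; the data layout, label strings, and function name are hypothetical, and the real rules should be taken from the rules file shared with the dataset.

```python
# (tested antibiotic, tested result) -> list of (implied antibiotic, implied result)
RULES = {
    ("cefazolin", "Susceptible"): [("ceftriaxone", "Susceptible")],
}

def add_implied(tested, rules=RULES):
    """tested: {antibiotic: result} for one isolate.
    Returns implied results only for antibiotics without a direct test."""
    implied = {}
    for antibiotic, result in tested.items():
        for target, target_result in rules.get((antibiotic, result), []):
            if target not in tested:  # never override a directly measured result
                implied.setdefault(target, target_result)
    return implied

implied_only = add_implied({"cefazolin": "Susceptible"})
nothing_implied = add_implied({"cefazolin": "Susceptible", "ceftriaxone": "Resistant"})
```

Keeping implied results in a separate mapping, as the dataset does with a separate table, lets an analysis choose whether to include them alongside directly tested results.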

Demographics and microbiological culture data

ARMD comprises 751,075 microbiological culture records collected from 283,715 unique patients. Urine cultures constitute the majority of samples (50.0%), blood cultures represent 38.8%, and respiratory cultures account for 11.3%. The dataset spans from December 1999 to February 2024; however, there is a noticeable increase in recorded culture orders starting in 2008. This shift aligns with Stanford’s adoption of Epic as the EHR system, which significantly improved data collection and documentation.

The patient population demonstrates a broad age distribution, as illustrated in Fig. 2, with an average age of 56.7 years. Female patients predominate, accounting for 66.9% of the cohort (189,864 patients); male patients account for 33.0% (93,763 patients), and a minimal fraction (0.03%, n = 82) have an unknown sex designation.

Figure 3 presents the annual distribution of the top five most common organisms identified in urine, blood, and respiratory cultures from 2013 to 2023. In urine cultures (Fig. 3a), Escherichia coli (E. coli) is the predominant pathogen, consistently accounting for more than 60% of isolates. Klebsiella pneumoniae and Proteus mirabilis are the next most frequently detected organisms, with little variation over time. This stability indicates a consistent microbiological profile for urinary tract infections (UTIs) within the cohort, in line with established epidemiological trends nationwide^30,31,32.

In blood cultures (Fig. 3b), a more diverse range of pathogens is observed compared to urine cultures. While E. coli remains the most common pathogen, Staphylococcus aureus and coagulase-negative staphylococci are more prevalent than in urine cultures, reflecting the tendency of Gram-positive cocci to cause bloodstream infections.

In respiratory cultures (Fig. 3c), Pseudomonas aeruginosa is the most frequently isolated pathogen, possibly related to selection bias among patients who underwent respiratory culture testing from either non-invasive (e.g., induced sputum) or invasive (e.g., bronchoalveolar lavage) methods. A distinction between mucoid and non-mucoid Pseudomonas aeruginosa is observed, likely reflecting changes in microbiology reporting standards. Mucoid strains are clinically significant, particularly in chronic respiratory infections. Other notable organisms include Klebsiella pneumoniae and Staphylococcus aureus, both of which remain stable contributors to respiratory infections throughout the study period.

Technical Validation

To maintain data integrity, we minimized structural changes during transformation, preserving the clinical semantics of the original EHR data. Organism names, antibiotic identifiers, and susceptibility labels were standardized across records to resolve inconsistencies arising from varied reporting practices over the dataset’s 15-year span. Although the dataset spans from 1999 to 2024, structured and consistent EHR documentation began with the adoption of Epic in 2008. Therefore, most validations focus on data from the 15-year period beginning in 2008.

To validate dataset completeness and accuracy, we performed internal cross-checks on key variables, including culture positivity, organism-antibiotic susceptibility pairings, and linkage across patient demographics, clinical history, and laboratory results. Descriptive statistics were computed to assess expected distributions of age, sex, culture types, and pathogen prevalence, all of which aligned with established epidemiological benchmarks.

Version-controlled scripts were used throughout the data processing pipeline. All transformation logic, including implied susceptibility rules, is documented and provided alongside the dataset to support reproducibility and community validation. Issues identified during development were tracked collaboratively and resolved through iterative testing with domain experts in infectious diseases, clinical microbiology, and clinical informatics.

Usage Notes

This dataset supports studies in several critical areas, including AMR trend analysis, the development of predictive models for empirical antibiotic selection, and the examination of clinical and environmental factors that influence resistance patterns. The inclusion of granular data on culture positivity, organism identification, antibiotic susceptibility, prior medication exposures, comorbidities, and nursing home visits allows for detailed epidemiological analyses and modeling of resistance risk factors. Resistant isolates were not biobanked by the Stanford Health Care clinical microbiology laboratory and are not available for external laboratory access. A README file is included in the Dryad repository to guide users in navigating the dataset structure and contents. While regular updates are not planned, future revisions will be versioned and documented in the repository.

Handling missing data

Empty fields within the dataset are explicitly marked as “null” to maintain clarity. Users are advised to handle these values appropriately during analysis, particularly when conducting statistical modeling or machine learning tasks.
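When reading the CSV files with the standard library, the literal string "null" has to be converted to a proper missing value before analysis. The sketch below assumes the token is spelled exactly `null`; the column names and sample rows are made up for illustration.

```python
import csv
import io

def read_rows(handle, null_token="null"):
    """Read CSV rows as dictionaries, mapping the null token to None."""
    reader = csv.DictReader(handle)
    return [
        {column: (None if value == null_token else value) for column, value in row.items()}
        for row in reader
    ]

# In practice `handle` would be an opened dataset file; a small in-memory sample is used here.
sample = io.StringIO("culture_id,organism\n1,ESCHERICHIA COLI\n2,null\n")
rows = read_rows(sample)
```

Users working in pandas get similar behavior without extra code, since `read_csv` treats the string `null` as a missing value by default.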

Code availability

No pre-packaged analysis scripts are included with the dataset. However, the structured format of the CSV files supports seamless integration with common data analysis tools, such as Python (pandas, scikit-learn, etc.), R, SPSS, or SAS. Users requiring guidance on analytical workflows can contact the dataset authors for further support.

References

1. CDC. Antimicrobial Resistance Facts and Stats. Antimicrobial Resistance https://www.cdc.gov/antimicrobial-resistance/data-research/facts-stats/index.html (2025).
2. Murray, C. J. L. et al. Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. The Lancet 399, 629–655 (2022).
3. CDC. 2019 Antibiotic Resistance Threats Report. Antimicrobial Resistance https://www.cdc.gov/antimicrobial-resistance/data-research/threats/index.html (2025).
4. Tenover, F. C. Mechanisms of antimicrobial resistance in bacteria. Am J Med 119, S3–10, discussion S62-70 (2006).
5. McManus, M. C. Mechanisms of bacterial resistance to antimicrobial agents. Am J Health Syst Pharm 54, 1420–1433 quiz 1444–1446 (1997).
6. Carestia, M. et al. A novel, integrated approach for understanding and investigating Healthcare Associated Infections: A risk factors constellation analysis. PLoS One 18, e0282019 (2023).
7. Heumos, L. et al. An open-source framework for end-to-end analysis of electronic health record data. Nat Med 30, 3369–3380 (2024).
8. Hou, J. et al. Generate Analysis-Ready Data for Real-world Evidence: Tutorial for Harnessing Electronic Health Records With Advanced Informatic Technologies. J Med Internet Res 25, e45662 (2023).
9. Johnson, A. E. W. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 10, 1 (2023).
10. National Database of Antibiotic Resistant Organisms (NDARO) - Pathogen Detection - NCBI. https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/.
11. Alcock, B. P. et al. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research 48, D517–D525 (2020).
12. NARMS Now. https://wwwn.cdc.gov/narmsnow/.
13. ResistanceMap. https://resistancemap.onehealthtrust.org/.
14. Sontag, D. AMR-UTI: Antimicrobial Resistance in Urinary Tract Infections Dataset. MIT Clinical ML https://clinicalml.org/data/amr-dataset/.
15. Chang, A. & Chen, J. H. BSAC Vanguard Series: Artificial intelligence and antibiotic stewardship. Journal of Antimicrobial Chemotherapy 77, 1216–1217 (2022).
16. Corbin, C. K. et al. Personalized antibiograms for machine learning driven antibiotic selection. Commun Med 2, 1–14 (2022).
17. Corbin, C. K., Medford, R. J., Osei, K. & Chen, J. H. Personalized Antibiograms: Machine Learning for Precision Selection of Empiric Antibiotics. AMIA Jt Summits Transl Sci Proc 2020, 108–115 (2020).
18. Cooper, L. N. et al. Socioeconomic Disparities and the Prevalence of Antimicrobial Resistance. Clinical Infectious Diseases 79, 1346–1353 (2024).
19. Haredasht, F. N. et al. Enhancing Antibiotic Stewardship: A Machine Learning Approach to Predicting Antibiotic Resistance in Inpatient Care. AMIA Annu Symp Proc 2024, 857–864 (2025).
20. STARR OMOP | Observational Medical Outcomes Partnership | Stanford Medicine. https://med.stanford.edu/starr-omop.html.
21. Electronic Health Record | STAnford medicine Research data Repository. https://starr.stanford.edu/data-types/electronic-health-record.
22. M100 Ed35 | Performance Standards for Antimicrobial Susceptibility Testing, 35th Edition. Clinical & Laboratory Standards Institute https://clsi.org/standards/products/microbiology/documents/m100/.
23. M45 Ed3 Test Infrequently Isolated/Fastidious Bacteria. Clinical & Laboratory Standards Institute https://clsi.org/standards/products/microbiology/documents/m45/.
24. eucast: EUCAST. https://www.eucast.org/.
25. Datta, S. et al. A new paradigm for accelerating clinical data science at Stanford Medicine. Preprint at https://doi.org/10.48550/arXiv.2003.10534 (2020).
26. Nateghi Haredasht, F. et al. Antibiotic Resistance Microbiology Dataset (ARMD). 24479454832 bytes Dryad https://doi.org/10.5061/DRYAD.JQ2BVQ8KP (2025).
27. Kind, A. J. H. & Buckingham, W. R. Making Neighborhood-Disadvantage Metrics Accessible — The Neighborhood Atlas. N Engl J Med 378, 2456–2458 (2018).
28. Elixhauser, A., Steiner, C., Harris, D. R. & Coffey, R. M. Comorbidity measures for use with administrative data. Med Care 36, 8–27 (1998).
29. Clinical Classifications Software Refined (CCSR).
30. Manning, W. D., Longmore, M. A. & Giordano, P. C. Cohabitation and Intimate Partner Violence during Emerging Adulthood: High Constraints and Low Commitment. J Fam Issues 39, 1030–1055 (2018).
31. Sintsova, A. et al. Genetically diverse uropathogenic Escherichia coli adopt a common transcriptional program in patients with UTIs. eLife 8, e49748.
32. Subashchandrabose, S. et al. Host-specific induction of Escherichia coli fitness genes during human urinary tract infection. Proc Natl Acad Sci USA 111, 18327–18332 (2014).

Acknowledgements

This work was supported by the National Institute of Allergy and Infectious Diseases (NIAID) of the National Institutes of Health under grant R01AI179155. We also acknowledge the contributions of our collaborators at Stanford University, UT Southwestern Medical Center, Harvard Medical School, and Harvard Pilgrim Healthcare Institute for their support in data harmonization, validation, and scientific guidance throughout this project.

Author information

Authors and Affiliations

Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
Fateme Nateghi Haredasht, Fatemeh Amrollahi & Jonathan H. Chen
Division of Pulmonary and Critical Care Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
Manoj V. Maddali
Division of Pediatric Infectious Diseases, Department of Pediatrics, Stanford University School of Medicine, Palo Alto, CA, USA
Nicholas Marshall
Division of Hospital Medicine, Stanford University, Stanford, CA, USA
Stephen P. Ma & Jonathan H. Chen
Clinical Informatics Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
Lauren N. Cooper & Richard J. Medford
Information Services, East Carolina University, Greenville, NC, USA
Andrew O. Johnson
Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Healthcare Institute, Boston, Massachusetts, USA
Ziming Wei & Sanjat Kanjilal
Brody School of Medicine, Department of Internal Medicine, East Carolina University, Greenville, NC, USA
Richard J. Medford
Division of Infectious Diseases and Geographic Medicine, Stanford University School of Medicine, Stanford, CA, USA
Niaz Banaei, Stanley Deresinski & Amy Chang
Department of Pathology, School of Medicine, Stanford University, Palo Alto, CA, USA
Niaz Banaei
Department of Health Policy, Stanford University School of Medicine, Stanford, CA, USA
Mary K. Goldstein
Division of Primary Care and Population Health, Stanford University School of Medicine, Stanford, CA, USA
Steven M. Asch
Clinical Excellence Research Center, Stanford University, Stanford, CA, USA
Jonathan H. Chen

Authors

Fateme Nateghi Haredasht
Fatemeh Amrollahi
Manoj V. Maddali
Nicholas Marshall
Stephen P. Ma
Lauren N. Cooper
Andrew O. Johnson
Ziming Wei
Richard J. Medford
Sanjat Kanjilal
Niaz Banaei
Stanley Deresinski
Mary K. Goldstein
Steven M. Asch
Amy Chang
Jonathan H. Chen

Contributions

F.N.H., F.A. and M.V.M. built the ARMD dataset. All authors gave input into the database development process and contributed to writing the paper.

Corresponding author

Correspondence to Fateme Nateghi Haredasht.

Ethics declarations

Competing interests

J.H.C. reported being a co-founder of Reaction Explorer LLC, which develops and licenses organic chemistry education software; receiving paid consulting fees from Sutton Pierce, Younker Hyde MacFarlane, and Sykes McAllister as a medical expert witness; and receiving paid consulting fees from ISHI Health. F.N.H. reported receiving paid consulting fees from ISHI Health. The remaining authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Nateghi Haredasht, F., Amrollahi, F., Maddali, M.V. et al. Antibiotic Resistance Microbiology Dataset (ARMD): A Resource for Antimicrobial Resistance from EHRs. Sci Data 12, 1299 (2025). https://doi.org/10.1038/s41597-025-05649-7

Received: 26 March 2025
Accepted: 17 July 2025
Published: 26 July 2025
Version of record: 26 July 2025
DOI: https://doi.org/10.1038/s41597-025-05649-7