Background & Summary

Many areas of observational health research and clinical informatics research rely on accurate and complete race and ethnicity (RE) patient information, particularly for estimating disease risk1,2,3, assessing quality and performance metrics4, and identifying health disparities5,6,7,8. The electronic health record (EHR) provides a rich source of patient health data, but while RE is often stored in easily accessible structured EHR fields, this format often suffers from missing, inadequate, or inaccurate information9,10,11,12,13,14,15,16. For example, Polubriaginof et al. found missing RE data affected 25% of patients with data in large observational health databases and 57% of patients at a large academic medical center in New York City9. Finally, while there has been an acknowledged need for more granular information such as preferred language, this data is often not recorded in EHR systems17. Overall, missing RE data decreases a patient’s visibility within research and healthcare systems and can affect the allocation of resources in hospitals and health systems to best serve health equity goals. When done carefully and thoughtfully, the ability to supplement missing or inaccurate RE data is an important step toward increasing the diversity of patients represented in observational health research and supporting health equity, by filling in a common unobserved confounder. Further, the Affordable Care Act and other federal laws now include non-discrimination clauses18, compliance with which can only be assessed with adequate RE data at the patient- and provider-levels. Similarly, algorithmic bias, fairness, and recourse can trickle down into clinical guidelines starting from reported differences between races in epidemiological statistics such as prevalence19. 
Finally, given the rise of large language models in clinical care, operations, and revenue cycle management20, the training data available can lead to biased output that can impact clinical care and a hospital’s revenue. It is well known in the algorithmic fairness research field that there is no way to create risk scores that are fair with respect to the intersection of multiple legally-protected classes (such as a readmission risk score prediction algorithm that is fair with regard to subgroups defined using both race and sex21). Given that labelled training data is difficult to come by and federal laws such as HIPAA restrict the transmission of large language model parameters, reference standard RE annotations with high inter-rater reliability are a significant opportunity to ensure informed consent for clinicians, patients, hospitals, and health systems that use tools such as large language models to inform triage, decision-making, resource allocation, and revenue.

Inadequate RE categories can also mask important subgroup differences, in part due to a lack of sufficient granularity13,22,23,24,25,26,27,28. While still important for federal data reporting, concerns of inadequate data have generally revolved around the Office of Management and Budget’s (OMB) five race categories (American Indian or Alaska Native, Black or African American, Asian, Native Hawaiian or Other Pacific Islander, white) and two ethnic categories (Hispanic or Latino and Not Hispanic or Latino)29. The Institute of Medicine’s landmark report, Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care, expressed concern over the lack of more granular RE categories and how this hinders health disparities research13. Furthermore, research has called for adapting RE categories to a multiracial and multi-ethnic U.S. population. Without access to more granular information, current OMB category standards obscure subpopulations that can have distinct healthcare needs30,31,32,33. More granular information can highlight country or geographic region of origin and level of language proficiency and help distinguish differences within these broader categories. For example, major differences have been observed among people of Asian descent in the U.S. with respect to access to mental healthcare31 and cancer incidence32 based on differences in English language proficiency and country of origin, respectively. More granular race information can also be used to uncover disparities in clinical risk scores that would otherwise be concealed with coarse race groups34.

Clinical text often provides a rich, unstructured source of granular information related to RE, such as immigration status35, country of origin36, and preferred language17. Natural language processing (NLP) models can be trained to identify RE in clinical text to supplement and/or complement structured RE data that is missing, inaccurate, or lacking in granularity. For example, Sholle et al. developed a rule-based approach to extract RE categories from clinical text and achieved excellent performance for identifying Black and Hispanic patients37. Within the setting of a hospital in an affluent neighborhood of New York City, Sholle et al. found that clinical notes could increase positive documentation of RE data by upwards of 20% for Black and/or Hispanic patients with previously missing RE data37. One major challenge to training NLP models to identify RE data from clinical text, however, is the need for reference standard annotations, which can be costly and time-consuming to create. Publicly available reference standard annotations can support these tasks and future research on patterns of clinical documentation of RE in clinical text.

We present the Contextualized Race and Ethnicity Annotations for Clinical Text (C-REACT) dataset, two sets of publicly available reference standard annotations on 17,281 sentences from 12,000 patients from the MIMIC-III dataset, a corpus extracted from critical care units at Beth Israel Deaconess Medical Center between 2001 and 201238. The first set of annotations provides granular detail on RE at the span level within sentences. The second set of annotations are physician-assigned RE labels at the sentence level. Both annotation sets and their guidelines are made available to the research community to enable widespread use of more granular RE-related information in clinical notes and demonstrate how NLP can be leveraged to infer RE using clinical text.

While other datasets for RE exist, such as the Home Mortgage Disclosure Act data (https://www.consumerfinance.gov/data-research/hmda/historic-data/), race imputation using name information39, or aggregate clinical trials data (ClinicalTrials.gov), clinical note datasets for RE are difficult to make public given the privacy risks involved for individual patients. Research through the National NLP Clinical Challenges does offer access to clinical text annotations for a variety of tasks but does not include RE annotations at the level provided by C-REACT. Given that MIMIC-III is the only publicly accessible clinical dataset (with the required credentials) that combines patient data including clinical notes, labs, diagnosis codes, demographics, medications, and procedures on 59,652 patients, the addition of C-REACT to MIMIC-III can greatly enhance NLP research into RE extraction from clinical notes.

Methods

In this section, we describe our approach to annotating clinical text in two ways: (1) at the span level for RE-related information (i.e., RE indicators) and (2) at the sentence level for RE assignment. We then describe our analysis of the presence of indicators within sentences in relation to RE assignments.

Data and pre-processing

We extracted all sentences from 59,652 discharge summary clinical notes for 41,127 patients from the MIMIC-III dataset38 (version 1.4). We used NLTK40 to extract sentences and heuristics to handle clinical lists such as medication and condition lists. Sentences likely to contain RE information were identified using keywords related to patient demographics (e.g., “male”, “female”, “patient”) and section headings (e.g., “Past Medical History”, “PMH”, “Social History”, “SHX”). From the entire set of discharge summary sentences, those with demographic keywords and/or section headers were extracted as the candidate corpus in Table 1 (n=794,841). The comprehensive list of section header keywords used is “sshx”, “social history”, “social hx”, “pmh”, “past medical history”, “pmhx”, “hpi”, and “history of present illness”. Sentences with RE keywords (e.g., “Black”, “AA”, “Native American”, “Hispanic”, “Spanish”) were prioritized for explicit indicator span annotation (Table 2). Section heading, patient demographic, and RE keyword identification were not case sensitive. We conducted two separate annotation processes (one for RE indicators, one for labels) using 17,281 sentences sampled from the 794,841 sentences. This corpus comprised 13,507 notes for 12,000 patients. We refer to the corpus of 17,281 sentences as the central corpus since it is used in both RE indicator and label annotation phases. Table 1 provides more details on the central corpus.
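The keyword filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the section header list is the one quoted in the text, while the demographic and RE keyword lists contain only the examples given there (the comprehensive lists are in Table 2). Whole-word matching and the names `compile_keywords` and `categorize` are assumptions made for the sketch.

```python
import re

# Section headers: the list quoted in the text.
SECTION_HEADERS = ["sshx", "social history", "social hx", "pmh",
                   "past medical history", "pmhx", "hpi",
                   "history of present illness"]
# Demographic and RE keywords: only the examples given in the text,
# not the comprehensive lists (see Table 2).
DEMOGRAPHIC_KEYWORDS = ["male", "female", "patient"]
RE_KEYWORDS = ["Black", "AA", "Native American", "Hispanic", "Spanish"]


def compile_keywords(keywords):
    """Build one case-insensitive pattern from a keyword list.

    Whole-word matching (\\b) is an assumption of this sketch.
    """
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


HEADER_PATTERN = compile_keywords(SECTION_HEADERS)
DEM_PATTERN = compile_keywords(DEMOGRAPHIC_KEYWORDS)
RE_PATTERN = compile_keywords(RE_KEYWORDS)


def categorize(sentence):
    """Return keyword flags for a sentence, or None if it is not a candidate."""
    flags = {
        "header": bool(HEADER_PATTERN.search(sentence)),
        "dem": bool(DEM_PATTERN.search(sentence)),
        "re_match": bool(RE_PATTERN.search(sentence)),
    }
    # Candidate corpus: demographic keywords and/or section headers.
    if not (flags["header"] or flags["dem"]):
        return None
    return flags
```

Sentences for which `categorize` returns flags would form the candidate corpus, and those with `re_match` set would be the ones prioritized for indicator annotation.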

Table 1 Summary statistics for indicators in the corpus of sentences with demographic-related keywords and/or section headers (candidate corpus) and the corpus annotated for indicators and RE labels (central corpus). The central corpus was sampled from the candidate corpus. In the table, parentheses show percentages for the appropriate corpus in each column.
Table 2 Definitions of main terms used throughout this article. RE refers to race and ethnicity. The comprehensive lists for demographic and RE keywords are also included and were drawn directly from37.

Sentence sampling process

Of the 794,841 sentences extracted from MIMIC-III, 17,281 were sampled to create the central corpus. All sentences extracted contained section headers and/or demographic keywords. Sentences can be split into three main categories: (1) those with RE keywords (RE matches), (2) those with section headers (headers), and (3) those with only demographic keywords (dems). Sentences were sampled randomly within each category iteratively until sentences with RE keywords or headers were exhausted. More specifically, RE matches were randomly sampled at a rate of 50%, headers at 25%, and dems at 25%. Given that dems is the only mutually exclusive category, headers and matches have significant overlap and thus the percentages in Table 1 do not exactly reflect the sampling percentages. As the RE matches and headers categories overlap, we differentiate between sentences with RE keywords AND NO section headers, RE keywords AND section headers, and section headers AND NO RE keywords (Table 1). Sentences containing RE keywords and/or section headers were prioritized for sampling as we hypothesized that these were likely to contain RE discussions. While RE discussions were hypothesized to be rare in sentences with only demographic keywords, we randomly sampled from this subset to reduce bias in our dataset. More specifically, sentences likely to have positive RE labels are important for training, but a diversity of sentences without positive labels are also important as these are the most common sentences researchers will encounter in real-world settings.
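One possible reading of this iterative sampling scheme is sketched below. The 50%/25%/25% shares and the stopping rule (stop once the RE-match or header pool is exhausted) come from the text; the batch size, the seed, the handling of overlapping sentences, and the function name `sample_central_corpus` are assumptions for illustration only.

```python
import random


def sample_central_corpus(re_matches, headers, dems, batch_size=100, seed=0):
    """Sample sentence ids in batches of 50% RE matches, 25% headers, 25% dems.

    Each argument is a list of sentence ids. A sentence that belongs to both
    the RE-match and header pools is only ever sampled once.
    """
    if batch_size < 4:
        raise ValueError("batch_size must be at least 4")
    rng = random.Random(seed)
    pools = {"re_matches": list(re_matches),
             "headers": list(headers),
             "dems": list(dems)}
    for pool in pools.values():
        rng.shuffle(pool)
    quotas = {"re_matches": batch_size // 2,
              "headers": batch_size // 4,
              "dems": batch_size // 4}
    sampled, seen = [], set()
    # Stop once either the RE-match pool or the header pool is exhausted.
    while pools["re_matches"] and pools["headers"]:
        for name, quota in quotas.items():
            taken = 0
            while taken < quota and pools[name]:
                sentence_id = pools[name].pop()
                if sentence_id not in seen:  # skip overlap between pools
                    seen.add(sentence_id)
                    sampled.append(sentence_id)
                    taken += 1
    return sampled
```

Because the RE-match and header pools overlap, the realized proportions in such a sample would not exactly equal the nominal sampling rates, which is consistent with the note about Table 1.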

Explicit span-level indicator annotation process

The explicit indicator annotation process was performed by non-physicians (n=4), and focused on identifying spans of text that explicitly convey RE-relevant information. All 17,281 sentences in the central corpus were annotated for RE indicators. We chose four categories of indicators to capture explicit spans of text potentially describing RE: (1) spans of text that discuss country/nation of origin or geographic ancestry (country); (2) spans of text that discuss primary, preferred, or spoken language (language); (3) spans of text that discuss direct race mentions (race); and (4) spans of text that discuss direct ethnicity mentions (ethnicity). It is important to note that the race and ethnicity indicators follow U.S.-centric definitions of race and ethnicity29. Examples of these four indicators can be found in Table 3. The annotation guidelines for indicators are available in the PhysioNet dataset under the file name “File_1_Annotation_Guidelines_for_Race_and_Ethnicity_Indicators.docx”. Annotations were conducted using the software Prodigy (https://prodi.gy/)41.

Table 3 Descriptions and examples for the four race and ethnicity indicators (country, language, race, ethnicity).

Macro F1 was used to measure inter-annotator agreement instead of the more traditional Cohen’s kappa, given that the annotation task is at the span-level42,43. The F1-score is defined as the harmonic mean between precision and recall. Given that the number of negative cases is unknown, the F1-score is more appropriate than Cohen’s kappa when annotating spans of text42. Additionally, this measure was computed for exact span matches rather than across tokens. Aggregating the F1-score across indicator classes was performed using macro F1 as it is more sensitive to imbalanced data than micro F1. When calculating macro F1 scores for a pair of annotators, one annotator was treated as the reference standard and the other annotator was compared to their annotations. For a pair of annotators, macro F1 scores are the same regardless of which annotator is chosen as the reference standard annotator. All sentences were double-coded and annotators iteratively updated guidelines until sufficient agreement was present (>0.85 macro F1). While iterating, all annotators converged as a group to discuss any sentences with different annotations and settle on the correct annotation to be used. If necessary, the annotation guidelines would be updated to prevent the potential for similar disagreements in the future. When a pair of annotators reached sufficient agreement, they were allowed to annotate independently and continue to provide input on the annotation guidelines. The annotation guidelines are publicly available to support reproducibility.
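The exact-match macro F1 described above can be illustrated with a short sketch. The span representation as `(sentence_id, start, end, label)` tuples and the function names are assumptions, not the format of the released files. Per label, F1 is computed as 2TP / (2TP + FP + FN), which is unchanged when the two annotators swap roles because only FP and FN are exchanged.

```python
from collections import defaultdict


def span_f1_by_label(reference, candidate):
    """Exact-match F1 per indicator label.

    Each argument is a set of (sentence_id, start, end, label) tuples;
    a candidate span counts as correct only if all four fields match.
    """
    ref_by_label, cand_by_label = defaultdict(set), defaultdict(set)
    for span in reference:
        ref_by_label[span[3]].add(span)
    for span in candidate:
        cand_by_label[span[3]].add(span)
    scores = {}
    for label in set(ref_by_label) | set(cand_by_label):
        tp = len(ref_by_label[label] & cand_by_label[label])
        fp = len(cand_by_label[label] - ref_by_label[label])
        fn = len(ref_by_label[label] - cand_by_label[label])
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(reference, candidate):
    """Unweighted mean of the per-label F1 scores."""
    scores = span_f1_by_label(reference, candidate)
    return sum(scores.values()) / len(scores) if scores else 0.0
```

Each indicator class contributes equally to the mean, so a rare class such as ethnicity weighs as much as a frequent one.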

Assigning race and ethnicity labels to sentences

The RE labeling process was conducted by physicians (n=10) with a medical degree and at least one year of post-graduate residency experience. All sentences from the central corpus were annotated by at least one physician for RE labels. Physicians were provided with two subsets of sentences to annotate: one subset (n=5,834) contained sentences that all physicians annotated independently (shared annotation subset), and the other subset (n=11,447) contained the remainder of the sentences split evenly among physicians to annotate (single annotation subset). For a graphical depiction of this information please see Fig. 1. Each physician annotated approximately 1,144 sentences in their single annotation subset. The shared annotation subset contained 4,834 sentences with at least one positive RE indicator span annotation and 1,000 sentences randomly sampled from sentences without any RE indicator. It was later determined that one of these 1,000 sentences contained a positive indicator span that was previously missed; the final data reflects this change. None of the sentences in the single annotation subset contained any positive indicators based on physician review during the indicator annotation step. It was later determined that five sentences in the single annotation subset contained positive indicator spans that were previously missed; the final data reflects this change. The annotation guidelines for RE categories are available in the PhysioNet dataset under the file name “File_2_Annotation_Guidelines_for_Race_and_Ethnicity_Assignments.docx”. We divided sentences such that the shared annotation subset contained all the sentences with known indicators because we hypothesized these sentences to have the vast majority of the positive RE labels and thus require multiple annotators to confirm a positive label. The single annotation subset contained no known indicators and we hypothesized that it would have very few positive RE labels. 
Our hypotheses were confirmed (see the Technical Validation section for more details).

Fig. 1

Subsets of sentences to be annotated for race and ethnicity labels from the central corpus. More green means that more race and ethnicity (RE) indicator spans are included in the subset.

In the interest of exploring the relationship between RE labels and explicit indicator annotations, we only provided physicians with sentences from the central corpus and did not include explicit indicator annotations performed by non-physicians. We also provided physicians with sentence-level RE labeling guidelines, which emphasized that the entire content of the sentence could be used to infer RE labels, and that physicians could rely on any previously acquired knowledge related to the sentence (e.g., clinical training or life experience) to infer RE if they so desired. Physicians were not provided with other information about the patient (e.g., patient identifiers). The RE label set consists of positive and negative labels. The positive labels included the U.S. census categories and an additional label, “Not Covered”, to signal the presence of an RE category that falls outside of the census categories (Table 4). A negative label, “No Information Indicated”, was included to explicitly convey that a sentence did not contain sufficient, if any, RE information to make a positive assignment; it allowed annotators to abstain from assigning a positive label for any sentence. Each sentence had to be assigned at least one race and at least one ethnicity label (either positive or negative), and annotators could assign more than one positive label per sentence. This follows the structured data in MIMIC-III, which also allows for more than one RE category to be documented for a patient. In addition to the 10 individual annotations per sentence in the shared annotation subset, we also kept track of global annotation labels composed of the majority vote (n ≥ 5) among the 10 physicians.
We assigned an RE label to a sentence if five or more physicians agreed on the assignment through their annotations. We measured physician agreement using Cohen’s kappa for each category in the RE labeling task. Agreement for each physician’s annotations was measured against a leave-one-out (LOO) majority vote corpus, created using the remaining nine annotators.
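For a single RE category, the majority vote and the leave-one-out comparison can be sketched as below. The five-vote threshold for the ten-physician vote comes from the text. The threshold applied to the nine remaining annotators is not stated, so the default of five (a strict majority of nine) is an assumption, as are the binary vote matrix layout and the function names.

```python
def majority_vote(votes, threshold):
    """votes[i] lists 0/1 votes for sentence i; return one 0/1 label per sentence."""
    return [int(sum(row) >= threshold) for row in votes]


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    rate_a, rate_b = sum(a) / n, sum(b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return float("nan")  # undefined when both raters are constant
    return (observed - expected) / (1 - expected)


def loo_kappas(votes, loo_threshold=5):
    """Kappa of each annotator against the majority vote of the others.

    votes[i][j] is 1 if annotator j assigned the category to sentence i.
    loo_threshold=5 assumes a strict majority of nine remaining annotators.
    """
    n_annotators = len(votes[0])
    kappas = []
    for j in range(n_annotators):
        own = [row[j] for row in votes]
        others = [row[:j] + row[j + 1:] for row in votes]
        kappas.append(cohens_kappa(own, majority_vote(others, loo_threshold)))
    return kappas
```

Excluding an annotator's own vote from the reference keeps that annotator from inflating their own agreement score.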

Table 4 Categories used for sentence-level race and ethnicity labeling.

During our analysis, we also consolidated our sentence-level RE assignments up to the patient level to compare structured and unstructured sources of RE information. All sentence-level assignments were combined for a patient using a union operation over all sentence-level RE labels. We assigned a patient an RE label if any of their sentences were assigned that label (through the majority vote for the shared annotation sentences and a single vote for the single annotation sentences). Similarly, patient-level structured data in MIMIC-III were combined for each patient using the union operation. MIMIC-III only provides “Ethnicity” information on patients, which captures both race and ethnicity categories, and was mapped to the RE categories used in this work. MIMIC-III Ethnicity category mappings follow the definitions provided in Table 4. A complete mapping can be found in the MIMIC-III mapping table in the PhysioNet dataset under the file name “File_3_MIMIC_Ethnicity_Category_Mapping.docx”.
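The union operation can be illustrated with a small sketch. The dictionary inputs, the label strings, and the function name are hypothetical. Dropping the negative label before taking the union is also an assumption, made here because the comparison with structured data concerns positive categories.

```python
from collections import defaultdict

NEGATIVE_LABEL = "No Information Indicated"


def patient_level_labels(sentence_labels, sentence_to_patient):
    """Union sentence-level labels into one label set per patient.

    sentence_labels maps sentence_id -> set of assigned labels (majority vote
    or single vote); sentence_to_patient maps sentence_id -> patient_id.
    """
    patients = defaultdict(set)
    for sentence_id, labels in sentence_labels.items():
        patient_id = sentence_to_patient[sentence_id]
        # Assumption: the negative label carries no patient-level information.
        patients[patient_id] |= {l for l in labels if l != NEGATIVE_LABEL}
    return dict(patients)
```

A patient whose sentences all carry the negative label ends up with an empty set, which corresponds to having no RE information recovered from text.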

We acknowledge the nuance, diversity, and history that the terms in Table 4, and their respective abbreviations, are unable to convey. In particular, we want to highlight the terms Hispanic, Latino, Latina, and Latinx. We originally used these terms during the annotation process to be as inclusive as possible with this category, though we recognize that not everyone assigned to this category would equally identify or agree with each term44. Each term has its own unique limitations. For example, the term Hispanic ignores ancestries of non-Spanish origin45; use of Latino/Latina implies a gender binary46,47; and the umbrella term Latinx is an Anglicization47. In this work we have opted to use the term “Latino” in the interest of brevity to refer to the broader grouping that includes Hispanic, Latino, Latina, and Latinx (and non-Latino for its complement, which includes non-Hispanic, non-Latino, non-Latina, and non-Latinx) as Latino is the preferred term from the AP Stylebook and can refer to people from (directly or ancestrally) Spanish-speaking lands or cultures as well as Latin America48. We acknowledge that this term may not be the preferred term among all patients49. While this concern applies to all racial and ethnic categories used in this research, it is particularly important for the Latino category as we are actively choosing to use one term over others.

In this research, we used federally recognized RE categories while performing sentence-level RE label annotation. Federally recognized RE categories in the U.S. census have recognized limitations13. However, they continue to be widely used in research and hospital quality reporting metrics, and thus they continue to have value for research. Additionally, given that each sentence is annotated for RE labels and indicators, the indicators can provide information on nuance that the U.S. census-derived RE labels are missing.

Data Records

Both sets of reference standard annotations are available on PhysioNet50. While PhysioNet51 data are freely accessible, users are required to register, complete a credentialing process, and sign a data use agreement. The C-REACT Dataset project page on PhysioNet provides further details on the dataset and the access application process (https://doi.org/10.13026/t9ka-6k29)50.

There is one main folder containing two jsonl files with span-level RE indicators and sentence-level RE assignments as well as a subdirectory with raw RE assignment files. All records contain a sentence identifier (sentence_id) that can link to sentences in other files. The rest of this section outlines the columns of each file.

The ‘indicators_df.jsonl’ file (17,281 sentences from the entire central corpus) contains information from the original MIMIC-III NOTEEVENTS file, including identifiers for visits (visit_id), patients (patient_id), and notes (note_id). Output from the annotation software Prodigy (https://prodi.gy/) (spans) is also available in the file. Finally, tokenized sentence (text) and tokens (tokens) are presented.

The ‘all_re_assignments_df.jsonl’ file (17,281 sentences from the entire central corpus) contains sentence (text) and sentence identifier (sentence_id) columns. Additionally, the file contains binarized columns for all race and ethnicity categories described in Table 4. Each racial category has ‘RACE’ as a prefix, while ethnicity categories have the prefix ‘ETH’. The final two binary columns (shared_subset, single_subset) indicate the source of the RE assignments as either the shared or single annotation subset, respectively. All shared annotation subset assignments are majority vote assignments.
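As a hedged illustration of how the two files might be read and linked, the sketch below uses only the Python standard library. The field `sentence_id` and the ‘RACE’/‘ETH’ prefixes come from the description above. The exact category column names are not listed here, so the code selects them by prefix only, and the helper names are hypothetical.

```python
import json


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def assigned_categories(record):
    """Names of the RACE*/ETH* columns that are set for one sentence."""
    return sorted(key for key, value in record.items()
                  if key.startswith(("RACE", "ETH")) and value)


def join_on_sentence_id(indicators, assignments):
    """Pair each indicator record with its RE assignment record."""
    by_id = {record["sentence_id"]: record for record in assignments}
    return [(record, by_id[record["sentence_id"]])
            for record in indicators
            if record["sentence_id"] in by_id]


# Example usage, assuming the files have been downloaded from PhysioNet:
# indicators = read_jsonl("indicators_df.jsonl")
# assignments = read_jsonl("all_re_assignments_df.jsonl")
# pairs = join_on_sentence_id(indicators, assignments)
```

Joining on `sentence_id` gives, for each sentence, both the span-level indicators and the sentence-level labels, which is the pairing used in the technical validation.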

We also provide the raw RE assignment files for each annotator in a second folder. The folder structure for each annotator is the same. Each annotator’s folder contains six xlsx files with ‘all_clinician_sentences’ as a prefix and numbered 0–5 (shared annotation subset). Within this folder there is another folder containing the single annotation subset xlsx file. All xlsx files follow the format outlined in Fig. 1 of the File_2_Annotation_Guidelines_for_Race_and_Ethnicity_Assignments.docx file on PhysioNet and contain information on the sentence identifier (ID), sentence text (Sentence), and all RE categories previously described. Annotators marked their annotations by adding any character (often ‘x’) to a cell in the xlsx files.

Technical Validation

We present validation results for the RE labels (sentence-level) and indicator annotations (span-level). For RE category annotations, we measured physician-annotator agreement for the RE labels and concordance between structured and unstructured sources for patient-level RE information. For RE indicator annotations, we measured inter-annotator agreement and the proportion of sentences with RE positive labels but no positive RE indicator annotation.

Validating race and ethnicity indicator span-level annotations

Indicators were double coded until all annotator pairs reached sufficient agreement of >0.85 macro F1. Until sufficient agreement was reached, all disagreements were adjudicated through discussion and the annotation guidelines were iteratively updated. Then, we measured how often the majority vote RE labels were assigned to sentences that did not contain at least one indicator annotation. Out of all 12,411 sentences without positive indicator annotations, only six were assigned a positive race and ethnicity label. In other words, six out of 4,811 sentences (0.1%) were assigned a positive RE label but did not contain a positive indicator. All six sentences occurred in the single annotation subset and contained spans of text that were not considered indicative of race and/or ethnicity in this work, i.e., cuisine, occupation, immigration status, and wars/conflicts. These results provide evidence for high agreement on indicator annotations and confirm that our indicators covered a vast majority of RE discussions in the corpus.

Validating race and ethnicity label annotations

In the sentence-level RE labeling task on the 5,834 shared annotation subset sentences, physicians had moderate to strong agreement (Cohen’s kappa >0.61) when compared to the LOO majority vote assignments. When averaging an annotator’s kappa scores for RE categories with more than 300 assignments, half the annotators had almost perfect agreement (Cohen’s kappa >0.81) and all annotators had at least substantial agreement (0.61–0.80)52. Perfect agreement is indicated using bold font. Overall, clinical annotators had the lowest agreement for the non-Latino and the “No Information Indicated” categories for both race and ethnicity (Table 5). Another table with identical information and shaded agreement scores is included in Supplementary Table 1 in the supplementary file.

Table 5 Cohen’s kappa for race and ethnicity categories with at least 300 sentences assigned.

The majority vote sentence-level RE assignments represented 4,575 patients, whose RE labels could be compared to structured RE data in MIMIC-III. We used this subset of patients to compare the two sources of RE data. For this analysis, race and ethnicity categories were collapsed to match MIMIC-III’s single “Ethnicity” column, which contains both race and ethnicity categories. After mapping RE categories from the MIMIC-III demographics table using “File_3_MIMIC_Ethnicity_Category_Mapping.docx” from PhysioNet, a total of 3,931 (85.9%) patients had at least one positive race or ethnicity category in structured data, and 527 (11.5%) patients had race or ethnicity information recovered through unstructured data, leading to 4,458 patients (97.4%) with at least one positive race or ethnicity category. Looking specifically at positive race categories, a total of 4,114 (89.9%) patients had race data from structured and/or unstructured sources, with 772 (16.9%) patients being unique to structured and 489 (10.7%) being unique to unstructured (left-hand side of Fig. 2). Of the 489 patients with race information recovered through text, most patients had race information related to being white (n=402), Black or African American (Black/AA) (n=40), and Asian (n=39). Positive ethnicity categories were missing more often than positive race categories, with only 390 (8.5%) patients with positive ethnicity information from any source, 38 (0.8%) patients unique to structured, and 80 (1.7%) unique to unstructured (right-hand side of Fig. 2).
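The overlap between the two sources follows from the counts reported above by simple subtraction. The short sketch below makes that arithmetic explicit; the function name and dictionary keys are illustrative.

```python
def venn_counts(total_any_source, only_structured, only_unstructured):
    """Derive the remaining Venn regions from the three reported counts."""
    both = total_any_source - only_structured - only_unstructured
    return {
        "both": both,
        "structured": only_structured + both,
        "unstructured": only_unstructured + both,
    }


# Counts reported in the text for positive race and ethnicity categories.
race = venn_counts(total_any_source=4114, only_structured=772, only_unstructured=489)
ethnicity = venn_counts(total_any_source=390, only_structured=38, only_unstructured=80)

print(race["both"])       # 2853 patients with race data in both sources
print(ethnicity["both"])  # 272 patients with ethnicity data in both sources
```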

Fig. 2

Venn diagram comparing structured and unstructured data sources in their overlap of labels associated with patient race (left) and ethnicity (right) data. Labels concerning race (i.e., Black or African American, Native American or Alaskan Native, Native Hawaiian or Other Pacific Islander, Asian, white, Not Covered) were noted among 4,114 patients, while labels related to ethnicity (i.e., Latino, Not Covered) were attributed to 390 patients.

From the same subset of 4,575 patients with majority vote sentence-level assignments, a substantial number of patients had both structured and unstructured data for either ethnicity or race, for which there was high concordance. Of those patients who had structured and unstructured race data (n=2,853), 98.9% had at least partial agreement between the two sources (n=2,821), with the vast majority (n=2,819) of those agreements being perfect. All 272 patients with structured and unstructured data for the Latino category had perfect agreement between the two data sources. This agreement provides evidence that RE inferences align with other sources for RE information in MIMIC-III. A small number of patients (n=5) had multiple RE categories documented from the structured and/or unstructured data. Our analysis allows for multiple categories to be documented for a patient, much like the structured RE data in MIMIC-III.
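A minimal sketch of how concordance between the two label sets of a patient could be classified is given below. The definitions are assumptions of this sketch: "perfect" means the two sets are identical, and "partial" means they share at least one category without being identical. The function names are hypothetical.

```python
from collections import Counter


def agreement(structured, unstructured):
    """Classify concordance between two non-empty sets of RE categories."""
    if structured == unstructured:
        return "perfect"
    if structured & unstructured:
        return "partial"
    return "none"


def summarize(pairs):
    """Count agreement levels over (structured, unstructured) set pairs."""
    return Counter(agreement(s, u) for s, u in pairs)
```

Under these definitions, "at least partial agreement" is the sum of the perfect and partial counts.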

We examined the discharge summaries of patients who had structured race or ethnicity data but no RE data derived from clinical notes. For the 772 patients missing unstructured race data, most notes examined did not have any mentions of race indicators in the clinical notes or the sentences annotated from those notes. There were some cases of misspellings that were not handled by our regular expressions (e.g., “intertpreter [sic]”), and there were 55 patients who had de-identified country (e.g., “[**country 456**]”) information that was not used to assign patients any RE categories. Finally, there were patients who had race data from a few annotators, but not enough to meet the majority vote requirement, and thus received no positive assignment. Similarly, for the 38 patients with only structured ethnicity data, nine had sentences with annotator assignments that failed to meet the majority vote and 12 had de-identified country mentions. For both race and ethnicity assignments, patients had sentences where most annotators assigned a positive race or ethnicity label but did not agree on which label to use, and so no label was assigned in the majority vote. For example, annotators labeled a sentence identifying a patient as Egyptian as a positive indicator for Black/AA, Asian, white, and Not Covered.

Validating low annotator agreement sentences

To better understand why certain sentences had low agreement between annotators for RE assignments, OBDW manually sampled and inspected low agreement sentences for each RE category. Within an RE category, all low agreement sentences were used if there were fewer than 100, otherwise, sentences were sampled with priority given to sentences with lower agreement (Fig. 3).

Fig. 3

Sampling tiers for sentences with low annotator agreement. Within the triangle are the number of votes in each tier, with the sampling percentage on the left side. For each tier, sentences were sampled up to 10% of the total low agreement sentences for a given RE category. Tiers are symmetric, represented by the left and right-side numbers within the triangle. For example, a sentence with 3 votes for a given category has 7 annotators who did not vote for that category, while a sentence with 7 votes for a category has 3 annotators who did not vote for that category.

Table 6 summarizes general observations for sentences with low agreement across RE categories. RE categories with no low agreement sentences are excluded from the table. Generally, certain indicators specific to each category commonly occurred in low agreement sentences. The reasons for which indicators could lead to low agreement are likely specific to each category. Another potential reason for diverse labeling opinions was how de-identified country mentions were interpreted or used. While specific RE categories could sometimes be inferred, No Information Indicated and Not Covered were common choices for sentences with de-identified indicators.

Table 6 Observations of potential reasons for low agreement for sampled sentences with low annotator agreement across RE categories.

As previously noted, the two ethnicity categories non-Latino and No Information Indicated had lower agreement than other RE categories (Table 5). From the low agreement sentences for non-Latino we observed that annotators were split on whether direct race mentions (e.g., “black”, “white”, “AA”) and certain language and country indicators (e.g., “French Creole”, “Chinese”) were uninformative for or indicative of a patient being non-Latino. The category No Information Indicated often contained votes from annotators who might not have used the previously discussed direct race mentions to infer that someone is non-Latino. Additionally, there were multiple de-identified mentions under No Information Indicated, which could point to the need for a modified “Not Covered” category that is explicitly designed for de-identified text. For these two lower agreement categories (non-Latino and No Information Indicated), annotators seem to be split on how to infer ethnicity from information that is neither a direct ethnicity mention nor a language mention like “Spanish”.

Low agreement examples from the non-Latino category could indicate that there is room for interpretation on what kinds of phrases can be used to infer that someone is non-Latino. Previous research has noted that Latino patients often do not identify with the OMB-defined race categories used in this work9,53, and it is possible that a similar phenomenon is occurring here for the category non-Latino. More specifically, certain physician annotators could have different views on how OMB-defined race and ethnicity categories should inform one another and view Latino/non-Latino as incongruent with certain racial information or just fluidly defined53,54,55,56. This can happen when hospital workers do not feel adequately trained to collect RE data54. The raw annotation data provides insight into the spectrum of potential interpretations by physicians.

Validating potential false negative RE assignment annotations

To validate potential false negatives, we first examined sentences with no majority vote positive RE assignment that included an indicator. Second, we examined false positive mistakes made by deep learning models trained to identify sentences with RE information. All sentences were examined by OBDW.

When examining sentences with indicators but no majority vote positive RE assignment, we observed that de-identified tokens and diverse opinions on using indicators to infer RE were most common. De-identification issues are not necessarily false negatives, but rather limitations of the data and of the absence of a category that explicitly handles de-identified RE information. The closest cases to false negatives are the diverse opinions between physicians on how to use country or language indicators to infer RE. Examples of indicators include “Canada”, “Cantonese”, and “Portuguese”. Some of these examples may not actually be false negatives. For example, one sentence contained “Canada” as the only indicator and most annotators agreed that this did not convey any RE information. However, other examples do seem to reflect some of the variety in how physicians interpret indicators, such as a sentence with the indicator “Portuguese” that had votes for all RE categories except Asian and Native Hawaiian or Other Pacific Islander.

For the second approach, we examined false positive mistakes made by models that are further described in Bear Don’t Walk et al. and were trained, validated, and tested on the C-REACT dataset to identify sentences with information for the RE categories Black/AA, Asian, white, and Latino57. All modelling false positive sentences and vote counts were inspected. Upon examination, most false positives were modelling mistakes and did not indicate any mistakes by the annotators. In one case, the model assigned a label and our examination revealed that a positive label by annotators could have been warranted. In this case, only four annotators had assigned the label Asian to a sentence with the indicator “Sri Lankan”.

Overall, these analyses of lower agreement sentences indicate that there are likely nuances in how physicians use certain textual data to infer RE labels. Additionally, we have attempted to manually identify and correct potential errors, where appropriate, to ensure high quality data. However, given the potential for rich interpretation by annotators, there may be assignments that not all researchers will agree with. These different interpretations may be a signal to pause and look further into potential assumptions leading to an RE assignment. The voting data for each assignment provides both nuance and potential limitations, allowing users to investigate lower agreement labels.

Usage Notes

Data are publicly available through PhysioNet and are subject to their credentialing process and data use agreement. The credentialing process includes training on HIPAA, human subjects research, privacy and confidentiality, and principles to support the ethical conduct of research. Furthermore, users must sign a data use agreement to openly share code related to publications using MIMIC-III data while protecting data security and patient privacy. The authors affirm that they have followed these data use and ethical guidelines as well. Approved users can download the C-REACT dataset from its PhysioNet project page (https://doi.org/10.13026/t9ka-6k29).

We recommend using the Python library pandas to work with the provided files (e.g., the jsonl files may be read using the ‘read_json('file_name.jsonl', lines=True)’ function). We provide code to work with the raw RE label annotation files and functions to determine the majority vote assignments. Additionally, we provide examples for working with indicator span data in ‘indicators_df.jsonl’. Working with the raw RE assignment files can be accomplished using the scripts and functions provided in the GitHub repository discussed in the Code Availability section below. While we provide a single jsonl file for the RE labeling, researchers should be aware that sentences from the shared annotation and single annotation subsets are differentiated using the “shared_subset” and “single_subset” columns. The columns “patient_id” and “visit_id” map to columns within the original MIMIC-III data. The mappings from C-REACT to MIMIC-III are “patient_id” to “SUBJECT_ID”, and “visit_id” to “HADM_ID”.
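As a minimal sketch of the loading and mapping steps described above, the following reads jsonl records with pandas and renames the identifier columns to their MIMIC-III counterparts. The toy records, the boolean encoding of “shared_subset” and “single_subset”, and the helper name `load_labels` are assumptions for illustration only; the actual file schema should be checked against the dataset documentation and the authors' repository.

```python
import io

import pandas as pd

# Toy stand-in for a C-REACT jsonl file. The values and the boolean encoding
# of the subset columns are assumptions, not the real schema.
toy_jsonl = "\n".join([
    '{"patient_id": 1, "visit_id": 100, "shared_subset": true, "single_subset": false}',
    '{"patient_id": 1, "visit_id": 101, "shared_subset": false, "single_subset": true}',
    '{"patient_id": 2, "visit_id": 200, "shared_subset": true, "single_subset": false}',
])


def load_labels(source):
    """Read jsonl records and rename ID columns to the MIMIC-III names."""
    df = pd.read_json(source, lines=True)
    return df.rename(columns={"patient_id": "SUBJECT_ID", "visit_id": "HADM_ID"})


# In practice `source` would be a path to a downloaded jsonl file.
labels = load_labels(io.StringIO(toy_jsonl))

# Keep only sentences from the shared annotation subset (assumed boolean flag).
shared = labels[labels["shared_subset"]]
```

After renaming, the frame can be joined to MIMIC-III tables on “SUBJECT_ID” and “HADM_ID”.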

If this corpus is used to train models to infer RE from clinical text, we suggest that researchers split training and test sets along patient ID rather than visit ID to limit data leakage and better emulate real-world settings. It should also be noted that these sentences are not drawn randomly from MIMIC-III clinical notes and that we prioritized sentences with likely documentation of RE labels and indicators, while also drawing clinical notes without this information. This was done to balance identifying positive labels while limiting sampling bias. Please see Table 1 for more information on how these sentences are distributed in MIMIC-III clinical notes. Finally, researchers using the C-REACT dataset should be aware that while MIMIC-III offers a great opportunity to train models on real-world clinical data (often difficult to obtain given security and privacy concerns), MIMIC-III comes from a non-representative health organization in Boston, Massachusetts. While the indicators in this work likely sufficiently covered this population, they might not generalize to other populations or discussions of race and ethnicity, specifically for people from Native American, Alaskan Native, Hawaiian, and Pacific Islander populations, of which there is limited representation in our corpus.
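One way to split along patient ID is sketched below, under stated assumptions: the helper `split_by_patient`, the 20% test fraction, the random seed, and the toy data are all hypothetical and are not part of the released code. The point is only that every sentence from a given patient lands on one side of the split.

```python
import numpy as np
import pandas as pd


def split_by_patient(df, test_fraction=0.2, seed=0, patient_col="patient_id"):
    """Split rows so that no patient appears in both train and test."""
    patients = df[patient_col].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(patients)
    # Hold out at least one patient for the test set.
    n_test = max(1, int(round(test_fraction * len(shuffled))))
    is_test = df[patient_col].isin(shuffled[:n_test])
    return df[~is_test], df[is_test]


# Toy data: several patients have more than one visit or sentence.
toy = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 4, 5, 5],
    "visit_id": [10, 11, 20, 20, 30, 40, 50, 51],
})
train, test = split_by_patient(toy)
```

Splitting on visit ID instead could place different visits of the same patient in both sets, which is the leakage the recommendation above is meant to avoid.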

Beyond technical usage notes, there are also notable ethical concerns. The C-REACT dataset is intended to inform future research about how granular RE-related information manifests in clinical notes and can be used to infer RE labels through NLP. While creating this dataset we balanced the importance of granular information with the established use of broader RE categories from the U.S. census. We encourage future researchers to be intentional and transparent about their assumptions and definitions for RE categories when using this dataset at whatever level of granularity. Importantly, researchers should consider whether the federally recognized categories provided in this dataset are appropriate or whether the more granular information from RE indicators is needed. Finally, self-reported RE data is still the reference standard and we cannot guarantee that what is reported in the clinical note is reflective of a patient’s self-reported racial and ethnic identity. Still, RE information derived from clinical notes can be used to complement self-reported RE information and mitigate missingness, while potentially providing more nuanced information. Because we strongly believe that granular information can provide key insights on discussions of broad RE categories and that RE categories are dependent on the research question at hand, we do not provide pre-defined training, validation, and test sets for benchmarking purposes.

RE labels played a dominant role in the analysis presented here. While RE labels can be used for sentence classification and RE indicators can be used for span-level tasks, there are many nuances in how these two sets of annotations can be leveraged. Thus, we provide a non-exhaustive list of research projects that can make use of the indicator and/or label data. As previously mentioned, Bear Don’t Walk et al. used the RE labels to train models to identify sentences with positive RE mentions and assessed learned associations between textual inputs and model classifications57. C-REACT’s combination of sentence-level labels and span-level indicators allowed Bear Don’t Walk et al. to assess how well model-derived salient features aligned with the manually identified indicator spans, and they found that high classification performance may mask potentially concerning learned associations. Future research may leverage span-level features to augment label classification training while pushing models to use certain features and feature types58. Additionally, different interpretations between physician RE label annotations can be incorporated into model training to estimate uncertainty while improving task performance59. Researchers may leverage only the RE label data to train a model and interrogate differences between structured and text-based sources for RE data within the EHR. If researchers choose to forgo the sentence-level labels, spans can be used for named entity recognition and leveraged in downstream analyses such as large-scale analysis of patterns of RE discussion for various patient groupings.