Background & Summary

Tree nut allergy poses a challenge in the field of allergology because of the adverse reactions a patient may suffer, such as anaphylaxis. These reactions are a consequence of the immune system’s response to specific proteins present in nuts. Their severity ranges from mild symptoms, such as itching and hives, to more severe manifestations affecting the respiratory and cardiovascular systems. Anaphylaxis is a severe and potentially life-threatening immune response characterized by difficulty breathing, swelling of the airways, a drop in blood pressure, and anaphylactic shock. This reaction is therefore a cause for concern in the medical community, as it requires urgent medical intervention when it progresses rapidly. Nuts can also behave as hidden allergens, and small amounts of these foods can produce severe reactions. In addition, they are a cause of cross-reactivity and co-sensitization between different nuts or with other plant foods (fruits, seeds, etc.), which makes their clinical management and diagnosis difficult and constitutes a health problem.

In this work, a corpus of clinical progress notes annotated by allergists is introduced. This corpus is a valuable resource for training and testing systems that process Spanish clinical texts. Spanish is the fourth most spoken language in the world1; compared to English, Spanish is a highly inflectional language with a richer morphology, in which morphemes encode many syntactic, semantic, and grammatical properties of words (such as gender and number). From a syntactic perspective, Spanish texts feature more subordinate clauses and lengthy sentences with a high degree of word-order flexibility; for example, the subject is not restricted to appearing before the verb in a sentence. As a result, Spanish clinical texts have particularities that require specific language resources to train and test language processors.

Regarding the creation of corpora in the health domain, the scientific community has made several efforts in the field of Spanish clinical texts. In2, the authors introduce the first version of the NUBes corpus (Negation and Uncertainty annotations in Biomedical texts in Spanish). As part of an ongoing study, the corpus comprises 29,682 sentences from anonymized health records annotated with negation and uncertainty. The paper presents the main annotation and design decisions and a thorough comparison with comparable Spanish corpora. The authors state that NUBes is the largest publicly available negation corpus in Spanish and the first to include annotations for speculative cues, scopes, and events. Another approach was presented in3, where the authors describe an open-source corpus for section identification in Spanish health records. A corpus of unstructured clinical records, in this case progress notes written in Spanish, was annotated with seven major section types. As a result of this research, an annotated corpus was defined, and a new evaluation script and a baseline model were made freely available to the community. In4, the authors defined a process for identifying and extracting relevant symptom-related data from medical notes written in Spanish. In this case, a corpus of 98 electronic medical records of patients diagnosed with the coronavirus SARS-CoV-2 (COVID-19) was used. With the collaboration of three experts, each medical note from the COVID-19 patient corpus was manually labelled. Moreover, a systematic review of clinical texts in languages other than English performed in5 highlights the research of6 for Spanish. The definition of a gold-standard corpus of adverse drug reactions, known as IxaMed-GS, is one of the primary contributions of that research. A team of physicians and pharmacists working in pharmacology and pharmacovigilance at a Spanish hospital took a year to annotate the corpus manually. The goal of the corpus annotation was to identify adverse drug reactions in discharge reports5.

Regarding the use of Large Language Models (LLMs) to process Spanish clinical narrative, in this work we used encoder-based transformers to validate the corpus in a Named Entity Recognition (NER) task on Spanish clinical texts. Previous works described in7 and8 reported results using bidirectional encoder representations from transformers (BERT-based architectures) and Bi-LSTM-CRF models (a bidirectional long short-term memory network with a Conditional Random Fields layer) on Spanish clinical cases, achieving an F-score of 88.8% on entity identification and classification. Additionally, the work in9 addressed the identification of negation and speculation, two phenomena relevant to the analysis of clinical documentation, integrating deep learning architectures. It was evaluated for English and Spanish on biomedical corpora, in particular BioScope and IULA, with F-measures of 86.6% (BioScope) and 85.0% (IULA).

The field of Natural Language Processing (NLP) is essential for improving access to relevant clinical information, and standard corpora are needed to refine and optimize systems that extract information from unstructured data. This work aims to build a corpus on allergology from Spanish clinical notes about patients suffering from nut allergies. To determine the validity of the dataset, NER experiments in the medical domain were conducted to address the issue of information accessibility for future applications. To the best of our knowledge, no open-source corpus of Spanish clinical notes focused on allergy processes has been released before this work.

In this section, the methodology used to collect and create the corpus and the experiments performed to test its validity are described. As illustrated in Fig. 1, the overall methodology comprises four phases: (1) data collection, (2) data pre-annotation, (3) data annotation, and (4) validation of the corpus. First, a preliminary set of annotation guidelines based on medical dictionaries is defined. These dictionaries are created by obtaining terms for the different semantic groups from SNOMED CT10, complemented with terms provided by the doctors of the Allergology Unit at Hospital Universitario Fundación Alcorcón (HUFA), a public health institution of the Madrid Health Service. This makes it possible to pre-annotate the clinical notes, generating an initial dataset. Subsequently, the clinical notes are preprocessed, and a specific dataset is selected for manual revision and annotation. An annotation environment is set up with the Doccano tool11, allowing physicians to manually annotate the pre-annotated dataset following the established guidelines. The physicians review and annotate the pre-annotated dataset to generate a high-quality annotated dataset in which disagreements are discussed and the annotation guidelines are refined. This labelled dataset is then used for training and testing an encoder-based transformer NER model.

Fig. 1

Methodology for the creation and validation of the corpus. The upper part reflects the Corpus Creation phase with three steps: Data collection, Pre-annotation, and Annotation. The bottom part concerns the Corpus Validation phase, in which the RoBERTa encoder-based transformers are fine-tuned on a NER task and then tested using the created corpus.

This research was approved by the Ethical Committee for Research with Medicines of the Hospital Universitario Fundación Alcorcón on 2 June 2020 (reference number 20/97).

Text sources

The HUFA provided 1,333,678 fully anonymized medical records from the Allergology Unit and the Emergency Department, covering the period from 1998 to 2021. These medical records correspond to patients presenting symptoms or clinical pictures related to allergic reactions. From these texts, a subset of 235,040 records related to nut allergies was used. Finally, medical experts selected 828 highly relevant records that included cases of anaphylaxis, resulting in a dataset with a vocabulary of 8,430 unique tokens. Physicians selected the notes from among those available for patients with varying degrees of severity of nut allergy and anaphylaxis. Their goal was to locate clinical progress notes of adequate length (following the different templates shown in Table 2) detailing diagnoses and tests for allergy and anaphylaxis, in order to build a rich resource for NLP.

The collection of texts has a total of 70,272 words and 3,938 sentences, with an average of 85 words and five sentences per note. The longest note contains 533 words and 50 sentences. The notes contain medical terms that pose a complex comprehension challenge for non-medical professionals. Clinical notes follow a different structure depending on the template used to collect patient information. The template types are anamnesis, personal and family history, physical examination, medical evolution, diagnostic tests, summary of the situation, diagnosis, medical treatment, and recommendations.

The texts are written in an informal clinical style in which typos, abbreviations, and incomplete sentences are found. Some clinical notes may contain results of analyses or skin tests performed on the patient. There are typographical errors caused by rushed medical care, tokenization errors, and words that were anonymized even though they are not sensitive. Such cases of over-anonymization are very few and are due to the strict anonymization rules previously implemented at the hospital.

Text selection

Concerning the methodology used in the selection of clinical notes, texts containing only a single character, symbol, or letter were first eliminated. In this work, a text is a clinical note that has a unique identifier (a note ID generated by the hospital system) and contains information about the patient collected by doctors. Secondly, a subset of 10,176 patients with 235,040 notes was selected, in which the information contained in the patient texts is related to nut allergies. For this purpose, the medical staff created a nut allergy dictionary with relevant terms to search the notes. It comprises seven semantic groups (Comorbidities, Manifestations, Allergy, Nut allergy, Cofactors, Proteins, and Treatments) with terms extracted from SNOMED CT12 and some terms collected by HUFA. Table 1 presents the characteristics of the dictionary.

Table 1 Distribution of dictionary terms per semantic group.
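To illustrate how such a dictionary can be used to retrieve nut-allergy-related notes, a minimal sketch is given below. The file names, the dictionary excerpt, and the whole-word matching criterion are assumptions for illustration, not the exact scripts used in this work.

```python
import json
import re

# Hypothetical excerpt of the nut allergy dictionary: semantic group -> terms.
nut_allergy_dictionary = {
    "Nut allergy": ["cacahuete", "pistacho", "nuez"],
    "Manifestations": ["urticaria", "edema", "disnea"],
    "Treatments": ["adrenalina"],
}

def note_matches(text, dictionary):
    """Return True if the note mentions at least one dictionary term (case-insensitive, whole words)."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
        for terms in dictionary.values()
        for term in terms
    )

# Hypothetical input: one anonymized clinical note per line, with "id" and "text" fields.
with open("anonymized_notes.jsonl", encoding="utf-8") as src, \
        open("nut_allergy_subset.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        note = json.loads(line)
        if note_matches(note["text"], nut_allergy_dictionary):
            dst.write(json.dumps(note, ensure_ascii=False) + "\n")
```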

The medical staff manually selected 828 texts related to 197 patients from two databases: one collected patients allergic to nuts, and the other collected patients who had had anaphylaxis due to nuts. Table 2 displays the number of notes of each template type among the 828 notes. Two clinical notes extracted from the corpus, with their English translation, have been included in the Zenodo repository13.

Table 2 Distribution of notes per template type in the corpus.

Annotation scheme

The entities or semantic groups that the physicians participating in the corpus creation decided to annotate are described below:

  • Comorbidity entities describe disorders or diseases occurring in the same person. This implies an interaction between diseases that may worsen the course of both. One example of this case could be:

    • Original sentence in Spanish: “Durante la primavera síntomas de asma incrementados”.

    • English translation: “During spring, increased asthma symptoms”.

    • Annotated original sentence: “Durante la primavera síntomas de [asma] COMORBIDITY incrementados”.

  • Manifestation refers to an indication or sign of an organic disturbance or illness. In pathology, it denotes the perceptible expression of a disease to the observer, which, once assessed, becomes a diagnostic factor. One example of this case could be:

    • Original sentence in Spanish: “Destacando la aparición de edemas”.

    • English translation: “Highlighting the appearance of edemas”.

    • Annotated original sentence: “Destacando la aparición de [edemas] MANIFESTATION”.

  • Allergy refers to entities that encompass allergies and the allergens that cause them; these are distinguished from nut allergies, which have their own entity type. One example of this case could be:

    • Original sentence in Spanish: “El paciente presenta alergias alimentarias”.

    • English translation: “The patient has food allergies”.

    • Annotated original sentence: “El paciente presenta [alergias alimentarias] ALLERGY”.

  • Nut allergy entity pertains to nuts and the reactions they cause in some patients, often associated with cutaneous and/or serum sensitization. One example of this case could be:

    • Original sentence in Spanish: “No comerá pistachos”.

    • English translation: “He will not eat pistachios”.

    • Annotated original sentence: “No comerá [pistachos] NUT ALLERGY”.

  • Cofactors are factors that might contribute to the severity of a reaction independently of allergen exposure. One example of this case could be:

    • Original sentence in Spanish: “El paciente presenta un cuadro de estrés”.

    • English translation: “The patient has a stress syndrome”.

    • Annotated original sentence: “El paciente presenta un cuadro de [estrés] COFACTOR”.

  • Protein refers to proteins that are present in different allergen sources and may be responsible for genuine sensitization and/or cross-reactivity between them. One example of this case could be:

    • Original sentence in Spanish: “Los alimentos que más cantidad de LTP suelen tener”.

    • English translation: “Foods that tend to have a higher amount of LTP”.

    • Annotated original sentence: “Los alimentos que más cantidad de [LTP] PROTEIN suelen tener”.

  • Treatments are measures implemented to improve, alleviate, or cure an allergy in the patient. One example of this case could be:

    • Original sentence in Spanish: “Se le suministra adrenalina intramuscular”.

    • English translation: “Intramuscular adrenaline is administered”.

    • Annotated original sentence: “Se le suministra [adrenalina intramuscular] TREATMENT”.

Table 3 shows the semantic groups with their semantic types and examples. All entity types are annotated, including those in negated contexts. For example, edema (‘oedema’) is annotated in no presenta edema (‘no oedema’). Conversely, discontinuous entities, that is, entities split into non-adjacent text fragments but still regarded as a single entity, are not annotated. However, nested entities (entities that contain other entities) are annotated. Nested entities are rarely annotated in corpora14, and including them in this corpus is an important contribution.

Table 3 Semantic groups with their semantic types and examples.

Pre-annotation process

The texts are annotated according to the annotation scheme described in this manuscript. Before starting the annotation process, a pre-annotation of entities is performed to expedite labelling. Previous research has shown that a pre-annotation phase does not interfere with obtaining good results15,16. A dictionary-based NER system is used, built on the dictionaries created by the medical staff and the SpaCy library17. The system includes a module for normalization, tokenization, lemmatization, and matching of the terms contained in the dictionaries, and it automatically identifies and labels entities in a text before the annotation task begins.
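A minimal sketch of such a dictionary-based pre-annotator built with spaCy’s PhraseMatcher is shown below; the dictionary excerpt and label names are illustrative, and the normalization and lemmatization steps of the actual system are omitted for brevity.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank Spanish pipeline is enough for tokenization in this sketch.
nlp = spacy.blank("es")

# Hypothetical excerpt of the medical dictionaries (semantic group -> terms).
dictionaries = {
    "NUT_ALLERGY": ["cacahuete", "pistacho", "nuez"],
    "MANIFESTATION": ["urticaria", "edema"],
    "TREATMENT": ["adrenalina intramuscular"],
}

# Case-insensitive matching on the lowercased token text.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for label, terms in dictionaries.items():
    matcher.add(label, [nlp.make_doc(term) for term in terms])

def pre_annotate(text):
    """Return (start_char, end_char, label) tuples for dictionary matches."""
    doc = nlp(text)
    return [
        (doc[start].idx, doc[end - 1].idx + len(doc[end - 1]), nlp.vocab.strings[match_id])
        for match_id, start, end in matcher(doc)
    ]

print(pre_annotate("Se le suministra adrenalina intramuscular tras comer pistacho."))
```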

Annotation process

Once the pre-annotation is completed, the annotation phase begins using the Doccano tool version 1.8.411 as a supporting environment to facilitate the annotation task for the doctors. Doccano is an open-source annotation tool that allows intuitive text data labelling through a web-based graphical interface. It supports text classification and named entity tagging, generating structured datasets ready for training machine learning models. Installation was performed using Docker to deploy the service in a container. Once annotation was completed, the data was exported in JSON format and directly integrated into the project’s natural language processing pipeline.

Annotation guidelines were developed in conjunction with the doctors and are available as a separate document in the Zenodo repository13. The guidelines consist of rules used for labelling the training data. It should be noted that nested entities are annotated; e.g., the phrase f420 rPru p 3 LTP Melocotón would be annotated as “f420 [rPru p 3 LTP [Melocotón] ALLERGY] PROTEIN”: inside the protein entity annotation there is the annotation of melocotón (“peach”) as an allergy. Figure 2 displays more examples of nested annotations.

Fig. 2

Example of an excerpt from a clinical note related to a blood test containing nested entities: (1) “rPru p 3 LTP Melocotón” is a PROTEIN entity containing the ALLERGY entity “Melocotón”, (2) “rAra h 9 LTP Cacahuete” is a PROTEIN entity containing the NUTALLERGY entity “Cacahuete”, and (3) “rJug r 1 Nuez” is a PROTEIN entity containing the NUTALLERGY entity “Nuez”.

In the data collection phase (see Fig. 1), the first version of the annotation guidelines was created; this set of guidelines was then refined in the annotation phase. Three medical practitioners from HUFA participated in the annotation process using the Doccano tool, performing a triple annotation on the same 60 texts. This triple annotation allowed the annotators to become familiar with the task, discuss, and adjust the annotation criteria. After reaching an agreement on the criteria and modifying the guidelines, the same 60 texts were reviewed again by the annotators to calculate the inter-annotator agreement. During this stage, the final guidelines were updated and established. Once the final guidelines were established, the remaining 768 notes were divided in a stratified manner, assigning 256 notes to each physician for annotation.

The inter-annotator agreement (IAA) was calculated for 60 texts (approximately 8% of the corpus) to measure the quality of the annotations. Previous works obtained conclusive IAAs using a similar percentage of their initial datasets18. The IAA is calculated using two metrics: the raw inter-annotator agreement and the Jaccard index19. Metrics based on percentage agreement or corrective metrics have not been used because they are not suitable for tasks where the number of items to be annotated is not known in advance19.

Table 4 shows the number of annotated entities in each semantic group. The semantic group with the fewest annotated entities is ‘Cofactors’, because most patients do not present factors that can worsen allergic reactions. The second group with the fewest annotated entities is ‘Comorbidities’, indicating that most patients do not have disorders or diseases that may interact with allergies. ‘Nut allergy’ and ‘Allergy’ entities outnumber the rest of the entity types. A total of 5.81% of the annotations are nested.

Table 4 Distribution of annotations per entity type in the corpus.

The results of the inter-annotator agreement are measured using both the raw inter-annotator agreement calculation and the Jaccard index.

Table 5 shows the results of inter-annotator agreement calculated using the “raw agreement” method. This method measures the consistency of annotations by identifying how many entities were labelled identically by all three annotators. To obtain this measure, the number of entities that all three annotators marked identically, that is, entities they all identified in the same way and in the same context, is counted. This number is then divided by the total number of entities to be annotated in the dataset. The raw agreement method measures the level of consensus without considering partial similarity; that is, agreement is only counted when the annotations are identical. The total inter-annotator agreement is 72.2%.

Table 5 Raw inter-annotator agreement.

Table 6 shows the results of inter-annotator agreement calculated using the Jaccard index, a metric that evaluates the degree of similarity between two sets of annotations. The index is defined as the size of the intersection divided by the size of the union of the two sets, and it is interpreted as a measure of how much the annotators’ labels agree relative to the total number of labels made. In this context, for each pair of annotators, the set of entities labelled by both annotators is compared with the set of all entities labelled by either of the two. The “intersection” refers to the number of entities that both annotators labelled identically, while the “union” represents the total number of entities annotated by either of the two, regardless of whether they matched. The Jaccard index is thus the number of matches (intersection) divided by the total number of labels present in at least one of the sets (union). The index ranges from 0 to 1, where a value close to 1 indicates high similarity in annotations (the annotators agreed on almost all labels), while a value close to 0 indicates low similarity (few or no matches). Unlike the raw agreement calculation, the Jaccard index captures partial matches between two annotators, which is useful in cases where annotation criteria may be interpreted differently or when categories are complex and difficult to identify consistently.

Table 6 Inter-annotator agreement using the Jaccard index.
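For clarity, a minimal sketch of how both agreement measures can be computed over sets of (start, end, label) annotations is shown below; the exact matching criteria and the denominator used for the raw agreement in this work may differ.

```python
def raw_agreement(ann_a, ann_b, ann_c):
    """Fraction of entities labelled identically by all three annotators.

    Each argument is a set of (start, end, label) tuples over the same texts.
    The denominator here is the union of everything annotated by anyone, which
    is one reasonable reading of 'the total number of entities to be annotated'.
    """
    identical = ann_a & ann_b & ann_c
    universe = ann_a | ann_b | ann_c
    return len(identical) / len(universe) if universe else 1.0

def jaccard_index(ann_a, ann_b):
    """Intersection over union of two annotators' entity sets."""
    union = ann_a | ann_b
    return len(ann_a & ann_b) / len(union) if union else 1.0

# Toy example with (start, end, label) spans.
a = {(0, 4, "NUT_ALLERGY"), (10, 16, "MANIFESTATION")}
b = {(0, 4, "NUT_ALLERGY"), (10, 16, "TREATMENT")}
c = {(0, 4, "NUT_ALLERGY")}
print(raw_agreement(a, b, c), jaccard_index(a, b))
```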

Table 5 shows the agreement values per entity type, while Table 6 shows the agreement values per pair of annotators. Most disagreement is observed in the ‘Cofactors’ category, followed by ‘Comorbidities’. This may be because these are the two minority groups and the most difficult categories to identify.

Data Records

The corpus is available at Zenodo13 in JSON Lines (JSONL) format and contains detailed records of allergology clinical notes. Each entry includes a unique identifier, a text providing information about the patient, a list of additional comments, and a list of tags specifying the different types of allergies and reactions. The labels cover the semantic groups described above. This structure enables detailed analysis and effective clinical monitoring of allergies in patients, facilitating accurate documentation and categorization of relevant medical information. Additionally, the annotation guidelines are included.
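A minimal sketch of how such a JSONL corpus can be loaded is shown below; the file name and field names are assumptions based on the description above and on the default Doccano export format, and should be checked against the actual Zenodo files.

```python
import json

# Hypothetical file name; label entries are assumed to be [start, end, tag] triples,
# as in the default Doccano JSONL export.
with open("nut_allergy_corpus.jsonl", encoding="utf-8") as handle:
    for line in handle:
        record = json.loads(line)
        text = record["text"]
        for start, end, tag in record.get("label", []):
            print(record["id"], tag, text[start:end])
```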

Technical Validation

Several experiments using the resources in the context of NER are presented to test the validity of the created corpus and to show a practical use case. The aim is to establish a first baseline on the corpus of clinical notes using a discriminative large language model (LLM). Next, we describe the bidirectional encoder representations from transformers (BERT) models employed, along with the methodology and evaluation metrics.

We use a framework based on LLMs such as BERT20 and contextualized embeddings. BERT employs a bidirectional strategy, which allows it to capture the semantic relationships and dependencies between words in a text by considering the context both before and after each word. The model is trained with self-attention encoder layers, Next Sentence Prediction (NSP), and a Masked Language Model (MLM) objective, which randomly replaces 15% of the input tokens. The fundamental aim of the training is to predict the original word that was substituted, which enables the model to learn from both left and right contexts, using WordPiece embeddings and the UNK (“unknown”) token to replace out-of-vocabulary words. The training process is divided into two fundamental stages: an initial phase of unsupervised pre-training and a subsequent fine-tuning of the pre-trained representations for supervised tasks. The result of this training is an internal language representation that captures complex patterns and relationships between words and sentences; these representations can be used as fundamental features in other natural language processing tasks, such as named entity recognition. We tested the Robustly Optimized BERT Pretraining Approach (RoBERTa)21 in different variants, with general-domain training and with specific training in the Spanish clinical setting. Three available models were tested: (1) the PlanTL-GOB-ES RoBERTa Large-BNE model22, trained on 570 GB of clean text from the Spanish National Library; (2) the PlanTL-GOB-ES RoBERTa Base-biomedical-clinical-es model23, trained on a real-world clinical corpus as well as a number of biomedical corpora in Spanish gathered via publicly accessible corpora and crawlers; and (3) the XLM-RoBERTa Base model24, pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages, using raw texts only, without any human labelling, with an automatic process generating inputs and labels from those texts.

Experimental setup

The annotated texts of the corpus were transformed into the BIO (Beginning, Inside, Outside) scheme to label and represent the entities in the text in a structured way. The effects of various segment representations in the NER task with a machine learning approach have been analysed in the literature25. The BIO format is a common format for tagging tokens in chunking tasks in computational linguistics26.
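As an illustration, the sketch below converts character-level entity spans into token-level BIO tags using a simple whitespace tokenizer; the actual pipeline presumably aligns labels with the transformer’s subword tokenization, and nested entities must be flattened or handled separately before this step.

```python
import re

def to_bio(text, entities):
    """Convert (start, end, label) character spans into whitespace-token BIO tags."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            if tok_start >= ent_start and tok_end <= ent_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return [tok for tok, _, _ in tokens], tags

text = "Se le suministra adrenalina intramuscular"
print(to_bio(text, [(17, 41, "TREATMENT")]))
# (['Se', 'le', 'suministra', 'adrenalina', 'intramuscular'],
#  ['O', 'O', 'O', 'B-TREATMENT', 'I-TREATMENT'])
```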

The models were then trained using a randomly selected subset of 662 texts (80% of the corpus), while the remaining 166 texts (20%) were reserved for model validation.
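A minimal sketch of this note-level split is shown below; the file name and the seed are illustrative only (five different random partitions are used in the experiments reported later).

```python
import json
import random

# Hypothetical corpus file: one annotated note per line.
with open("nut_allergy_corpus.jsonl", encoding="utf-8") as handle:
    notes = [json.loads(line) for line in handle]

random.seed(0)          # illustrative seed; each experimental round uses a different one
random.shuffle(notes)
cut = int(0.8 * len(notes))
train_notes, test_notes = notes[:cut], notes[cut:]
print(len(train_notes), "training notes,", len(test_notes), "test notes")
```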

In the fine-tuning phase, we added a token-classification layer for named entity recognition (without conditional random fields) on top of the Spanish models. We implemented it in PyTorch with the Transformers library. The whole implementation was developed in Jupyter notebooks, and an NVIDIA GeForce RTX 3060 with 12 GB of memory was used to train the NER models. The GPU facilitates asynchronous data loading and multiprocessing.
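A minimal sketch of this setup with the Hugging Face Transformers library is shown below; the checkpoint identifiers are the public names of the cited models, and the BIO label set is derived from the seven semantic groups, but the exact preprocessing and training code of this work may differ.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# BIO label set derived from the seven semantic groups (order is illustrative).
groups = ["COMORBIDITY", "MANIFESTATION", "ALLERGY", "NUT_ALLERGY",
          "COFACTOR", "PROTEIN", "TREATMENT"]
labels = ["O"] + [f"{prefix}-{g}" for g in groups for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

# Public checkpoint of the clinical-biomedical RoBERTa variant (BIO-CLI); swap in
# "PlanTL-GOB-ES/roberta-large-bne" or "xlm-roberta-base" for the other two models.
checkpoint = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels), id2label=id2label, label2id=label2id
)
```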

The PlanTL-GOB-ES RoBERTa Large-BNE (RL-BNE) model has 24 layers, a hidden size of 1024, and 16 attention heads. For fine-tuning, a named entity recognition layer was added, and the model was trained for six epochs with a maximum input length of 512 tokens.

The PlanTL-GOB-ES RoBERTa Base-biomedical-clinical-es (BIO-CLI) model has 12 layers, a hidden size of 768, and 12 attention heads; it was pre-trained using the Adam optimizer with a learning rate of 0.0005 and an effective batch size of 2,048 sentences. In the fine-tuning stage, a named entity recognition layer was added, and the model was trained for four epochs with a maximum input length of 512 tokens.

The XLM-RoBERTa Base (XRB) model has 12 layers, a hidden size of 768, and 12 attention heads. In the fine-tuning process, a named entity recognition layer was added, and the model was trained for four epochs with a maximum input length of 512 tokens.

Table 7 shows the hyperparameters of the models. Additionally, to prevent overfitting during fine-tuning for the named entity recognition (NER) task, several regularization techniques were applied through the training parameters. A low learning rate was used to reduce the impact of each weight update, preventing the model from fitting the training data too closely. Likewise, weight decay was applied as a penalty that encourages smaller weights and reduces the risk of overfitting by penalizing high parameter values. Early stopping was also employed to interrupt training if no improvement was observed on the validation set for several consecutive steps, helping to stop the process before the model began to overfit. Finally, the “load best model at end” option was activated, which automatically loads the model with the best performance on the validation set at the end of training, ensuring that the final selected model has the highest generalization ability according to the validation metric.

Table 7 Hyperparameters of the three tested LLMs: RL-BNE, BIO-CLI and XRB.
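Continuing the sketch above, the regularization settings just described (low learning rate, weight decay, early stopping, and loading the best checkpoint) map onto the Transformers Trainer API roughly as follows. The numeric values are illustrative placeholders, not the values reported in Table 7, and `model`, `train_dataset`, and `eval_dataset` are the hypothetical objects from the previous sketch and a tokenized, BIO-labelled version of the corpus.

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="ner_checkpoints",
    learning_rate=2e-5,              # low learning rate to temper each weight update
    weight_decay=0.01,               # penalizes large weights to limit overfitting
    num_train_epochs=8,
    per_device_train_batch_size=8,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # recover the checkpoint with the best validation score
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,                     # token-classification model from the previous sketch
    args=training_args,
    train_dataset=train_dataset,     # hypothetical tokenized BIO-labelled datasets
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```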

Results

We report the average precision, recall, accuracy, and F-measure with their standard deviations. We ran five experimental rounds with different random initializations of the training set for the NER models.

The models are trained using 80% of the corpus and tested on the remaining 20%. Table 8 shows the distribution by semantic group in each partition. In the five experimental rounds, the language models are trained for eight epochs. Table 9 displays the results obtained by the models, including precision, recall, F-measure (strict match), and accuracy, along with their respective standard deviations. Strict match refers to identifying entities with exact boundaries, as opposed to partial matching, which counts as a true positive any text subset of the entity.

Table 8 Distribution of semantic classes in the partitions.
Table 9 Average (± standard deviation) P, R, F1 and Acc of the three LLMs: RL-BNE, BIO-CLI and XRB.
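Strict-match entity-level metrics of this kind can be computed, for instance, with the seqeval library; the snippet below is a toy illustration, and the paper does not state which evaluation implementation was actually used.

```python
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score
from seqeval.scheme import IOB2

# Toy gold and predicted BIO sequences (one inner list per sentence).
y_true = [["O", "B-TREATMENT", "I-TREATMENT", "O", "B-NUT_ALLERGY"]]
y_pred = [["O", "B-TREATMENT", "I-TREATMENT", "O", "O"]]

# mode="strict" scores an entity as correct only if its boundaries and type match exactly.
print("P  ", precision_score(y_true, y_pred, mode="strict", scheme=IOB2))
print("R  ", recall_score(y_true, y_pred, mode="strict", scheme=IOB2))
print("F1 ", f1_score(y_true, y_pred, mode="strict", scheme=IOB2))
print("Acc", accuracy_score(y_true, y_pred))
```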

The benchmark results show that BIO-CLI obtains the best results, but there is no significant difference with respect to the other models. A statistical analysis of the results, detailed below, examines this point.

No substantial improvements are achieved from the fourth epoch onwards in any of the models. Training is run for more epochs, but checkpoints are stored so that the model is saved when it achieves its best results, thereby avoiding overfitting.

An error analysis was conducted to understand the performance of the models on the task. Model predictions were examined to identify errors. There are 1,388 tokens with a frequency of 1 in the training set and 685 out-of-vocabulary (OOV) tokens in the test set. Some errors arise from medications infrequently mentioned in the texts, such as the drug ‘Emerade’, which refers to ‘adrenaline’. Entities that appear rarely or not at all in the texts often lead to prediction errors.

In the partitioned dataset, three duplicate records were found in the training set (662 records) but none in the test set (166 records). Additionally, seven duplicate records were identified that are present in both partitions (training and test). These findings suggest that, whilst most of the data is properly divided, some identical records exist in both sets. This could influence the model evaluation, as information from the training set leaks into the test set. This situation may result from the common practice among doctors of copying information from previous notes to contextualize the current consultation.
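A minimal sketch of how such duplicates and train/test leakage can be detected from the partition files is given below; the file names are hypothetical.

```python
import json
from collections import Counter

def load_texts(path):
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line)["text"] for line in handle]

# Hypothetical partition files produced by the 80/20 split.
train_texts = load_texts("train.jsonl")
test_texts = load_texts("test.jsonl")

duplicates_in_train = [t for t, c in Counter(train_texts).items() if c > 1]
leaked = set(train_texts) & set(test_texts)
print(len(duplicates_in_train), "duplicate texts in train;",
      len(leaked), "texts appear in both partitions")
```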

Figure 3 shows the confusion matrix, revealing an overall positive performance. Most of the labels have a high number of correct predictions. However, some confusion is observed, especially with the label “O” (Outside), which has many correct predictions but is also the category most often confused with other labels. Table 10 shows the F-measure by semantic group for the BIO-CLI model.

Fig. 3

Confusion matrix of the NER model (BIO-CLI).

Table 10 F-measure by Semantic Groups for BIO-CLI model.

Labels such as B-Cofactor, I-Treatment, and I-Cofactor have a low number of correct predictions. This is because ‘cofactor’ is the minority semantic group and there are few such cases in the training set, as is also the case for treatments consisting of more than one word. With a higher number of labels from these semantic groups in the training set, the model’s effectiveness and accuracy would increase.

In ambiguous contexts, there are also prediction errors; for example, ‘cefaleas’ (headaches) is labelled by the models as a manifestation when it should be identified as a comorbidity. Another example is ‘catarro de repetición’ (recurrent colds), which is a comorbidity but is predicted as a manifestation. All the models made prediction errors in certain contexts. Some text features, such as hyphens with no space between words, also cause prediction errors. For example, ‘-reacción cutánea’ (skin reaction) is not correctly identified when the hyphen is placed next to ‘reacción’. These errors occur because hyphens can lead to issues in the tokenization of entities.

Table 11 displays examples of errors and predictions for each model. The first row includes an error related to punctuation marks (a hyphen between two words), the second row shows an error related to a lack of contextual understanding (in this case the inability to distinguish between comorbidity and manifestation), and the third row concerns a false negative caused by infrequent terms (“increased physical activity”).

Table 11 Some examples of errors made in the NER task by the RL-BNE, BIO-CLI, and XRB models.

Statistical Analysis

In any empirical scientific study, repeating an experiment under identical conditions often results in some variability in the outcomes; this is known as experimental error. Therefore, it is crucial to compare and evaluate the characteristics of the different sets of samples and the results obtained. In this research, and following the steps defined in27, the results have been validated from a statistical standpoint, thereby reducing the impact of experimental error or possible randomness.

Figure 4 includes the scatter plot (top left), residual plot (top right), box plot (bottom left), and analysis of means (bottom right) associated with the results.

Fig. 4

Scatter and box plots (left) and Residuals and analysis of means plot (right).

The models included in this analysis are again RL-BNE, BIO-CLI, and XRB. The scatter plot describes the behaviour of the set of samples obtained for each classifier through a point cloud. The box plot allows, through simple visual inspection, an approximate idea of the central tendency (through the median), the dispersion (through the interquartile range), the symmetry of the distribution (through the symmetry of the plot), and the possible outliers of each classifier. The central line within each box marks the sample median, and the mean is represented by a cross. The plot also includes a notch around the median, the width of which roughly indicates the 95% confidence interval. In the analysis of means plot, all the models are compared against the overall mean and the 95% decision limits. No model has samples outside the decision limits, so no model is significantly different from the overall mean. Finally, the residual plot shows the residuals obtained for each alternative. The residuals are the accuracy values minus the mean value of the group they belong to, and they show that the variability within each alternative is approximately the same. In the box plot, the boxes show symmetry in the sample distribution, and the widths of the median notches, for a 95% confidence interval, are quite similar. This suggests no statistically significant difference between the medians at this confidence level.

Hence, to compare the different models, it is necessary to ensure that there are no significant differences between the variances of the populations. Therefore, a variance check on the accuracy of the models was performed. The three statistics displayed in Table 12 test the null hypothesis that the standard deviations of the results within each of the three model levels are the same. Since the smallest of the p-values is greater than or equal to 0.05, there is no statistically significant difference between the standard deviations at the 95.0% confidence level.

Table 12 Variance check.

This agrees with one of the important assumptions underlying the analysis of variance (ANOVA). Therefore, as the conditions for applying the ANOVA method are fulfilled, this method has been included in the statistical parametric study to determine a ratio between the observed differences and to test different hypotheses based on the t-test. This analysis indicates whether there are significant differences between the means, and the Multiple Range Test (MRT) identifies the differences between different groups of results.

The analysis of variance (ANOVA) is shown in Table 13; it decomposes the variance of the accuracy into two components: a between-group component and a within-group component. The F-ratio, equal to 2.04099, is the ratio of the between-group estimate to the within-group estimate. Since the p-value of the F-test is greater than or equal to 0.05, there is again no statistically significant difference between the mean accuracy of the different models at the 95.0% confidence level.

Table 13 Analysis of variance (ANOVA).
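A minimal sketch of an equivalent variance check and one-way ANOVA in Python is shown below; the accuracy values are made-up placeholders for illustration (the actual values underlie Tables 12 and 13), and Levene’s test is used here as one possible homogeneity-of-variance statistic.

```python
from scipy import stats

# Hypothetical accuracy values from the five experimental rounds of each model.
rl_bne  = [0.952, 0.948, 0.950, 0.947, 0.951]
bio_cli = [0.955, 0.953, 0.954, 0.951, 0.956]
xrb     = [0.949, 0.946, 0.950, 0.948, 0.947]

# Homogeneity of variances across the three models.
print(stats.levene(rl_bne, bio_cli, xrb))

# One-way ANOVA: F-ratio of between-group to within-group variance and its p-value.
print(stats.f_oneway(rl_bne, bio_cli, xrb))
```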

An MRT test, shown in Table 14, was carried out to determine which means are significantly different. This test applies a multiple comparison procedure to determine which means differ significantly. Table 14 shows the estimated difference between each pair of means. In the ‘Difference’ column, an asterisk would be placed next to any pair of models showing a statistically significant difference at the 95.0% confidence level. The ‘± Limits’ column defines the width of the confidence interval for each mean difference, so that these intervals are given by Difference ± Limits. In summary, according to the MRT test, no statistically significant differences exist between any pair of means at the 95.0% confidence level.

Table 14 MRT test using pair of models.

Table 15 identifies one homogeneous group using columns of X’s. Within each column, the levels containing an X form a group of means within which there are no statistically significant differences. In this case, all models share the same column of X’s, so only one homogeneous group has been detected. The method used to discriminate between the means is Fisher’s least significant difference (LSD) procedure, with which there is a 5.0% risk of calling each pair of means significantly different when the actual difference equals zero.

Table 15 MRT test. Homogeneous groups.