Background & Summary

Pharyngitis, commonly referred to as sore throat, is an inflammation of the pharynx that primarily affects the ear, nose, and throat (ENT) system, impacting millions of individuals annually1. This condition can manifest as nonbacterial pharyngitis, often caused by common viruses such as adenoviruses and rhinoviruses, or as bacterial pharyngitis, most commonly attributed to Group A Streptococcus bacteria2. Accurate differentiation between these two forms is critical, as bacterial infections typically require antibiotic therapy, whereas nonbacterial infections do not3.

The standard diagnostic process for pharyngitis relies on a combination of patient history, physical examination, laboratory tests including rapid antigen detection tests (RADTs), and throat cultures4. However, these methods can be time-consuming and may not provide immediate results5, delaying appropriate treatment. Accurate identification of bacterial versus nonbacterial infections is essential for determining whether antibiotics are necessary. In cases of bacterial infection, timely antibiotic administration is crucial to avoid complications5,6. Conversely, the misdiagnosis of nonbacterial infections and the resulting unnecessary use of antibiotics contribute significantly to the global issue of antibiotic resistance7, a growing public health crisis that threatens the effectiveness of these life-saving drugs8.

In response to the need for more accurate and rapid pharyngitis diagnoses, we developed the PGUPharyngitis dataset, comprising 742 high-resolution images of patients’ throats. Each case in the dataset was meticulously annotated with insights from four to nine physicians who reviewed and labeled the records based on the images and symptoms to ensure the highest diagnostic accuracy. The images were captured using two smartphones, the Samsung Galaxy S21 Ultra and the Xiaomi Redmi 8, selected for their ability to produce clear and detailed visuals. Figure 1 presents two samples from the dataset, one being bacterial and the other nonbacterial.

Fig. 1

(a) Example of a nonbacterial throat image from the dataset. (b) Example of a bacterial throat image from the dataset.

Data collection was conducted in two Iranian cities with distinct climatic conditions—one with a colder, mountainous climate and the other with a warm, humid, and coastal environment—to ensure the dataset represents a diverse range of patients from varied geographic and environmental backgrounds. The data was collected from October 2023 to May 2024. This dataset aims to facilitate the training of advanced deep learning models designed to improve the accuracy and reliability of pharyngitis diagnosis.

By providing a rich and diverse collection of throat images, the PGUPharyngitis dataset is the largest publicly available dataset in this field, and addresses a critical gap in existing medical data resources, offering researchers and healthcare professionals the opportunity to develop and refine intelligent diagnostic tools. These tools can operate effectively in various clinical settings and support the integration of machine learning techniques into healthcare.

The availability of this dataset is a crucial step toward advancing the field of automated diagnostics and tackling the challenges posed by misdiagnosis and unnecessary antibiotic use. Accurately diagnosing a condition can be difficult, even for experienced physicians. In fact, within our dataset, there were cases where physicians disagreed on the diagnosis for a single patient. Our goal is to substantially enhance healthcare outcomes, particularly in regions with limited access to specialized medical facilities. Since the images are captured using smartphones, the dataset can support the integration of deep learning models into smartphone applications. This enables rapid and accurate diagnostics, empowering patients to actively manage their health and reducing the risk of complications caused by misdiagnoses or the overuse of antibiotics9.

The work by Yoo et al.10 shares a similar goal with our study, aiming to leverage smartphone-captured throat images and deep learning for pharyngitis diagnosis. Their dataset included 131 pharyngitis images and 208 normal throat images, collected from online sources and augmented using CycleGAN. However, the reliance on web-sourced images raises concerns regarding the dataset’s representativeness and clinical validation. Additionally, synthetic augmentation may not fully reflect real-world variability in throat presentations. In contrast, our dataset provides significant advantages. It consists of 742 high-resolution images gathered directly from diverse geographical and clinical settings, accompanied by detailed demographic and symptom data. These attributes enhance its diversity, applicability, and reliability for real-world clinical applications, supporting the development of robust AI-driven diagnostic models designed for practical healthcare scenarios.

Methods

Data Collection

The data for this study were meticulously gathered during patient visits to general practitioners across two geographically distinct regions of Iran. One region, located in the mountainous areas, experiences a cold climate, while the other, situated near the coast of the Persian Gulf, is characterized by a hot and humid environment. These contrasting climatic conditions provided a unique opportunity to collect data from a diverse range of environmental contexts, enriching the dataset and enhancing its applicability across various population segments. Data collection occurred from October 2023 to May 2024, focusing on patients with symptoms of the common cold. Participation was entirely voluntary, and patients were fully informed about the research objectives and potential outcomes. Patients provided written informed consent after the attending physician explained the study's purpose, voluntary nature, and confidentiality measures, including permission for anonymized clinical data and images to be used in current and future research and publications internationally. Ethical approval was obtained prior to the study (research ethics committee Approval ID: IR.BPUMS.REC.1403.282), and informed consent was secured from all participants.

All collected data, including throat images and associated demographic information, were anonymized and used solely for research purposes to advance the development of deep learning diagnostics. The risk of participant identification was carefully considered. All personally identifiable information, including names, dates of birth, contact details, and any other direct identifiers, was removed from the dataset. Each participant was assigned a unique code to replace identifying information. Additionally, facial features in medical images (if present) were obscured or cropped where applicable.

Image Processing

Throat images were captured using two smartphone models—Samsung Galaxy S21 Ultra and Xiaomi Redmi 8 Pro—selected for their high-quality cameras. Using two smartphone models introduced natural variations in image quality, enriching the dataset. Additionally, the differing lighting conditions in the two cities further contributed to the robustness of the dataset. Each throat image was taken in a well-lit room, utilizing the smartphones’ flashlight function to ensure clear visibility of the throat area. The camera was positioned directly in front of the patient’s open mouth, focusing on the back of the throat to capture the most relevant region for pharyngitis diagnosis.

Alongside the images, key demographic and clinical data were recorded, including the patient’s age, gender, and symptoms. These additional data points allowed for a more comprehensive analysis and facilitated the exploration of potential correlations between demographic characteristics and the different types of pharyngitis.

Following collection, the images underwent a rigorous quality control process. Each image was carefully reviewed for clarity, with blurred or poorly lit images excluded from the dataset. Misaligned images were manually corrected through rotation and cropping to ensure uniformity. The throat area was emphasized to focus on the regions most indicative of pharyngitis. The final images in the repository are cropped sections of the originals and were not resized; they therefore have varying resolutions. After this initial review, a second round of quality control was performed to ensure the dataset's reliability and uniformity. Ultimately, images that at least three physicians agreed were unsuitable for diagnosis were excluded, and 742 high-quality images were selected from the original 860 collected.

Diagnostic Process

Upon finalizing the dataset, a standardized diagnostic process was used to classify each image based on the type of pharyngitis. This classification was performed by a team of experienced physicians. On average, each image was reviewed by six physicians, although the number of reviewers varied, with some images being assessed by as few as four and others by as many as nine experts. The goal was to ensure a reliable and accurate classification by leveraging the collective expertise of the physician team. One doctor made the diagnosis by examining the patient’s throat in person, while the others based their diagnoses solely on images.

Pharyngitis is typically classified into two main categories: bacterial and nonbacterial (e.g., viral, allergic, fungal, and normal). Similarly, our records are divided into these two categories: bacterial and nonbacterial. Since there were no cases of fungal pharyngitis in this dataset, our nonbacterial category includes only viral, allergic, and normal cases. In general, the treatment for all these cases is supportive care. A majority vote among the physicians determined the final classification for each image. In cases where significant discrepancies in diagnosis were observed, the image was reassessed by an additional physician who provided an independent evaluation. This step helped resolve any inconsistencies and ensured a high degree of accuracy in the final dataset. The overall workflow and summary of the dataset creation process are presented in Fig. 2 and Table 1.
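The majority-vote rule described above can be sketched as follows. This is an illustrative implementation only: it assumes each case's physician labels are gathered into a list, with `None` for physicians who did not review the case; the dataset's actual column layout is described under Data Records.

```python
from collections import Counter

def majority_label(diagnoses):
    """Majority vote over physician labels for one image.

    `diagnoses` is a list of 'bacterial' / 'nonbacterial' strings,
    with None for physicians who did not review the case. Returns
    None on a tie, flagging the case for an additional independent
    reviewer (the reassessment step described in the text).
    """
    votes = Counter(d for d in diagnoses if d is not None)
    if not votes:
        return None
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: needs an extra physician's evaluation
    return ranked[0][0]
```

For example, `majority_label(["bacterial", "bacterial", "nonbacterial", None])` resolves to `"bacterial"`, while an even split returns `None` and would trigger the extra review.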

Fig. 2

The overall data acquisition and quality assessment workflow used in this work.

Table 1 The stages involved in creating the dataset and their respective descriptions.

Data Records

The PGUPharyngitis dataset is a publicly accessible collection containing clinical and demographic data alongside diagnostic results from 742 patients who presented with pharyngitis symptoms. This dataset is instrumental for training and evaluating machine learning models aimed at differentiating between bacterial and nonbacterial pharyngitis. Each entry comprises comprehensive symptom indicators, demographic details, and verified diagnostic outcomes.

Data Format

The dataset is organized as a structured Excel file, where each row represents an individual patient. The columns are arranged to include patients’ age, gender, symptoms, and diagnoses. Table 2 describes fields in the dataset.

Table 2 The variables and their descriptions in the PGUPharyngitis dataset.

Symptom Indicators and Diagnostic Data

This dataset includes 20 binary symptom indicators that capture common clinical features of pharyngitis, such as sore throat, headache, and fever, outlined in Table 2. Each symptom is represented by a binary value, where 1 indicates the presence and 0 the absence of the symptom. Additionally, 9 columns in each patient record contain the diagnostic results; each patient has between four and nine diagnoses from different physicians.
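As a minimal sketch of how such records might be processed with pandas, assuming hypothetical column prefixes `symptom_` and `diagnosis_` (the dataset's actual field names are listed in Table 2):

```python
import pandas as pd

def summarize_records(df, symptom_prefix="symptom_", diag_prefix="diagnosis_"):
    """Add per-patient symptom and review counts to the record table.

    The column prefixes are hypothetical names used for illustration;
    the real field names are those documented in Table 2.
    """
    symptom_cols = [c for c in df.columns if c.startswith(symptom_prefix)]
    diag_cols = [c for c in df.columns if c.startswith(diag_prefix)]
    out = df.copy()
    out["n_symptoms"] = out[symptom_cols].sum(axis=1)      # 1 = present, 0 = absent
    out["n_reviews"] = out[diag_cols].notna().sum(axis=1)  # 4 to 9 physicians per case
    return out
```

The same pattern (selecting columns by prefix, then aggregating row-wise) extends naturally to per-symptom prevalence counts like those reported in the Technical Validation section.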

Quality Control and Data Validation

To ensure data integrity, the PGUPharyngitis dataset underwent rigorous quality control and validation processes, and records missing essential information were excluded from the final dataset. Diagnostic results were confirmed via reviews conducted by multiple physicians, resulting in high reliability of the diagnostic data.

Technical Validation

The dataset’s statistical breakdown is detailed in Figs. 3 and 4. As illustrated in Fig. 3a, the most common symptoms among patients are sore throat, cough, and rhinorrhea, affecting 19.8%, 14.7%, and 12.1% of the sample, respectively. Figure 3b,d validate that the dataset is balanced in terms of age and gender distribution, as it includes a broad range of age groups, with nearly equal representation of male and female participants (51.6% and 48.4%, respectively).

Fig. 3

Statistical analysis of the dataset. (a) Distribution and prevalence of symptoms; each patient can have up to 20 symptoms. (b) Distribution of age groups in the dataset and the number of patients in each age group. (c) Percentage and total number of bacterial and nonbacterial diagnoses. (d) Percentage and number of male and female participants. (e) Distribution of the number of diagnoses per patient.

Fig. 4

Symptoms co-occurrence heatmap.

Moreover, to validate and evaluate the dataset, as well as to demonstrate a sample use case, we conducted experiments using four different models: DenseNet12111, Swin Tiny12, MobileNet V3 Small13, and ConvNeXt Small14. These models were used to develop a binary image classifier (i.e., classifying the images as bacterial or nonbacterial). The models take inputs of size 224 × 224; since the dataset images have varying resolutions, each image is resized to 224 × 224 during training. We trained each model on the full dataset and on four subsets corresponding to the warm-city, cold-city, Redmi-phone, and Samsung-phone images.

The results of these experiments are shown in Table 3. To mitigate any potential bias due to data splitting, we applied 3-fold cross-validation. K-Fold cross-validation (K = 3) involves dividing the dataset into K equal subsets (folds). The model is trained K times, each time using K-1 folds for training and the remaining fold for validation. This process ensures that every data point is used for both training and validation exactly once. It reduces the risk of performance variance due to a particular train and validation split and provides a more generalized estimate of the model’s accuracy.
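The 3-fold procedure described above can be sketched with scikit-learn; this is an illustrative setup, not the authors' exact code, and the shuffle seed is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import KFold

# Illustrative 3-fold split over the 742 image indices (K = 3 as in
# the paper). Each index lands in the validation fold exactly once.
indices = np.arange(742)
kf = KFold(n_splits=3, shuffle=True, random_state=0)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(indices)):
    # Here one would train on train_idx, validate on val_idx, and
    # average the three per-fold metrics at the end.
    fold_sizes.append((len(train_idx), len(val_idx)))
```

With 742 samples the validation folds have 248, 247, and 247 images. A stratified variant (`StratifiedKFold`) could additionally preserve the bacterial/nonbacterial ratio per fold.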

Table 3 Results obtained from training DenseNet121, Swin Tiny, MobileNet V3 Small, and ConvNeXt Small using different subsets of our dataset for binary classification.

The experiments used a batch size of 20, the Adam optimizer15, and data augmentation techniques including random rotation, horizontal flip, and random affine transformations. The results are presented across five metrics: accuracy, precision, recall, F1-score, and AUC. When trained on the full dataset, MobileNet V3 Small outperforms the other models in all metrics except AUC, achieving an accuracy of 80.50% ± 12.79%, a precision of 69.30% ± 21.99%, a recall of 55.26% ± 1.75%, an F1-score of 54.26% ± 1.81%, and an AUC of 0.554 ± 0.073. Although these models show satisfactory performance, incorporating the symptoms, exploring other state-of-the-art models, and developing new architectures to improve these metrics remain ongoing challenges.

Usage Notes

The dataset is openly available on Figshare (https://doi.org/10.6084/m9.figshare.28163513)16. The root directory contains a folder with all the images, each labeled with a unique identifier. Additionally, an Excel file is included, listing each patient’s symptoms, age, gender, and diagnosis.

Images in this dataset were obtained using smartphone cameras under carefully managed yet varied lighting conditions to enhance the visibility of the throat area. Variations in smartphone models, handling, lighting, and patient positioning may affect image color, brightness, and clarity. For instance, blood pressure and oxygenation can affect the color of the mucous membranes. To mitigate the potential effects of the aforementioned factors on the visibility of redness or inflammation, histogram matching during the preprocessing stage is recommended to standardize brightness and color. Additionally, data augmentation techniques, like minor rotations or lighting adjustments, are recommended to adapt the model to real-world conditions.

In some images, the focus is on the lips rather than on the throat. One of the main goals of this dataset is to support the development of remote healthcare solutions and mobile applications, particularly for use in underdeveloped regions. In such scenarios, users are likely to capture throat images themselves using mobile devices, which can often result in suboptimal focus. Including these images ensures that AI models trained on the dataset are robust and better equipped to handle real-world usage conditions. Future researchers are free to exclude these images if their specific use cases require only high-quality, well-focused data.

The collected dataset reflects the actual distribution of symptoms and patients in the cities where the data were gathered. Since the real-world distribution of symptoms among patients is imbalanced, this imbalance is also present in the dataset. Future researchers using this dataset are therefore encouraged to apply methods that address the effects of imbalanced data.
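One simple way to counter such imbalance is class reweighting; the sketch below computes inverse-frequency weights (the "balanced" heuristic), which could then be passed, for example, to a weighted cross-entropy loss. This is one option among many (resampling, focal loss, etc.), not a prescription from the paper.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Per-class weights inversely proportional to class frequency:
    w_c = n_samples / (n_classes * n_c).

    `labels` holds 0/1 class indices (e.g., nonbacterial/bacterial);
    both classes are assumed to be present.
    """
    counts = np.bincount(labels, minlength=2)
    return counts.sum() / (len(counts) * counts)
```

For example, a 3:1 split yields weights of 2/3 for the majority class and 2.0 for the minority class, so minority errors contribute proportionally more to the loss.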

To complement the images, essential contextual information is included, such as the patient’s age, gender, and presence or absence of 20 types of symptoms. This supplementary data provides insights that can significantly enhance diagnostic accuracy and model performance. Incorporating this information allows researchers to gain an understanding of the diagnostic results and refine models to better address real-life scenarios.

Future users of the data are encouraged to use methods that address label uncertainty, such as stochastic and probabilistic methods. Among the samples, 182 images had full consensus among all doctors, with identical diagnostic opinions. This subset can serve as a “gold standard” for model evaluation. Future researchers can use these 182 samples as test data, while using the remaining data (potentially with probabilistic or self-supervised methods) for training.
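The consensus-based split can be derived directly from the diagnosis columns; a sketch, again assuming a hypothetical `diagnosis_` column prefix (the actual field names are listed in Table 2):

```python
import pandas as pd

def consensus_split(df, diag_prefix="diagnosis_"):
    """Split records into a full-consensus subset (every reviewing
    physician gave the same label; 182 cases in this dataset) and the
    remaining pool. The column prefix is a hypothetical name used for
    illustration.
    """
    diag_cols = [c for c in df.columns if c.startswith(diag_prefix)]

    def unanimous(row):
        return row[diag_cols].dropna().nunique() == 1

    mask = df.apply(unanimous, axis=1)
    return df[mask], df[~mask]  # (gold-standard test set, training pool)
```

The first returned frame would serve as the fixed test set, while the second, label-uncertain pool feeds probabilistic or self-supervised training as suggested above.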

While the dataset includes a wide range of bacterial and nonbacterial pharyngitis cases, some limitations exist. Rare forms of pharyngitis, such as those due to fungal infections, are underrepresented due to their rarity. Additionally, cases of chronic pharyngitis, which may present distinct clinical characteristics compared to acute cases, are not included.