Abstract
Objectives This study aimed to collect, categorise, and annotate a comprehensive dataset of intra-oral clinical images specifically designed for artificial intelligence training and testing.
Materials and methods Full-mouth clinical photos were collected from patients attending the Oral Medicine Clinic at the Faculty of Dentistry, Cairo University. A clinical oral examination was performed by two oral medicine specialists to establish and document the clinical diagnosis. The dataset was comprehensively annotated using LabelMe.exe in JavaScript Object Notation (JSON) format.
Results The dataset comprises 9,201 intra-oral images, which are subdivided according to the presence and type of oral lesion. These include 4,405 images classified as ‘normal', 2,314 as ‘low risk', and 2,482 as ‘high risk'. The image dimensions range from a minimum of 40,992 pixels to a maximum of 24,216,480 pixels. It includes a wide variety of oral lesions in all sites of the oral cavity, ensuring a comprehensive representation of different diseases. A significant number of images present periodontal diseases, while the dataset also features various classes of carious lesions from different intra-oral views, supporting research in conservative dentistry.
Conclusion Researchers can use the annotated dataset in the JSON format for training, validating and testing deep learning algorithms.
Key points
-
The dataset includes 9,201 high-quality intra-oral images without augmentation.
-
Images of different intra-oral sites are available in homogenous numbers. Images are categorised into normal, low, and high risk of malignant transformation and include detailed annotations outlining normal structures and oral lesions.
-
Provides a comprehensive inclusion of a diverse range of oral lesion categories.
Similar content being viewed by others
Introduction
Oropharyngeal cancer is considered the sixth most commonly reported cancer worldwide.1 Oral potentially malignant disorders (OPMDs) are any oral mucosal abnormalities that exhibit an increased risk of malignant transformation and developing cancer.2 The overall reported rate of malignant transformation of OPMDs is 7.9%; however, they are frequently misdiagnosed due to their diverse clinical presentation.3,4
Early detection of OPMDs and oral cancer is a priority as it improves the patient's prognosis.5 However, the majority of dentists lack expertise in the early detection of oral cancer, causing delayed referrals from family dentists to core hospitals.6,7 With recent advancements, various applications of machine learning are increasingly being used to provide effective healthcare support to individuals.8 Recently, in the era of remote diagnostics, artificial intelligence (AI) has entered the chat.
AI can accurately analyse massive datasets and has the capacity to continuously learn using new data.9 This paved the way for researchers to start training AI models by image analysis to detect oral lesions, particularly oral cancer.4,10 The effectiveness of these diagnostic methods largely relies on the availability of digital image repositories, which are crucial for training algorithms and subsequently aiding in automated diagnosis.11 The size, quality, diversity and proper annotations of these repositories are essential for developing robust and reliable AI models that perform well in various scenarios.12
The creation and annotation of datasets present considerable challenges concerning expertise, patient data management, security and confidentiality.13 A significant barrier hindering research is the absence of publicly available intra-oral clinical photographic datasets for both training and validation.14 Publicly available datasets with a variety of diseases and normal findings are urgently needed to use AI models for the early detection of oral cancer, preventing the unnecessary waste of time and resources in collecting this data from scratch each time.
Thus, we aimed to collect, categorise and annotate a comprehensive dataset of white light, intra-oral clinical images for AI training and testing. The dataset introduced by this paper contains normal intra-oral clinical photos, a diversity of diseases with low and high risk of malignant transformation, and oral cancer.
Materials and methods
Dataset collection
The current study was conducted in the Oral Medicine Clinic at the Faculty of Dentistry, Cairo University from March 2024 to December 2024. The study was approved by the Research Ethics Committee of Faculty of Dentistry, Cairo University (approval number: 6-3-24).
The inclusion criteria encompassed individuals over the age of 18 years with normal oral cavity findings, variations of normal, and patients presenting with various oral lesions. The study included participants of all ages and genders. A clinical oral examination was performed by two oral medicine specialists to establish and document the clinical diagnosis. Informed written consent was obtained from the patients before capturing full-mouth intra-oral photographs with proper lighting.
For image standardisation, the following sites were photographed:
-
Occlusal view of the entire maxillary arch (including the hard palate)
-
Occlusal view of the complete mandibular arch (including the floor of the mouth)
-
Right and left buccal mucosa
-
Upper and lower labial mucosa
-
Dorsum of the tongue (with the patient instructed to extend the tongue)
-
Ventral surface of the tongue (with the patient instructed to raise the tongue to touch the hard palate)
-
Right and left lateral sides of the tongue
-
Frontal view with the patient in centric occlusion.
Mirrors, gauze and fingers were placed without obstructing the lesions for future boundary boxes placement, and any removable dentures were removed if present.
The intra-oral photographs were taken using a variety of devices to represent the range of equipment employed by dentists. Specifically, a DSLR Canon 800D with a ring flash was used, along with smartphones, including the iPhone 11, iPhone 12 Pro, Samsung Galaxy 52, Honour 9X, and Realme 7 Pro, all using their respective mobile flashes. The images were captured on the dental unit, with all views taken in the upright position, except for the palate and floor of the mouth views, which were taken in a supine position.
The dataset consisted solely of original images, with minimal or no adjustments to brightness and cropping to remove extraneous extra-oral structures. Any augmented photos were excluded, as well as any images that were blurred or had poor lighting. The entire dataset was stored in JPG format, with all other formats (HEIC, PNG, CR2) converted to JPG. The dataset was anonymised and stored on a secured server for organisation.
Dataset organisation and categorisation
Two oral medicine specialists, each holding a PhD and over ten years of clinical experience, categorised the dataset into three distinct classes for comprehensive evaluation: a) normal findings (‘normal'); b) low risk of malignant transformation (‘low risk'); and c) high risk of malignant transformation (‘high risk'). The ‘normal' category encompassed all images without mucosal changes and included normal variations. The ‘high risk' category comprised images of oral cancer and potentially malignant oral disorders according to the latest classification presented by the World Health Organization,2 while the ‘low-risk' category included images that didn't fall into the other two categories.
All cases categorised as ‘high risk', including those diagnosed as oral cancer or potentially malignant disorders, were confirmed by histopathological examination. The ‘low-risk' category comprised of lesions that are typically diagnosed from the clinical picture where biopsy isn't standard practice, such as aphthous ulcers, herpes simplex infections, and petechiae, for which biopsy is not routinely indicated. However, in instances where clinical judgement warranted further investigation, biopsies were performed. All low-risk lesions that were clinically indicated for biopsy (e.g., fibromas, lipomas, and pyogenic granulomas) and subsequently included in the dataset were likewise confirmed histopathologically.
The dataset included a homogeneous sample for all intra-oral sites in the ‘normal' category, ensuring consistency across the data. Variability was observed across the different sites while maintaining overall uniformity in the dataset. Table 1 provides a detailed breakdown of the sites, and outlines the distribution and characteristics of each recorded intra-oral location.
The dataset was comprehensively annotated using LabelMe.exe in JavaScript Object Notation format. For the ‘low-risk' and ‘high-risk' categories, boundary boxes were delineated around the perimeters of the disease to represent the ground truth. In instances where multiple lesions appeared within a single image, corresponding multiple boundary boxes were applied. Care was taken to ensure that these boxes encompassed only the diseased areas, excluding normal mucosa. For the ‘normal' category, boundary boxes were placed at the designated sites, including any variations of normal tissue within the labelled regions as shown in Figure 1. The link to the dataset is available at https://zenodo.org/records/14571990.
Representative images from the annotated dataset, including: normal category in (A) Fordyce granules in buccal mucosa, (B) palate, (C) mandibular tori, (D) ventral surface of the tongue; low-risk category in (E) aphthous ulcer, (F) fibromas, (G) palatal papillary hyperplasia, (H) mucocele; and high-risk category containing oral potentially malignant disorders in (I) plaque-like oral lichen planus, (J) leukoplakia, (K) actinic cheilitis, (L) erosive oral lichen planus; plus oral cancer in (M) squamous cell carcinoma in lateral border of the tongue, (N) Kaposi sarcoma, (O) verrucous carcinoma, and (P) squamous cell carcinoma in palate
Results
Quality of dataset
The dataset comprises 9,201 intra-oral images, with dimensions ranging from a minimum of 40,992 pixels to a maximum of 24,216,480 pixels and a median of 5,658,095 pixels. Figure 2 illustrates the dimensions and pixel counts of our dataset. The histograms display the distribution of image heights (left) and widths (centre), with significant peaks at 4,000 pixels for height and 6,000 pixels for width. The scatter plot (right) shows a strong positive correlation between width and height, indicating that most images are proportionally scaled, while some outliers existed.
The resolution of an image (e.g., 1,920 × 1,080) refers to the number of pixels in the horizontal and vertical dimensions, respectively, determining the image's detail and clarity. Higher resolutions have more pixels, enabling finer details to be displayed.15 In Figure 2, the histogram illustrates the image resolutions in our dataset by showing the distribution of images' dimensions.
In Figure 3, the histogram illustrates the number of pixels of images indicating their resolutions multiplied by 1e7 (10−7). The peaks around 4,000–6,000 pixels in both the height and width distributions indicate that many images are of very high resolution, contributing to a high-quality dataset. High-resolution images allow for more detailed analysis, especially in tasks like medical imaging. The scatter plot demonstrates a general consistency in aspect ratios, meaning that the images in the dataset are not distorted or resized unevenly. This uniformity further reinforces the quality of the dataset, as inconsistent aspect ratios could introduce biases or complications in image analysis tasks.
While most images are concentrated in the higher resolution range, those in the lower range are likely taken with mobile phones or cropped from larger photos. Mobile devices typically produce smaller resolution images compared to DSLR (digital single-lens reflex) cameras, yet still maintain a very acceptable level of quality.
Breakdown of images included in each category
The 9,201 intra-oral images were subdivided according to the presence and type of oral lesion, with a breakdown of 4,405 ‘normal', 2,314 ‘low-risk', and 2,482 ‘high-risk' images. Table 1 provides a detailed breakdown of these categories and their respective sites.
We meticulously included a wide range of oral lesions to ensure a comprehensive representation of diseases in the dataset. Figure 4 illustrates the specific diseases present in each class in our dataset, along with the normal variations collected and provided. The high-risk group comprised both OPMDs and oral cancer cases.
Additionally, we made sure to incorporate ‘variations of normal' that are often misdiagnosed as false positives, while the rest of the lesions were placed in the ‘low-risk' category. A representative annotated sample of each category is shown in Figure 1. Table 2 provides details on the distribution and count of normal variations across different sites, while Figure 5 illustrates the distribution of low-risk and high-risk diseases at each site.
Certain clinical presentations can be observed in various diseases, each with different prognoses. For instance, in our dataset, haemorrhagic crusted lips were noted in patients diagnosed with pemphigus vulgaris and erythema multiforme, both of which are considered low-risk conditions. However, this presentation can also occur in oral lichen planus, which is classified as high risk. Similarly, desquamative gingivitis appeared in our dataset in cases of pemphigus vulgaris, mucous membrane pemphigoid and Crohn's disease, while also being observed in patients with oral lichen planus. These conditions were classified into the low-risk group based on the overall clinical examination, except when clear signs of Wickham's striae, characteristic of oral lichen planus, were evident in the images.
Discussion
Delays in diagnosing oral lesions can stem from both patients and healthcare professionals. Contributing factors include a lack of oncological awareness, misdiagnoses, delays in referrals, and slow histopathology results.16 Although the clinical oral examination is the most widely used method for detecting oral lesions, its effectiveness is constrained by low specificity and sensitivity, primarily due to its subjective nature.17 To overcome these challenges, AI is emerging as a valuable tool for improving the early detection of oral cancer by analysing large datasets with greater precision.14
The lack of publicly available datasets has hindered researchers from advancing AI algorithms to help general practitioners in early detection of oral cancer.14 Some researchers compiled their datasets to train and evaluate their AI algorithms; however, these datasets were not made publicly accessible, limiting their utility for further research and collaboration.4,10,18,19
A dataset of oral images was found on Mendeley, captured using mobile and intra-oral cameras.20 This dataset includes 323 images, with 165 benign and 158 malignant lesions initially collected. After augmentation, the dataset expanded to 1,320 benign and 1,273 malignant images. This dataset was also used in another study to train VGG19, MobileNet, and DeIT models which concluded that a larger dataset is needed due to the limited number of images, even after augmentation.21 Additionally, a few numbers of the images from this dataset had questionable classification.
A study by Rashid et al.22 introduced a dataset comprising 517 images focused on seven oral and cavity diseases: canker sores, cold sores, gingivostomatitis, mouth cancer, oral cancer, oral lichen planus, and oral thrush. The dataset was subsequently augmented to include 5,143 images to enhance its robustness. The dataset, while valuable, exhibited some limitations. There was ambiguity in distinguishing normal tissue variations from oral lesions. It had the same concern as the previously mentioned database where some of the labelling was debatable. Lastly, the dataset did not adequately address OPMDs, which are crucial for early detection and intervention in high-risk cases.
In 2024, a study introduced an intra-oral images dataset comprising 3,000 images, without augmentation, and included annotated files.23 This dataset is notable for its multiple superior features, such as the inclusion of four categories: oral cancer, benign and oral potentially malignant disorders, and healthy mucosa. Some of the lesions were even histologically confirmed. Additionally, the dataset provides a CSV file containing patient age, sex, and binary risk factors, such as smoking, chewing betel quid, and alcohol consumption. This dataset surpasses previous datasets in terms of the number of images and the accuracy of categorisation and annotation; however, the total of 3,000 images is still insufficient for robust AI model training, testing and validation. Also, a few of the images suffered out-of-focus areas. Our primary goal was to gather a diverse collection of photos taken by various devices, recognising that pixel quality and accuracy can vary significantly between devices. This approach ensures that the AI can handle inputs from any device a user might employ. Additionally, we aimed to compile a substantial dataset, ultimately reaching 9,201 images without augmentation, to help prevent overfitting in AI models.
We focused on including a wide range of diseases, making the dataset a reliable resource for general practitioners when diagnosing conditions in their clinics, whether they fall into low- or high-risk categories for malignant transformation, aiding in lowering the rate of false-negative diagnoses. To reduce the likelihood of false-positive diagnoses, we ensured a sufficient number of images in the normal category, including various normal variations.
We aimed to include patients with fixed orthodontic brackets and wires in the dataset to ensure that the AI model recognises these appliances as part of a normal presentation, allowing it to distinguish between typical orthodontic components and potential pathological findings. This integration is essential to prevent the model from misinterpreting these common dental devices as anomalies or areas of concern during analysis.
By collecting a homogenous dataset from all intra-oral sites (as shown in Table 1), the AI becomes familiar with each subclass individually, thereby improving accuracy. Our comprehensive approach not only broadens the AI's applicability but also significantly contributes to the reliability and precision of diagnostic tools in clinical settings. Since the photographs in our dataset focus on the soft tissue and inherently include teeth with the surrounding periodontal structures, this could make the dataset valuable for other disciplines, such as periodontology. We aimed to create a dataset that serves as a versatile tool for a wide range of dental and medical applications.
Conclusion
General practitioners face significant challenges in deciding whether to refer patients with suspected oral lesions, often struggling with false-positive and false-negative diagnostic dilemmas. To address these limitations, our study provides a carefully curated dataset designed to enhance AI-driven oral cancer detection by clinical image analysis. The dataset includes normal anatomy, benign conditions, potentially malignant disorders, and histopathologically confirmed malignancies, with specialist-verified annotations categorising lesions by malignant transformation risk (low/high). By capturing the full spectrum of oral conditions under standardised imaging conditions, this resource addresses critical gaps in existing datasets for developing clinically applicable AI tools.
For future work, two critical next steps will maximise this dataset's clinical impact: 1) prospective validation of AI models in real-world clinical settings to assess diagnostic performance; and 2) strategic dataset expansion through additional intra-oral images of diverse pathologies and integration of multimodal imaging (fluorescence/thermal) for improved lesion characterisation.
Data availability
The dataset described in this study is available in the Zenodo data repository (https://zenodo.org/records/14571990). It is provided under restricted access upon per request and licenced under CC BY-NC-ND 4.0.
References
Varela-Centelles P, López-Cedrún J L, Fernández-Sanromán J et al. Key points and time intervals for early diagnosis in symptomatic oral cancer: a systematic review. Int J Oral Maxillofac Surg 2017; 46: 1–10.
Warnakulasuriya S, Kujan O, Aguirre-Urizar J M et al. Oral potentially malignant disorders: a consensus report from an international seminar on nomenclature and classification, convened by the WHO Collaborating Centre for Oral Cancer. Oral Dis 2021; 27: 1862–1880.
Iocca O, Sollecito T P, Alawi F et al. Potentially malignant disorders of the oral cavity and oral dysplasia: a systematic review and meta-analysis of malignant transformation rate by subtype. Head Neck 2020; 42: 539–555.
Fu Q, Chen Y, Li Z et al. A deep learning algorithm for detection of oral cavity squamous cell carcinoma from photographic images: a retrospective study. EClinicalMedicine 2020; 27: 100558.
González-Ruiz I, Ramos-García P, Ruiz-Ávila I, González-Moles M. Early diagnosis of oral cancer: a complex polyhedral problem with a difficult solution. Cancers 2023; 15: 3270.
Messadi D V, Wilder-Smith P, Wolinsky L. Improving oral cancer survival: the role of dental providers. J Calif Dent Assoc 2009; 37: 789–798.
Watanabe M, Arakawa M, Ishikawa S et al. Factors influencing delayed referral of oral cancer patients from family dentists to the core hospital. J Dent Sci 2024; 19: 118–123.
Liu X, Faes L, Kale A U et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health 2019; DOI: 10.1016/S2589-7500(19)30123-2.
Hegde S, Ajila V, Zhu W, Zeng C. Artificial intelligence in early diagnosis and prevention of oral cancer. Asia Pac J Oncol Nurs 2022; 9: 100133.
Gomes R F T, Schmith J, Figueiredo R M et al. Use of artificial intelligence in the classification of elementary oral lesions from clinical images. Int J Environ Res Public Health 2023; 20: 3894.
Willemink M J, Koszek W A, Hardell C et al. Preparing medical imaging data for machine learning. Radiology 2020; 295: 4–15.
Mongan J, Halabi S S. On the centrality of data: data resources in radiologic artificial intelligence. Radiol Artif Intell 2023; DOI: 10.1148/ryai.230231.
Uribe S E, Issa J, Sohrabniya F et al. Publicly available dental image datasets for artificial intelligence. J Dent Res 2024; 103: 1365–1374.
Sengupta N, Sarode S C, Sarode G S, Ghone U. Scarcity of publicly available oral cancer image datasets for machine learning research. Oral Oncol 2022; 126: 105737.
Gonzalez R C, Woods R E. Digital Image Processing, Global Edition. New York: Pearson Education, 2018.
Rutkowska M, Hnitecka S, Nahajowski M, Dominiak M, Gerber H. Oral cancer: the first symptoms and reasons for delaying correct diagnosis and appropriate treatment. Adv Clin Exp Med 2020; 29: 735–743.
Essat M, Cooper K, Bessey A, Clowes M, Chilcott J B, Hunter K D. Diagnostic accuracy of conventional oral examination for detecting oral cavity cancer and potentially malignant disorders in patients with clinically evident oral lesions: Systematic review and meta-analysis. Head Neck 2022; 44: 998–1013.
Welikala R A, Remagnino P, Lim J H et al. Automated detection and classification of oral lesions using deep learning for early detection of oral cancer. Inst Electric Electron Engineers 2020; 8: 132677–132693.
Tanriver G, Soluk Tekkesin M, Ergen O. Automated detection and classification of oral lesions using deep learning to detect oral potentially malignant disorders. Cancers 2021; 13: 2766.
Chandrashekar H S, Kiran A G, Murali S, Dinesh M, Nanditha B. Oral images dataset. Mendeley Data 2021; DOI: 10.17632/mhjyrn35p4.2.
Islam M M, Alam K R, Uddin J, Ashraf I, Samad M A. Benign and malignant oral lesion image classification using fine-tuned transfer learning techniques. Diagnostics 2023; 13: 3360.
Rashid J, Qaisar B S, Faheem M, Akram A, Amin R U, Hamid M. Mouth and oral disease classification using InceptionResNetV2 method. Multimed Tools Appl 2024; 83: 33903–33921.
Piyarathne N S, Liyanage S N, Rasnayaka R M S G K et al. A comprehensive dataset of annotated oral cavity images for diagnosis of oral cancer and oral potentially malignant disorders. Oral Oncol 2024; 156: 106946.
Funding
Cairo University Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Author information
Authors and Affiliations
Contributions
NA: conceptualisation, data curation, resources, methodology, visualisation, writing – original draft, writing – review and editing. FMZ: data curation, methodology supervision, visualisation, writing - original draft, writing - review & editing. YSED: conceptualisation, formal analysis, methodology supervision, writing - original draft, writing - review & editing. NAA: resources, data curation, methodology supervision, visualisation, writing - original draft, writing - review & editing.
Corresponding author
Ethics declarations
The authors declare no conflicts of interest. The study was approved by the Research Ethics Committee of Faculty of Dentistry, Cairo University (approval number: 6-3-24). This research was done in accordance with the declaration of Helsinki. Participants signed an informed consent for their inclusion and the relevant data gathered in this research.
Rights and permissions
Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0.© The Author(s) 2025.
About this article
Cite this article
Ayman, N., Zahran, F., Safaa El-Din, Y. et al. An annotated clinical image dataset for AI classification of malignant and potentially malignant oral lesions. Br Dent J (2025). https://doi.org/10.1038/s41415-025-9007-6
Received:
Revised:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41415-025-9007-6







