Introduction

The Coronavirus Disease 2019 (COVID-19) is caused by the SARS-CoV-2 virus that can infect the upper and lower respiratory tract1. The oral cavity is one of the entrances of SARS-CoV-2 into the body2,3,4. The infection of this structure can initiate the development of various oral manifestations such as xerostomia and dysphagia2,3,4. The disease can remain asymptomatic or cause severe pathology5 (pneumonia, dyspnea, organ dysfunction) and death1. As of 27 August 2023, more than 770 million COVID-19 cases have been confirmed, and over 6.9 million deaths have been reported (WHO, https://www.who.int/publications/m/item/weekly-epidemiological-update-on-covid-19---1-september-2023).

Between December 2019 and May 2023, a total of 362,313 publications related to COVID-19 were published (PubMed). To make COVID-19-related literature more accessible, biomedical literature databases such as the open-access COVID-19 Open Research Dataset (CORD-19)6 and the NIH COVID-19 Portfolio (https://icite.od.nih.gov/covid19/search/) were created. The Allen Institute for Artificial Intelligence created the CORD-19 in collaboration with the National Institutes of Health and the White House Office of Science and Technology Policy, among others. In the CORD-19, the PDF documents are converted into machine-readable JSON files that can be manipulated using programming languages. CORD-19 had 694,366 records by December 2021. The CORD-19 final release was on June 2, 2022.

Data mining enables extracting information from large amounts of text to be analysed using text mining methods. Text mining facilitates information extraction, categorisation, grouping, trend analysis and visualisation7. The goal is to focus information searches to remove noise8 and to identify hidden knowledge in the literature9. Zengul et al. used text mining to classify literature from the NIH COVID-19 Portfolio. They found several topics of COVID-19 research, such as patient care and outcomes, epidemiologic modelling, mental health and detection10. Tandan et al. used data mining to identify the common pattern of symptoms in patients with COVID-197. Reddy et al. developed a biomedical platform to collect data on COVID-19 clinical risks. The use of this platform revealed the difficulty of extracting relevant clinical information through text mining and the need for feedback from experts in the field to obtain reliable results9.

Therefore, this study aimed to identify oral manifestations in patients with COVID-19 using text mining to facilitate extracting relevant clinical information from a large set of publications. Our study identified frequent COVID-19 oral manifestations that must be considered with other symptoms to diminish the risk of dentist-patient infection.

Methods

This is an observational and descriptive study.

Acquisition of data and text mining

A list of documents available in the CORD-19 database on January 31, 2022, was downloaded using the Kaggle notebook “Summary page COVID-19 risk factors” (https://www.kaggle.com/mlconsult/summary-page-covid-19-risk-factors). The notebook uses Pandas” built-in search technology. The notebook’s Python code was modified to find relevant articles by following this inclusion criteria: COVID-19 in abstract (‘COVID’, ‘-cov-2’, ‘cov2’, ‘ncov’, coronavirus), date of publication (January 1, 2020, and December 31, 2021), source of document (Elsevier, Medline, and PMC), and language (English, French, and Spanish). These were the exclusion criteria: preprints, reviews, communications, letters to editors, and documents lacking abstracts.

We search for keywords associated with oral health/dentistry. The search keywords were “buccal”, “dental”, “dentistry”, “dentist”, “odontology”, “oral” and “stomatognathic”. A search per keyword in the abstracts was performed independently and merged. Duplicate documents by title or abstract were removed. The notebook also scans the abstracts to print sentences (excerpts) containing the keywords to identify relevant articles rapidly. Finally, a text file containing “author”, “doi”, “title”, “abstract”, “excerpt”, “source”, “link”, and “publish time” was obtained. To curate the list of articles obtained, a text mining context analysis was performed to detect non-relevant (exclusion) keywords.

Curation of the list of articles

To extract from the text file only articles related to COVID-19 and oral health/dentistry, we used the “Mining COVID-19 scientific paper” notebook with modifications, which is available on Kaggle (https://www.kaggle.com/mobassir/mining-covid-19-scientific-papers). The text file was uploaded to the notebook. Stop words (commonly used words, usually articles, that search engines are programmed to ignore) were removed from the abstract sentences. Stop words commonly found in scientific publications (such as “objective”, “methods” and “conclusions”) were also eliminated. Using the Gensim Python library, a 5-g (five-word combination) was created11, and 494 “exclusion keywords” were identified. After visually scanning that the abstracts containing those “exclusion keywords” were not relevant, Python code was used to exclude them from the text file.

Classification of articles

The next step of the text mining analysis was to use a Latent Dirichlet Allocation (LDA) model to the abstracts to classify the list of articles curated into topics. LDA is a probabilistic dimensionality reduction technique that can classify a document into two or more mutually exclusive classes or topics based on the frequency of the document's words. A distribution of words characterises each topic. The topic probabilities provide an explicit representation of a document12. The model assigned a dominant topic to each document and its percentage contribution,thus, a corpus of documents can be classified into topics depending on their subject.

Author´s review of documents

To determine if the text mining process could classify the articles in the same categories that an expert would do, the authors reviewed each article title, abstract and excerpt in the curated text file to classify them. Finally, one category (“Oral manifestations in patients with COVID-19”) was selected for full-text review, and a database was created to summarise the results. The study workflow is summarised in Fig. 1.

Ethics approval

Due to the nature of the study based on text mining, Ethical Approval was not required.

Figure 1
figure 1

Workflow of the analysis. The documents available in the COVID-19 Open Research Dataset (CORD-19) were filtered using the “Summary page COVID-19 risk factors” and the “Mining COVID-19 scientific paper” notebooks available on Kaggle (https://www.kaggle.com/mlconsult/summary-page-covid-19-risk-factors and https://www.kaggle.com/mobassir/mining-covid-19-scientific-papers, respectively). The text mining classification of the documents was compared to the author's classification of the list based on title, abstract and excerpt. Finally, eight papers in the category “Oral manifestations in patients with COVID-19” were selected after full-text review to extract data.

Results

Between January 1, 2020, to December 31, 2021, the CORD-19 dataset contained 694,366 documents. Only 1,554 were oral health/dentistry related, even though the oral cavity is an entrance and reservoir of the SARS-CoV-213,14 and the virus causes oral lesions15,16,17. The main subjects of the documents retrieved cover biosafety and changes in dental practice during the pandemic, the psychological implications for dentists and patients and the affectation of dental education during the COVID-19 pandemic.

Before downloading data and using Python commands, we removed 163,685 preprints, 361,713 duplicate documents based on title and 406 duplicates based on the abstract. Zero records had missing abstract. A keyword search was performed on the remaining 168,562 documents, yielding 5,075 articles that were downloaded: 42 for “buccal”, 1,294 for “dental”, 433 for “dentistry”, 797 for “dentist”, 17 for “odontology”, 2,489 for “oral” and 3 for “stomatognathic”. The remaining articles were written in English, French and Spanish and were included in further analysis. A file containing the 5,075 documents is available in Supplementary Table 1.

After the first run of text mining applying the LDA model, 1,554 papers related to oral health/dentistry were classified into five topics (Fig. 2). The topics 0, 1, 2, 3 and 4 contained 68, 826, 185, 242 and 233 papers, respectively. The size of topic 1 remains between 700 and 800 documents in successive runs of the LDA model adjusted to generate 8 or 10 topics. Therefore, the initial classification of five topics was kept.

Figure 2
figure 2

Word clouds of the five topics obtained after applying the Latent Dirichlet Allocation (LDA model). The font of the 30 more often words per topic reflects the decreasing frequency of appearance of each word.

Table 1 shows the number of papers classified by topic and each topic's top 10 words or descriptive terms. According to the top 10 words, the five topic subjects were (0) the oral cavity as an entrance for the coronavirus into the body, (1) dental clinical practice and services offered during the pandemic, (2) the risk of COVID-19 transmission during clinical practice, (3) the psychological implications of COVID-19 for a dentist and the impact of COVID-19 in dental schools and education, (4) oral manifestations of COVID-19, oral hygiene and oral cancer. A multidimensional scaling analysis shows that topics 1 and 3 were the most similar.

Table 1 Top 10 keywords or descriptive terms per topic and number of articles per topic.

Agreement between text mining and author's review

The first round of text mining resulted in 1,554 papers that the authors reviewed by title, abstract and excerpt. The authors classified the papers into 17 categories based on the content (Table 2). Therefore, the five topics generated by the LDA model during the text mining analysis were divided into more specific subjects. The main categories obtained by the author's review were “Biosafety in dental practice” (n = 186, 17.4%) and “Dental practice during the pandemic” (n = 182, 17.0%).

Table 2 Description of the 17 categories obtained after the authors reviewed the title, abstract and excerpt. The count and frequency of articles per category are shown.

Finally, the full text of 21 papers in the category “Oral manifestations in patients with COVID-19” was reviewed due to its clinical relevancy. This category is part of topic 4. A database was created to summarise the results of eight studies that passed the full-text review and contain the frequency of the oral manifestations identified in patients with COVID-19 (Table 3). In these eight studies, the most common oral manifestations were xerostomia (patient’s sensation of dry mouth) reported in six studies, burning mouth and mouth pain or swelling reported in five studies, impaired taste and ulcers in four studies, and dysphagia in three studies (Supplementary Table 2). The detection methods of the oral manifestations were clinical staff diagnosis and self-diagnosis. The two most frequent oral manifestations diagnosed by clinical staff were salivary gland ectasia (n = 46, 2.0%) and U-shaped lingual papillitis (n = 35, 1.5%). The most frequent oral manifestations identified by auto-diagnosis were xerostomia (n = 389, 17.1%) and mouth pain or swelling (n = 265, 11.7%) (Table 4).

Table 3 Characteristics of the included studies and the population evaluated.
Table 4 Oral manifestations identified in COVID-19 patients regarding the detection method.

Discussion

The COVID-19 pandemic has profoundly impacted society, not only on a health level but also on social aspects such as economics and education. Lockdown measures to contain the COVID-19 spread led to the closure of businesses, schools, commercial stores, and government offices. The practice and teaching of Odontology have been affected during the pandemic due to the high risk of exposure to SARS-CoV-218,19,20. Cancellation of most dental appointments during the pandemic, except for emergencies, and the astringent safety measures that odontologists must have taken to resume patient care during economic reactivation affected odontology worldwide. Virtual classes, fewer patients in clinics, and the inability to conduct research with patients were challenges for dental schools21.

Due to the clinical relevancy, we selected the topic “Oral manifestations in patients with COVID-19”, among 17 categories, for a deeper analysis. The two most common oral alterations in patients with COVID-19 diagnosed by clinical staff were salivary gland ectasia and U-shaped lingual papillitis. The main oral manifestations identified by auto-diagnosis were xerostomia and mouth pain or swelling. Early infection of the salivary glands by SARS-CoV-2 may be associated with salivary gland ectasia. This hyperinflammation of the salivary gland is significantly related to protein C-reactive and LDH levels, both of which are COVID-19 severity biomarkers22. Patients’ evaluation has shown that salivary gland ectasia is associated with a more severe course of COVID-1923. U-shaped lingual papillitis is an inflammation of the tongue’s papillae. It could be caused by direct inflammation or drying of the oral mucosa or poor oral health24.

The presence of xerostomia in patients with COVID-19 has been reported previously25,26,27. Xerostomia is a sign of dehydration, which can occur secondary to infections28. Early infection of the salivary glands by the SARS-CoV-2, affecting their function and leading to changes in the flow and composition of saliva, is one possible cause of xerostomia in the early stages of COVID-1926. It has been hypothesised that SARS-CoV-2’s neuroinvasive and neurotropic potential causes xerostomia, as it can enter the nervous system via angiotensin-converting enzyme 229. It is also postulated that dysgeusia, a common manifestation of COVID-19, may be caused by xerostomia because changes in the amount and composition of saliva can cause gustatory dysfunction30. Xerostomia in patients with COVID-19 appears 1 or 2 days before other symptoms and may be helpful for early diagnosis and treatment implementation, preventing virus transmission.

The second most frequent oral manifestation by auto-diagnosis was mouth pain and swelling. The studies reviewed showed that patients primarily experienced pain or swelling in the salivary glands, jaw, or chewing. The muscles were the primary source of the oral pain. This is consistent with the observation that 76.4% of patients with COVID-19 had myalgia. They may feel muscle pain in the upper and lower jaw. Another source of pain can be headaches (70% of patients). Headache stimulates the trigeminal nerve, causing the release of neuropeptides that cause blood vessel dilation, inflammation, and pain13,14,15,16,17,31,32,33,34.

The data mining strategies, such as the one applied here, help analyse a vast amount of information; however, they require a human understanding of the keyword combinations and information printed by the method to avoid missing relevant documents or including thousands of irrelevant articles to reach the aim of the study.

The identification of 17 categories or subjects reflects the variety of themes covered within a single paper; for example, an article can focus on biosafety, psychological repercussions, and epidemiology simultaneously. When an article contains more than one paper, the LDA model assigns the document to more than one topic, although one of these topics may dominate the others. Therefore, the topics can overlap as topics 1 and 3 do in the present study. Another consequence of the subject variety within a paper is that each topic obtained by the model included more than one subject. For example, topic 4 was related to oral manifestations of COVID-19, oral hygiene and oral cancer.

Zengul et al. (2021) use a text mining approach to classify the NIH COVID-19 Portfolio as of November 2021 based on abstracts and titles. They identified 11 major research areas (topics), including “Epidemiologic Modelling”, “Mechanism of Disease”, “Protection/Prevention”, “Mental/Behavioural Health” and “Detection/Testing”. Additionally, they found that only five of the 11 abstract-based topics had a significant correlation with title-based topics and recommended revising the use of titles as the first step in developing an evidence-based medicine analysis portfolio10. None of the topics covered by Zengul et al. was related to oral health or dentistry.

Our analysis also reveals a challenge in selecting and classifying papers based on titles, abstracts, or excerpts, as it is common in systematic reviews and evidence-based medicine/dentistry. To build the corpus of text mining analysis in dentistry, we suggest using keywords or MeSH terms as the first filter of articles. The limitations of this study included a unique source of information; the literature analysis was based on the COVID-19 Open Research Dataset (CORD-19) database.

A perspective of this study is to apply text mining tools to identify the keywords that distinguished the 17 categories generated by the “author's review” and use those keywords as the basis for applying machine learning algorithms to construct a predictive model that can improve the selection and classification of the papers, reducing the time and necessity of curation of the documents.

Conclusions

Using text mining to identify oral health/dentistry-related articles in a large dataset of COVID-19 publications, we found that patients with COVID-19 experienced a wide range of oral changes. The most common oral manifestations were those affecting the salivary glands, such as xerostomia or salivary gland ectasia, followed by oral pain and inflammation.

Text mining is a helpful tool for analysing and sorting massive document datasets. However, it was necessary to combine text mining with the author´s review by title, abstract and excerpts to avoid the loss of data or the inclusion of unnecessary information.