Introduction

Nasopharyngeal carcinoma (NPC) is one of the most common malignancies of the head and neck, particularly in East and Southeast Asia1,2. Nonspecific early symptoms of NPC often lead to delayed diagnosis and, consequently, a poorer prognosis3,4. The 10-year survival rate of advanced cases typically ranges from 50% to 70%. In contrast, the 5-year survival rate for early NPC cases can approach 94%, highlighting the importance of early detection5,6,7,8,9. Thus, devising a technique for the early detection of NPC during clinical examinations is the primary aim of this study.

Nasal endoscopy plays a crucial role in the early detection of NPC5,10. However, the accuracy of this examination relies heavily on the medical experience and expertise of the operators. Non-otolaryngology specialists, such as primary care doctors, emergency doctors, general practitioners, and paediatricians, may encounter difficulties in interpreting endoscopic images owing to gaps in specialist training and expertise. They often overlook the early signs of NPC and confuse them with those of common nasal or nasopharyngeal diseases. This frequently leads to missed diagnoses, misdiagnoses, and delayed referrals, resulting in patients missing critical treatment windows11,12,13. This issue is particularly pronounced in low- and middle-income countries where healthcare resources are limited and disease awareness is inadequate. Moreover, patients frequently overlook early NPC symptoms such as headaches and nasal congestion14,15. Concurrently, financial constraints also lead to delayed medical consultations, increasing the risk of missing crucial early diagnosis and treatment9,16. Consequently, to improve early detection rates and patient prognosis, it is essential to develop a novel, easy-to-use, and affordable method for the early detection of NPC using endoscopic images.

Intelligent diagnostic solutions based on smartphones have enormous potential in the medical field, especially given the rapid growth of smartphone capabilities and the widespread application of deep learning algorithms17. By analysing medical images immediately, these mobile health applications have demonstrated exceptional accuracy and efficiency in early disease identification and are becoming an emerging trend in healthcare18. For example, advances have been made in the early diagnosis of disorders such as keratitis, biliary atresia, ear infections, skin cancer, and lupus, with some applications outperforming human experts19,20,21,22,23,24. However, deep-learning smartphone applications for NPC remain an untapped research topic. Given the complexity of identifying this malignancy, this gap highlights the considerable potential of such applications. Therefore, exploring smartphone-based intelligent diagnostic methods for NPC promises to provide unique solutions for improving diagnostic accuracy and accessibility, with considerable scientific and practical value.

In this study, we retrospectively collected 39,340 endoscopic white-light images of 2134 NPC patients and 11,824 patients without NPC from three centres in high-incidence areas of NPC and developed eight advanced deep learning models with different architectures. Through validation, testing, and comparison with nine experienced otolaryngologists, we ultimately developed a smartphone application based on the Swin Transformer model, called Nose-Keeper, to improve the accuracy and efficiency of healthcare workers (especially primary healthcare providers) in diagnosing NPC, raise public awareness of NPC, and refer patients to professional medical institutions in a timely manner.

Results

Internal evaluation of various deep learning models

Table 1 presents the average overall accuracy, standard deviation, and corresponding 95% CI of the eight models on the internal dataset. The results indicate that the eight models achieved encouraging results in diagnosing seven types of nasal endoscopic images using transfer learning strategies and a large-scale dataset. The average overall accuracy of all models exceeded 0.92. SwinT performed the best among the eight models, with an average overall accuracy of 0.9515, whereas ResNet had the lowest, at 0.9221. In terms of standard deviation, the most stable model was MaxViT, followed by SwinT, while PoolF exhibited the highest standard deviation and thus the worst stability. In addition, Supplementary Table 1 reports the time required by each model during the experiments.

Table 1 Average overall accuracy of different models

Table 2 reports the precision, sensitivity, specificity, and F1-score of the eight models for diagnosing NPC. The experimental results showed that, with the sole exception of ResNet’s precision, the developed models exceeded 0.9900 on all four NPC indicators. For sensitivity, SwinT achieved the best result, reaching 0.9984 (±0.0023) (0.9939–1.0000). For precision, specificity, and F1-score, PoolF achieved the best results, reaching 0.9959 (±0.0034) (0.9892–1.0000), 0.9992 (±0.0006) (0.9980–1.0000), and 0.9969 (±0.0012) (0.9945–0.9993), respectively. Table 3 reports the performance of the eight models in diagnosing the five non-NPC diseases and normal samples. Based on these evaluation metrics and the potential impact of model architecture on external-test performance, we chose SwinT, PoolF, Xception, and ConvNeXt as candidate models for the smartphone application.

Table 2 NPC diagnosis performance (average, standard deviation, and 95% CI) of various models
Table 3 Non-NPC diagnosis performance (average, standard deviation, and 95% CI) of various models

We initialised each candidate model with the best weights from the training set and then used the corresponding internal validation set to determine the optimal temperature for each candidate model under a temperature scaling strategy. Figure 1 shows the changes in the calibration metrics (Brier score and Log-Loss) of each model on the internal test set. The experimental results indicate that the pre- and post-calibration results of SwinT (Fig. 1a, b) were the best among the candidate models, whereas the remaining candidate models were clearly inferior (Fig. 1c–h).

Fig. 1: Changes in the calibration metrics of the four candidate models when diagnosing different categories.
figure 1

The changes in the Brier score and Log-Loss of each model for each category are plotted. a Brier score of SwinT. b Log-Loss of SwinT. c Brier score of Xception. d Log-Loss of Xception. e Brier score of PoolF. f Log-Loss of PoolF. g Brier score of ConvNeXt. h Log-Loss of ConvNeXt. In particular, “ALL” on the X-axis denotes the calibration performance on the entire external test set.

External evaluation of each candidate model

Four candidate models were tested using the external test set from LZH to further evaluate their performance in real-world clinical settings. Figure 2 shows the overall accuracy and confusion matrix of the four candidate models on the external test set. All predicted results are shown after calibration. The confusion matrices were used to analyse the sensitivity and specificity of each model for each category. The experimental results indicate that SwinT (Fig. 2a) achieved state-of-the-art performance, far superior to Xception (Fig. 2b), PoolF (Fig. 2c) and ConvNeXt (Fig. 2d). Concretely, it achieved the highest overall accuracy, reaching 92.27% (95% CI: 90.66%–93.61%). For NPC, the sensitivity and specificity of SwinT reached 96.39% (95% CI: 92.74%–98.24%) and 99.91% (95% CI: 99.47%–99.98%), respectively (Fig. 2a). For the non-NPC categories, the sensitivity and specificity of SwinT exceeded 86.00% and 95.00%, respectively. Figure 3 shows the ROC curves of the four candidate models. In terms of the ROC curve, SwinT was also the best model (Fig. 3a): its AUCs for all seven categories were greater than 0.9900, and for NPC its AUC reached 0.9999 (95% CI: 0.9996–1.0000). In contrast, the AUCs of Xception (Fig. 3b), PoolF (Fig. 3c), and ConvNeXt (Fig. 3d) for NPC were 0.9994 (95% CI: 0.9985–0.9999), 0.9916 (95% CI: 0.9856–0.9958), and 0.9989 (95% CI: 0.9976–0.9997), respectively.

Fig. 2: Confusion matrix of four candidate models in the images from LZH.
figure 2

The figure also reports the overall accuracy of each model on the external test set and the corresponding confidence interval. a The confusion matrix of the SwinT. b The confusion matrix of the Xception. c The confusion matrix of the PoolF. d The confusion matrix of the ConvNeXt.

Fig. 3: Receiver operating characteristic curve (ROC) and optimum Youden-index results of candidate models.
figure 3

a The ROC and optimum Youden-index results of SwinT. b The ROC and optimum Youden-index results of Xception. c The ROC and optimum Youden-index results of PoolF. d The ROC and optimum Youden-index results of ConvNeXt.

However, compared with the internal test set, the performance of the four candidate models decreased on the external test set. We tentatively attribute this to the different imaging equipment and image-acquisition procedures used for the external test set. Figure 3 also presents the optimum Youden index of each model for the different categories to compare the models further. Experimental verification showed that SwinT was the best model for diagnosing NPC, with a Youden index of 0.992. SwinT also demonstrated excellent performance in diagnosing AH, AR, and CRP, with Youden indices of 0.991, 0.925, and 0.924, respectively. These results fully demonstrate the application potential of SwinT in diagnosing different categories. Overall, the experimental results on the external test set indicated that SwinT performed the best among the four candidate models; it was therefore chosen for deployment in the smartphone application.

Robustness analysis of the SwinT

Figure 4 reports the results of the robustness analysis of SwinT. Figure 4a illustrates examples from the external test set under the various transformation strategies. Figure 4b–f details the performance metrics (overall accuracy, sensitivity, precision, specificity, and F1-score) of SwinT across the 12 augmented datasets. SwinT showed good robustness to rotation, likely benefiting from the random rotation augmentation used during training. Under Gaussian blur, the performance of the model decreased as the blur level increased, but the overall performance remained relatively stable. With a slight decrease (Brightness I) or increase (Brightness II) in brightness, the accuracy of the model was minimally affected, and its overall performance remained relatively steady. However, an excessive increase in brightness caused a significant decrease in performance, affecting the sensitivity (AH, CRP, NPC, and RHI), precision (AH), F1-score (AH, CRP, and NPC), and specificity for AH. Similarly, slight decreases (Saturation I) and increases (Saturation II) in saturation did not significantly affect accuracy, whereas substantial saturation enhancement (Saturation III) led to notable deviations, particularly in the sensitivity for AH, precision for CRP and RHI, F1-score for AH and CRP, and specificity for CRP and RHI.

Fig. 4: Performance comparison of SwinT on the external test dataset using 12 image transformations.
figure 4

a Examples of 12 image transformations. b Sensitivity of SwinT for external test set. c Precision of SwinT for external test set. d F1-score of SwinT for external test set. e Specificity of SwinT for external test set. f Overall accuracy of SwinT for external test set.

Heatmaps and performance comparison between SwinT and otolaryngologists

Figure 5 visually explains the model’s internal decision-making mechanism and presents the results of the human-machine comparison experiment. Figure 5a shows the heatmaps generated by SwinT using the Grad-CAM algorithm for the seven types of nasal endoscopic images. The experimental results indicated that SwinT could effectively focus on the key areas of each type of image. Visually, the colour distribution of the heatmaps conformed to the professional insights of otolaryngologists. For NPC, the Grad-CAM heatmaps effectively highlighted the lesion area.

Fig. 5: The heatmaps and Human-machine comparison results.
figure 5

a The heatmaps of different nasal endoscopic images generated by Grad-CAM. b Comparison of the sensitivity between SwinT and otolaryngologists. c The ROC curve of SwinT, the optimum Youden-index results of SwinT and the otolaryngologists.

Figure 5b shows the performance of SwinT and the nine otolaryngologists in diagnosing the different diseases. In Fig. 5b, the closer the colour is to green or yellow, the more accurately the doctor or model diagnosed the disease. The average sensitivity of the nine physicians for NPC was 0.8927. Among them, the best-performing doctor for NPC had eight years of clinical experience (sensitivity 0.9433), and the worst-performing had three years of clinical experience (sensitivity 0.8247). SwinT clearly outperformed all experts in diagnosing NPC, and likewise outperformed all experts for AR. Because SwinT was prone to misjudging CRP as AH, its performance on CRP was not as good as that of the experts. For AH, SwinT was superior to most physicians, and for DNS, NOR, and RHI it was superior to some of them. Figure 5c and Supplementary Fig. 1 show the performance differences between SwinT and the nine otolaryngologists in diagnosing the various types of endoscopic images in more detail. Based on the optimum Youden index results of SwinT and the otolaryngologists, we concluded that SwinT outperformed all otolaryngologists in diagnosing AH, AR, NPC, and RHI. For the remaining three types of endoscopic images, SwinT was slightly inferior to some otolaryngologists (clinical experience: five to nine years).

Display of the smartphone application

The main function of Nose-Keeper (Fig. 6) is to read real-time images captured by a nasal endoscope connected to an Android phone via a Micro USB interface, or to load local nasal endoscopic images (for example, the user can capture images with another endoscope device and upload them to the smartphone album) (Fig. 6a). Subsequently, by clicking the one-click detection button (Fig. 6b), the user obtains the AI model’s diagnosis of the image. In addition, the results page displays a heatmap corresponding to the original image to enhance the safety of the application and actively remind users to pay attention to the diseased area (Fig. 6c). Nose-Keeper can also read multiple endoscopic images simultaneously and use a voting mechanism to further improve prediction accuracy (Supplementary Note 6; see the sketch below). In particular, we included reference images of various diseases and common medical knowledge to improve users’ understanding of diseases and medical procedures. Such an intelligent application can potentially help the many primary care providers who lack clinical experience and professional knowledge in diagnosing nasal diseases, and help the public residing in high-risk areas make a preliminary judgment on whether a captured nasal endoscopic image shows NPC or common nasal cavity and nasopharynx diseases. We tested the running speed of Nose-Keeper on four different Android smartphones (i.e. Xiaomi 14, Xiaomi 12S Pro, HUAWEI nova 12, and HUAWEI mate 60) at a network speed of 100 Mb/s; each diagnosis took ~0.5 s to 1.1 s. For more detailed feature introductions and user pages of Nose-Keeper, please refer to Supplementary Fig. 2 and Supplementary Video 1.

Fig. 6: The application page and usage process of Nose-Keeper.
figure 6

The usage process mainly comprises logging in, uploading an image, submitting it for identification, and finally outputting the detection report. a The home page after login. b The page after uploading the image. c The result page after data processing.

Discussion

To the best of our knowledge, this study is the first to develop a deep learning-based smartphone application, named Nose-Keeper, to effectively diagnose NPC and non-NPC conditions. To ensure the practicality of Nose-Keeper, we retrospectively collected 6,014 NPC white-light endoscopic images and 33,326 white-light endoscopic images of common diseases of the nasal cavity and nasopharynx and of normal nasal cavities from three hospitals, and trained eight different deep learning models. To shorten training time and improve generalisation as much as possible, we used a popular transfer learning strategy. Through extensive evaluation and testing (including model metric comparison, model calibration, robustness analysis, and human-machine comparison), we found that the developed SwinT had state-of-the-art application potential and encapsulated it in Nose-Keeper. Compared with nine otolaryngologists with different levels of diagnostic experience, Nose-Keeper outperformed all of them in diagnosing NPC. For diseases other than NPC, the diagnostic performance of the model was comparable to that of most physicians. In addition, we used the Grad-CAM algorithm in the application to visually display the image areas that influence the model’s decisions, which effectively reminds users to pay attention to the lesion area. Because Nose-Keeper is deployed on a cloud server, its operation is not constrained by the user’s hardware; users only need an Internet connection to obtain diagnostic results from Nose-Keeper in real time.

Previously, researchers in the field of computer vision used CNNs to process various images; more recently, they have begun to focus on the potential of ViTs. The Transformer architecture underlying ViTs was initially applied to natural language processing systems, such as the recently popular large language models. Compared with traditional CNNs, ViTs rely on a self-attention mechanism to achieve better results in image processing. Considering that ViTs are attracting increasing attention and may be better suited to processing endoscopic images, we trained three ViTs, three CNNs, and two hybrid models in this work to obtain the best model. The experimental results demonstrated that most of the selected ViTs were better than the convolutional neural networks for diagnosing nasal endoscopic images. Notably, we found that SwinT achieved state-of-the-art performance on both the internal and external test sets. In terms of its internal mechanism, SwinT is essentially a hierarchical Transformer that uses shifted windows. SwinT constructs a hierarchical representation by starting from small patches and gradually merging neighbouring patches in deeper Transformer layers25. By employing shifted window-based self-attention, it calculates self-attention only within each local window, which greatly reduces computational complexity. Meanwhile, in consecutive SwinT blocks, the shifted windowing scheme allows cross-window connections, i.e. the model shifts the windows in a fixed pattern so that lesion feature information flows between different windows. This unique mechanism enables SwinT to retain the ViTs’ advantage in modelling long-range dependencies while effectively capturing local information in nasal endoscopy images, thereby improving accuracy in diagnosing NPC and non-NPC diseases. These findings provide practical guidance for nasal endoscopy researchers.

The World Health Organization (WHO) Global Observatory for eHealth (GOe) defines mHealth as medical and public health practice supported by mobile devices26. mHealth has the potential to change healthcare and support public health and primary healthcare27. With smartphones booming in daily life, combining advanced medical technology with mHealth to manage diseases has become an unstoppable trend28. Smartphones are now an indispensable part of daily life, and mHealth applications have found a place in healthcare systems29. Globally, smartphone penetration reached 68% in 202230, and in developing countries the number of smartphone owners is constantly increasing, driving significant social and economic change31. In 2022, nearly nine in ten internet users in Southeast Asia, a high-incidence area of NPC, used smartphones32. All these facts prompted us to develop Nose-Keeper, a smartphone application that utilises artificial intelligence and may drive the development of primary healthcare. With Internet support, especially in developing countries and areas with a high incidence of NPC, Nose-Keeper can be used to improve nasal health care and reduce medical costs. It can be predicted that, as smartphone and Internet penetration increase further, the availability of Nose-Keeper will improve greatly, allowing it to provide fast, convenient preliminary diagnostic services, without requiring professional expertise, to as many patients as possible.

NPC is a severe public health problem in underdeveloped Southeast Asian countries. In Indonesia, 13,000 new cases of NPC are reported every year33, and NPC is the fifth and ninth most common cancer in Malaysia and Vietnam, respectively34,35. Unfortunately, NPC is characterised by high invasiveness and early metastasis. Owing to the insufficient number of experts in these countries, the limited experience of most primary care providers, and patients’ lack of medical awareness and adequate financial income, misdiagnosis and delayed diagnosis often occur, seriously threatening patients’ lives; many patients with NPC are already at an advanced stage when they first seek treatment36. Clinically, NPC is considered the result of the interaction between Epstein–Barr virus (EBV) infection and genetic and environmental factors (such as drinking, smoking, and eating salted fish)37. Therefore, during actual consultations, general practitioners or doctors in primary care institutions can jointly consider Nose-Keeper’s diagnostic results, risk factors, and EBV antibody results, thereby further enhancing the reliability of the diagnosis and improving timely referrals of patients with NPC. The general public can use Nose-Keeper as a daily nasal health management tool: specifically, they can purchase a high-resolution electronic nasal endoscope on the market, learn to use it under the guidance of professionals, and regularly upload endoscopic images to Nose-Keeper. In addition, our experiments show that Nose-Keeper is even more sensitive to NPC than a clinician with nine years of experience; it may therefore also serve as an auxiliary tool to reduce the work stress of experts. Overall, the AI-based Nose-Keeper has the following significant advantages for the early detection of NPC in clinical practice. First, unlike human otolaryngologists, AI models do not suffer from fatigue or performance variability caused by human factors such as physical and mental stress; they apply the same standards consistently across all cases, which is especially beneficial in mass screening scenarios where consistency is critical. Second, AI models can analyse large datasets more efficiently and rapidly than humans. The large-scale dataset provides the models with rich learning material, enabling them to learn to detect and diagnose ENT diseases effectively in a short period of time, whereas human clinicians require extensive learning, training, and long-term accumulation of experience. Third, AI models, particularly deep learning models, are adept at recognising complex patterns in endoscopic images, such as subtle variations and early characteristics of ENT diseases that even experienced otolaryngologists might miss. This capability is crucial for the early detection of diseases from medical images, where early signs are easily overlooked. Fourth, AI can significantly reduce diagnosis time, which is critical when managing large numbers of patients. AI can process and analyse data much faster than human experts, enabling quicker decision-making in clinical settings where time is of the essence, such as emergency care.

Recently, several deep-learning studies on NPC have been published. In 2018, Li et al.38 used 28,966 white-light endoscopic images to develop a deep learning model for detecting the normal nasopharynx, NPC, and other nasopharyngeal malignant tumours; its performance surpassed that of experts, with an overall accuracy of 88.7%. In 2022, Xu et al.39 developed a deep learning model using 4,783 nasopharyngoscopy images to identify NPC and non-NPC (inflammation and hyperplasia). In 2023, He et al.40 developed a deep learning model using 2,429 nasal endoscopy video frames and the You Only Look Once (YOLO) algorithm for real-time detection of NPC in endoscopy videos; its sensitivity for NPC detection was 74.3%. Compared with these studies, our work has the following highlights. First, our dataset consists of 39,340 white-light endoscopic images covering both NPC and six non-NPC categories. It is the largest such dataset to date, and the images come from three hospitals in areas with a high incidence of NPC, making the dataset more representative of real-world data. Second, by evaluating many deep learning models with different architectures, we found that Vision Transformers trained with the transfer learning strategy outperformed convolutional neural networks trained the same way for diagnosing NPC, providing model development guidance for subsequent researchers. The third and most important point is that we developed Nose-Keeper, the first smartphone-based cloud application for NPC diagnosis, which is convenient to operate. To ensure the safety of Nose-Keeper, we used the Grad-CAM algorithm to visually explain the decision-making process of its model and compared the model with nine otolaryngologists. In addition, Nose-Keeper can identify five common diseases that resemble NPC in appearance and clinical manifestation. These category settings make Nose-Keeper more reasonable and reliable. In fact, in areas with a high incidence of NPC, the number of non-NPC patients far exceeds the number of NPC patients; Nose-Keeper can therefore also provide convenient primary diagnostic services for numerous non-NPC patients and reduce their concerns about NPC, thereby alleviating the burden on the local medical system. Likewise, people in low-incidence areas can use Nose-Keeper as a nasal health management tool in their daily lives.

The application of Nose-Keeper in the healthcare of developing countries is expected to have a significant positive impact. Its primary advantage is that it significantly improves the early diagnosis of NPC and other diseases. Given that developing countries may lack skilled medical professionals and advanced medical facilities, the preliminary screening features of smartphone applications can substantially enhance early detection rates. This is especially important for decreasing misdiagnoses and missed diagnoses, which saves medical resources and reduces reliance on more expensive and sophisticated treatment plans. In addition, Nose-Keeper can be used as an educational tool to raise public knowledge of NPC and other frequent nasopharyngeal disorders, particularly in places with limited medical education resources. Primary healthcare professionals (PHCPs) play a significant role in developing countries, and for them Nose-Keeper can serve as an essential auxiliary tool for accurate and efficient disease diagnosis. In the future, Nose-Keeper will incorporate advanced artificial intelligence algorithms, improve its ability to recognise a broader spectrum of nasal and ENT ailments, and serve as a comprehensive diagnostic tool for various nasal health issues. Simultaneously, artificial intelligence will be employed to deliver personalised health advice based on user-specific data, such as lifestyle and environmental changes, to lower the risk of illness. By analysing anonymised aggregate data from users, Nose-Keeper can discover trends and patterns in NPC and other ENT disorders, providing helpful information for public health policies and resource allocation, particularly in areas with high disease incidence. Finally, recognising the value of education in health management, Nose-Keeper will undertake an education campaign on NPC and general nasal health to raise public awareness and early detection rates. The adoption of these functions and goals will allow Nose-Keeper to play an increasingly crucial role in the healthcare systems of developing countries.

We acknowledge several limitations in our study. First, the absence of prospective testing, owing to dataset constraints, may influence the confidence that healthcare professionals and ordinary users place in Nose-Keeper. Second, Nose-Keeper currently lacks an image quality control function, which can result in unreliable outputs when processing images of substandard quality, such as those with blur, uneven lighting, or improper angles. Third, the datasets were collected with professional equipment in medical settings; biases inherent in our dataset, such as variations in image quality, imaging protocol, and imaging view, might therefore limit the applicability of Nose-Keeper in non-clinical environments. The model’s performance still needs to be validated on images captured by household endoscopes, which are not yet included in our dataset. Fourth, Nose-Keeper requires an Internet connection to reach the cloud-based AI model for lesion detection; in developing countries where Internet access is not yet widespread, its availability is relatively limited. There is therefore a need to develop lightweight AI models that eliminate the need for high-performance devices in our next-generation Nose-Keeper, achieving both cloud and local deployment. Fifth, all images used in this study were obtained from Chinese hospitals. For safety reasons, it is crucial to test Nose-Keeper with endoscopic data from people living in other high-risk areas (such as Vietnam, Indonesia, and Malaysia). Future work will focus on prospective testing, developing an independent image quality control system, collecting images of different populations captured with household nasal endoscopes, and constructing a larger dataset covering various image qualities. Additionally, recognising that NPC diagnosis must also consider a patient’s clinical information, such as gender, age, dietary habits, and genetic factors, future studies will aim to develop multimodal deep learning models that integrate these variables.

The Transformer model has been widely applied in multimodal learning tasks with great success, becoming the backbone of multimodal models41. At the same time, using multimodal clinical information for medical diagnosis has become common practice in modern medicine42. Notably, our work has fully validated that the Transformer model can effectively diagnose diseases such as NPC. Based on these insights, we have developed a strategic roadmap for integrating multimodal data (Fig. 7, Supplementary Note 4) to enhance the practicality of Nose-Keeper in the future. We believe that, with the enrichment of multimodal endoscopic datasets and advancements in smartphones and deep learning technologies, Transformer-based multimodal models will be able to diagnose nasal and nasopharyngeal lesions comprehensively. More specifically, the future version of Nose-Keeper will use Transformer-based multimodal models to integrate patients’ clinical symptoms and risk factors (text input) with endoscopic images (image input) from different regions of the nasal cavity (such as the inferior turbinate, middle turbinate, and nasopharynx), effectively detecting whether patients have NPC or other nasal diseases. In particular, the fused multimodal features can be further utilised to generate image captions, which can help PHCPs and the public better understand various lesions43. Another key goal for the future Nose-Keeper is to utilise model compression techniques (such as knowledge distillation, model pruning, and model quantisation) to reduce the device performance requirements of the application, thereby achieving local deployment and efficient cloud deployment.

Fig. 7: A Transformer-based strategic roadmap for enhancing the diagnostic capability and clinical utility of Nose-Keeper.
figure 7

The figure shows the multimodal information-fusion strategy, combining clinical information, planned for subsequent studies, providing a research direction for relevant researchers.

In summary, this study represents a significant advancement in NPC diagnostics through the development of Nose-Keeper, a smartphone cloud application based on the cutting-edge Swin Transformer. Our findings demonstrate that Nose-Keeper surpasses nine professional otolaryngologists in sensitivity for diagnosing NPC. This was achieved by analysing a diverse and extensive multi-centre nasal endoscopic dataset and by testing eight different deep learning models. Nose-Keeper’s user-friendly interface enables both medical professionals and the general public to upload endoscopic images and receive real-time AI-based preliminary screening results. Nose-Keeper is especially valuable in regions with limited access to specialised nasal endoscopy services, since it not only improves primary diagnosis accuracy but also raises awareness of NPC among primary healthcare providers and residents of high-risk areas. Furthermore, the study lays the groundwork for future research into mobile healthcare and cancer detection, expanding the potential impact of DL-based smartphone apps across other medical fields.

Methods

This study comprised three main parts: collecting datasets, constructing deep learning models, and developing the mobile application. Figure 8 illustrates this workflow; the details are presented comprehensively in the following subsections.

Fig. 8: The flow chart of our work.
figure 8

This flowchart illustrates in detail the process of dataset construction and how to develop Nose-Keeper using deep learning models. a The collection process of datasets. b The development process of deep learning models and our smartphone application.

Construction of the multi-centre dataset

In this study, we constructed a dataset from three hospitals located in high-risk areas of NPC. We retrospectively collected white-light nasal endoscopic images of patients with NPC from the Department of Otolaryngology of the Second Affiliated Hospital of Shenzhen University (SZH) and the Department of Otolaryngology of Foshan Sanshui District People’s Hospital (FSH) between 1 January 2014 and 31 January 2023. Given that the early clinical symptoms of NPC (such as headache, cervical lymph node enlargement, nasal congestion, and nosebleeds) are similar to those of common diseases of the nasal cavity and nasopharynx44, and that rhinosinusitis, allergic rhinitis, and chronic sinusitis may be risk factors for NPC45,46,47, we also collected white-light nasal endoscopic images of non-NPC patients visiting SZH and FSH during the same period to develop the deep learning models. In addition, Leizhou People’s Hospital (LZH) provided nasal endoscopic images of patients who visited its Department of Otolaryngology between 1 January 2015 and 30 April 2022. From an application perspective, including images of non-NPC patients in the dataset can effectively improve the comprehensiveness and accuracy of the deep learning model’s diagnoses of nasal endoscopic images. The collected images were divided into seven categories (Fig. 9): NPC (Fig. 9a), adenoidal hypertrophy (AH) (Fig. 9b), allergic rhinitis (AR) (Fig. 9c), chronic rhinosinusitis with nasal polyps (CRP) (Fig. 9d), deviated nasal septum (DNS) (Fig. 9e), normal nasal cavity and nasopharynx (NOR) (Fig. 9f) and rhinosinusitis (RHI) (Fig. 9g). Table 4 presents the detailed characteristics of the dataset.

Fig. 9: Some typical endoscopic images of different diseases.
figure 9

These endoscopic images are given from different angles and parts for each disease type. a Nasopharyngeal carcinoma (NPC). b Adenoidal hypertrophy (AH). c Allergic rhinitis (AR). d Chronic rhinosinusitis with nasal polyps (CRP). e Deviated nasal septum (DNS). f Normal nasal cavity and nasopharynx (NOR). g Rhinosinusitis (RHI).

Table 4 The characteristics of the datasets from the various hospitals (Male: M; Female: F)

This study was approved by the Ethics Committee of the Second Affiliated Hospital of Shenzhen University, the Institutional Review Board of Leizhou People’s Hospital and the Ethics Committee of Foshan Sanshui District People’s Hospital (reference numbers: ‘BY-EC-SOP-006-01.0-A01’, ‘BYL20220531’ and ‘SRY-KY-2023045’) and adhered to the principles of the Declaration of Helsinki. Owing to the retrospective nature of the study and the use of de-identified data, the Institutional Review Boards of SZH, FSH and LZH waived the requirement for informed consent. Supplementary Note 5 presents more detailed ethics declarations and procedures.

Diagnostic criteria of the nasal endoscopic images

In this study, to ensure the accuracy of the endoscopic image labels, three otolaryngologists with over 15 years of clinical experience set the diagnostic criteria based on practical clinical diagnostic processes and the reference literature. Specifically, each expert combined a patient’s endoscopic examination results with the corresponding medical history, record of clinical manifestations, computed tomography results, allergen testing (skin prick tests or serum-specific IgE tests), lateral cephalograms, histopathological examination results, and laboratory test results (such as nasal smear examination) to review and confirm the diagnosis attached to each existing nasal endoscopic image. A diagnosis based on the aforementioned medical records was considered the reference standard for this study. Our otolaryngologists independently reviewed all data in detail before any analysis and validated that each endoscopic image was correctly matched to a specific patient. Patients with insufficient diagnostic medical records were excluded. During the review process, when an expert doubted the diagnosis of a particular patient, the three experts jointly assessed the patient’s medical records and examination results to decide whether to include the patient in this study. The standard diagnoses for the seven types of nasal endoscopic images in the dataset were as follows: (1) NPC: the standard diagnostic label was assigned directly from the histopathological examination results48,49; (2) rhinosinusitis: further combining the patient’s medical history, clinical manifestations, and computed tomography examination50; (3) chronic rhinosinusitis with nasal polyps: further combining the patient’s medical history, clinical manifestations, computed tomography results, and pathological tissue biopsy results51,52; (4) allergic rhinitis: further combining the patient’s medical history, clinical manifestations, and allergen testing or laboratory methods53,54,55; (5) deviated nasal septum: further combining the patient’s medical history and clinical manifestations, with secondary analysis and evaluation of the shape of the nasal septum56; (6) adenoid hypertrophy: further combining the patient’s medical history, clinical manifestations, or lateral cephalograms57,58; (7) normal nasal cavity and nasopharynx: further combining the patient’s medical history and clinical manifestations. The nasal mucosa of a normal nasal cavity should be light red, and its surface should be smooth, moist, and glossy; the nasal cavity and nasopharyngeal mucosa show no congestion, oedema, dryness, ulcers, bleeding, vasodilation, neovascularisation, or purulent secretions. Table 5 details the distribution of image categories across hospitals.

Table 5 The description of all nasal endoscopic white light images used in this study

Deep transfer learning models

Transfer learning aims to improve the model performance on new tasks by leveraging pre-learned knowledge of similar tasks59. It has significantly contributed to medical image analysis, as it overcomes the data scarcity problem and saves time and hardware resources.

In this study, we effectively combined deep learning models, which are popular in artificial intelligence, with this powerful strategy. To build an optimal NPC diagnostic model, we studied Vision Transformers (ViTs), convolutional neural networks (CNNs), and hybrid models based on the latest advances in deep learning. Among them were (1) ViTs: Swin Transformer (SwinT)25, Multi-Axis Vision Transformer (MaxViT)60, and Class Attention in Image Transformers (CaiT)61. These models were selected for their ability to model long-range dependencies and their adaptability to various image resolutions, which are crucial for medical image analysis. These models represent the latest shifts in deep learning from convolutional to attention-based mechanisms, providing a fresh perspective on feature extraction. (2) CNNs: ResNet62, DenseNet63, and Xception64. CNNs have gradually become the mainstream algorithm for image classification since 2012, and have shown very competitive performance in medical image analysis tasks65. ResNet and DenseNet excel in addressing the vanishing gradient problem and strengthening feature propagation. Xception achieves a good trade-off between parameter efficiency and feature extraction capability using depth-wise separable convolutions. (3) Hybrid Models: PoolFormer (PoolF)66 and ConvNeXt67. PoolFormer enhances feature extraction by leveraging spatial pooling operations, while ConvNeXt incorporates ViT-inspired design concepts into CNNs to improve model performance, particularly in capturing global context through enhanced architectures. These architectures help improve model performance in downstream tasks by effectively utilising the advantages of CNNs and ViTs.

We then initialised the eight architectures using pretrained weights obtained from classification of the large natural-image dataset ImageNet68. Because the classifiers of these networks originally had 1000 output nodes, we reset the number of nodes to seven to fit our dataset. After initialisation and classifier adjustment, rather than fine-tuning only some layers, we trained the entire network end-to-end. Moreover, we performed probability thresholding based on the Softmax outputs.

Explainable artificial intelligence in medical imaging

In medical imaging, Explainable Artificial Intelligence (XAI) is critical because it fosters trust and understanding among medical practitioners and facilitates accurate diagnosis and treatment by elucidating the rationale behind AI-driven image analysis. In this study, we used Gradient-weighted Class Activation Mapping (Grad-CAM)69 to generate a corresponding heatmap for each image, in which red indicates high relevance, yellow medium relevance, and blue low relevance. Grad-CAM visualises the regions of an image that are important for a particular classification. This is crucial in medical image classification, as it helps people understand which parts of the image contribute to the model’s decision and validates whether the model focuses on disease-related features. By providing visual explanations through heatmaps, Grad-CAM can help build trust among medical practitioners and the public in the decisions made by AI systems. However, Grad-CAM depends on the model’s architecture and may not provide fine-grained insights, and in real-time applications or scenarios requiring quick analysis, its computational demands might hinder its practicality.

Development process of models and smartphone applications

All nasal endoscopic images were divided into two parts. The first part contained 38,073 images from SZH and FSH and served as the development dataset for training and validating the models; it was further divided in a 7:1:2 ratio into internal training, internal validation, and internal test sets. The second part contained 1,267 images from LZH and served as the external test set to evaluate the models’ performance in real-world settings and verify their robustness.

Before training the various networks, we resized all images to 224 × 224 × 3. Subsequently, the images were normalised and standardised using the means [0.2394, 0.2421, 0.2381] and standard deviations [0.1849, 0.28, 0.2698] of the three channels. To avoid performance degradation when transitioning from the development set to the external test set, we used online data augmentation and early stopping strategies as well as model calibration techniques. Concretely, we first utilised the transforms library provided by torchvision to automatically transform (RandomRotation, RandomAffine, GaussianBlur, and ColorJitter) the image inputs during training, improving the robustness and generalisation ability of the models. During training, all models uniformly used the cross-entropy loss function, and we employed the AdamW optimiser with an initial learning rate of 0.001, β1 of 0.9, β2 of 0.999, and weight decay of 0.0001 to optimise the eight models’ parameters. We set the number of epochs to 150 and used a batch size of 64 for each model. We then adopted an early stopping strategy: training stopped automatically when the accuracy on the internal validation set (patience = 10, min_delta = 0.001) no longer improved significantly, thereby preventing overfitting. We calibrated each model on its internal validation set using temperature scaling, a standard method for calibrating deep learning models, and assessed calibration performance using the Brier score and Log-Loss. During the validation and inference stages, the image preprocessing pipeline was consistent with the training stage, except that automatic image transformation was no longer applied. We used the PyTorch framework (version 2.1) on a computer running Ubuntu 20.04 with an NVIDIA GeForce RTX 4090 to complete the entire experiment. The weights of all models were saved in ‘.pth’ format.

In this study, we developed a responsive and user-friendly Android application that prioritises maintainability and scalability. Various software engineering principles and practices (e.g. separation of concerns and dependency inversion) were followed to ensure that Nose-Keeper maintains consistent performance and reliability across a diverse range of Android devices with varying hardware capabilities. Additionally, we adopted a responsive layout and conducted multiple rounds of user testing and iterative feedback to ensure that Nose-Keeper’s user interface is simple, easy to use, and adaptable to various screen sizes, orientations, and device states. Using Java for native Android development, we embraced the MVVM design pattern for application modularisation, incorporating bidirectional data binding for seamless synchronisation of UI and data. Our technology stack included Retrofit for network requests alongside third-party libraries such as ButterKnife, Gson, Glide, EventBus, and MPAndroidChart for enhanced functionality and user experience, complemented by custom animations and the NDK for hardware interaction. At the backend, we leveraged SSM (Spring + SpringMVC + MyBatis) and Nginx for a high-performance architecture, used MySQL to manage data, and adopted Redis for caching. The backend and the deep learning model were deployed on a high-performance cloud server (manufacturer: Tencent; equipment type: Standard Type S6; operating system: CentOS 7.6; CPU: Intel® Xeon® Ice Lake; memory: DDR4) with Nginx load balancing to optimise server resource utilisation (see Supplementary Note 1 for details). To ensure the security of the application and of personal privacy data, we used encryption protocols, algorithms, and toolkits that comply with industry standards (see Supplementary Note 2 for details). When Nose-Keeper is used, all input images pass through an image preprocessing pipeline consistent with the model inference stage.

Model evaluation and statistical analysis

For the development datasets (SZH and FSH), the eight models were evaluated using five standard metrics: overall accuracy (Eq. (1)), precision (Eq. (2)), sensitivity (Eq. (3)), specificity (Eq. (4)), and F1-score (Eq. (5)). These five metrics are defined as follows (see Supplementary Note 3 for details).

$$Overall\,accuracy=\frac{True\,Positives+True\,Negatives}{Total\,Samples}$$
(1)
$$Precision=\frac{True\,Positives}{True\,Positives+False\,Positives}$$
(2)
$$Sensitivity=\frac{True\,Positives}{True\,Positives+False\,Negatives}$$
(3)
$$Specificity=\frac{True\,Negatives}{True\,Negatives+False\,Positives}$$
(4)
$$F1-score=2\times \left(\frac{Precision\times Sensitivity}{Precision+Sensitivity}\right)$$
(5)

To avoid performance uncertainty caused by random splitting of the development dataset, we used a fivefold cross-validation strategy to evaluate the potential of the various models on the development dataset, and then selected from the eight models the four best suited to the smartphone application based on the quality of their metric results. After selecting the four candidate models, we used confusion matrices and Receiver Operating Characteristic (ROC) curves to further evaluate their performance on the external test set (LZH); a larger area under the ROC curve (AUC) indicated better performance. The best model was used to develop the smartphone application. Statistical analyses were performed using Python 3.9. Owing to the large sample size of the internal dataset and the use of fivefold cross-validation, we used the normal approximation to calculate the 95% confidence intervals (CIs) of overall accuracy, precision, sensitivity, specificity, and F1-score. On the external test set, we used an empirical bootstrap with 1000 replicates to calculate the 95% CI of the AUC, and the 95% CIs of overall accuracy, sensitivity, and specificity were calculated using the Wilson score approach in the Statsmodels package (version 0.14.0).

Analysing the robustness of the deep learning models via data augmentation

Testing a model on images subjected to different data augmentations can reveal its adaptability to input changes and thus its robustness70. In particular, data augmentation simulates image transformations that may occur in practical applications, testing the stability and performance of a model when faced with unseen or changing images. This strategy helps developers identify potential weaknesses of the model, guide subsequent improvements, and enhance the model’s reliability in complex and ever-changing environments. We used the external test set to analyse how the model’s predictions change under Gaussian blur, saturation changes, image rotation, and brightness changes. Before testing the model, we augmented the external test set using Pillow (version 9.3.0); for each transformation, we assigned different parameter values to Pillow’s built-in functions, resulting in 12 augmented datasets derived from the external dataset.

Human-machine comparison experiment

The representativeness of the external test set is crucial for a full comparison of the performance of AI and human experts. Therefore, when retrospectively collecting endoscopic images, in addition to ensuring the accuracy of image labels, our expert team fully considered the severity of lesions, the different stages of disease, and differences in appearance in each endoscopic image. Meanwhile, the expert group ensured, as far as possible, the age diversity and gender balance of the entire dataset; notably, the time span of the external test set reached five years. We recruited nine otolaryngologists with different amounts of clinical experience from three institutions: one year (two otolaryngologists), three years (one), four years (one), five years (one), six years (one), eight years (two), and nine years (one). Before the experts independently evaluated the external test set, we shuffled the dataset, renamed each image as “test_xxxx.jpg”, and distributed the images to all experts. We required the experts to evaluate each endoscopic image independently within a specified time frame to simulate the physical and mental stress faced in actual clinical settings, which further reflects the efficiency of AI. Notably, we prohibited the experts from consulting diagnostic guidelines and from communicating with one another. All expert evaluation results were anonymised and automatically verified with a Python program. Finally, we plotted a diagnostic performance heatmap, confusion matrices, ROC curves, and the optimal Youden index to demonstrate, comprehensively and intuitively, the performance differences between the AI and the clinicians in diagnosing the different diseases.