Introduction

Leprosy is a chronic contagious disease that can cause irreversible nerve damage. Its etiological agents are Mycobacterium leprae and Mycobacterium lepromatosis, which have a long incubation period of 2 to 5 years1. They mainly attack Schwann cells in the peripheral nerves2 and skin cells, causing dermatoneurological signs and symptoms3, which are essential for the clinical diagnosis of leprosy4.

Leprosy diagnosis faces some limitations. The disease can mimic rheumatological pathologies such as inflammatory arthritis, nonspecific arthritis, and vasculitis5, diabetic and amyloid neuropathy6, or dermatological diseases such as lupus7, mycosis fungoides8 and psoriasis9. Moreover, leprosy is often not suspected, as it is no longer emphasized in medical curricula10. Although there is a preeminent serological immunoassay for leprosy11, no laboratory test can make the diagnosis of the disease alone1. A study conducted in an emergency room12 with patients who had been misdiagnosed at referral showed that leprosy can mimic acute arterial occlusion, acute coronary syndrome, deep vein thrombosis, and venous ulcer, corroborating the fact that many professionals lack training in leprosy diagnosis and indicating possible serious cases of undiagnosed leprosy among emergency room patients.

In 2019, more than 200,000 new cases of leprosy were reported globally to the World Health Organization (WHO)13. One of the four strategic pillars of the WHO goal of a zero-leprosy world is integrated active case detection14. Several active search strategies are usually employed by healthcare professionals or trained volunteers, such as door-to-door searches15,16, evaluation of household contacts17, screening among schoolchildren18, and searches in prison populations19,20.

The Leprosy Suspicion Questionnaire (LSQ) is composed of 14 questions that are normally asked during a medical appointment and cover both neurological and dermatological symptoms. The LSQ is a screening tool for the most common signs and symptoms related to leprosy, from its early to late stages. By applying it in the community, we aim to inform the population about the disease (health education) and to select individuals suspected of having early neurological symptoms, significantly increasing the chance that they are recognized as having the disease (screening of early cases). The LSQ is a patient-screening questionnaire, and it has proven to be an easy-to-use, low-cost tool that can be filled in either by a health worker or by the individuals themselves, whose answers may indicate the need for clinical evaluation by specialists. It was developed by the team of the National Referral Center for Sanitary Dermatology and Hansen's Disease during several active search campaigns19,20,21. The studies that followed those campaigns revealed different answer patterns between healthy individuals and new cases, in which the neurological symptoms were more informative than the cutaneous signs. Machine learning classifiers are computational tools that handle such patterns very well.

Since the dawn of artificial intelligence, different applications and studies employing machine learning techniques have been developed in several areas of healthcare. These techniques have been applied to predict sepsis in ICU patients22,23, heart disease24,25,26, Parkinson's disease stage27,28,29, cancer30,31,32,33, and so on, as well as in preventive medicine34,35,36.

Machine learning techniques have also been useful for classification and analysis in leprosy studies37. For instance, AI4Leprosy, an image-based artificial intelligence (AI) diagnosis assistant, combined convolutional neural networks with traditional methods such as logistic regression, random forest, and XGBoost38, achieving an AUC of 98.74%. Another remarkable application of AI was the clustering of vulnerable regions susceptible to a leprosy endemic in northern Brazil, using Self-Organizing Maps (an unsupervised learning technique) on epidemiological data from a Geographic Information System39. An AI application for leprosy screening was also developed in Brazil using data from the National Notifiable Diseases Information System (SINAN), but it was meant only to distinguish paucibacillary from multibacillary cases, not to screen patients in an active search for new cases40. An interesting approach predicted new cases among household contacts using a random forest classifier with molecular and serological results as input41. However, none of these solutions was built with the sole purpose of screening; they were intended rather to help doctors in diagnosis or to select regions for active search campaigns.

Our work applies machine learning algorithms to screen individuals for leprosy based on how they filled in the LSQ; we call this tool Machine Learning for Leprosy Suspicion Questionnaire Screening (MaLeSQs). We chose the best among four classifiers, each representing a different paradigm: the kernel-based Support Vector Machine, the regression-based Logistic Regression, the tree-ensemble Random Forest, and the boosting-based XGBoost. Data cleaning was performed to handle missing values, and an exploratory data analysis was conducted to help readers understand the data and build a more solid foundation for the machine learning approach. A preprocessing stage comprised data augmentation by crossing questions, the Synthetic Minority Oversampling Technique (SMOTE) for class balancing, and association analysis with the phi coefficient. Boruta was implemented to remove noisy attributes. Hyperparameter optimization was performed by exhaustive search over pairs of hyperparameters. The metrics used to evaluate classification were sensitivity, specificity, precision, negative predictive value, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC). Shapley values42 were calculated for each classifier to better understand the classification process43 and to bring insights about the LSQ and its relation to the disease. Finally, we discuss and compare our results with other screening strategies based on the new case detection rate (NCDR) and with other machine learning models built to assist leprosy diagnosis.

Methods

In this section we describe the methodology employed to obtain the best classifiers. We first introduce the Leprosy Suspicion Questionnaire, the studied population, and the adaptations needed to fit old questionnaires to the registered version (Sects. 2.1 to 2.3). We then present the implementation environment used to train MaLeSQs (Sect. 2.4), describe the preprocessing steps (Sects. 2.5 to 2.7), and the machine learning stage (Sect. 2.8); these can be followed with the aid of Fig. 1.

Fig. 1

Design of the machine learning algorithms. Missing data was dropped from the dataset. A data augmentation process was implemented by combining questions. The dataset was split into training and test sets in the proportion of 80:20. Based on the Φ coefficient (see the supplementary material for its definition), variables with high associations were dropped from the training set, and the same variables were dropped from the test set. SMOTE was applied to a copy of the training set, and only the variables accepted by Boruta were kept in both the training and test sets. Four classifiers, Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF) and XGBoost (XGB), had their hyperparameters optimized (HyperOp) within a pipeline with SMOTE. The trained models were used to classify the test set, and their predictions were compared to the true values, generating the metrics used to compare performance.

The leprosy suspicion questionnaire

The 14 questions of the LSQ are shown in Table 1. They cover dermatological symptoms (q6, q8, q9, q10 and q13), neurological symptoms (q1, q2, q3, q4, q5, q7, q11 and q12), and the leprosy contact factor (q14). These questions are treated as variables in the next sections and are therefore written as such in this text. The registered questionnaire can be downloaded at https://www.crndsh.com.br/qsh in both Portuguese and English. Each question takes a marked or unmarked answer: if a question was marked, it received the value 1; if left blank, 0.

Table 1 The 14 questions of the registered leprosy suspicion questionnaire.

Study population

The dataset contained 1,842 instances. Because the LSQ had been applied in several situations, the database came from five different events: three campaigns to identify leprosy patients in the Ribeirão Preto region, state of São Paulo, Brazil19,20,21 (Jardinópolis City, J; the Female Prison in Ribeirão Preto, FPRP; and the Center of Penitentiary Progression of Jardinópolis, CPP); a screening process carried out in Ribeirão Preto City during a professional training in leprosy diagnosis (PTLD)44; and patients from a private clinic (PC). These campaigns pursued an active search for leprosy patients. The distribution of LSQs from each event is shown in Fig. 2. The clinical diagnosis of every participant in this study was made by a specialist in leprosy. The entire dataset was anonymized, and only the variables related to the LSQ were kept.

Fig. 2

Distribution of patients and healthy individuals from the five campaigns whose data was available for the present study. In gray, all 22 individuals from the private clinic (PC) were patients. In light orange, of the 34 individuals screened during the professional training in leprosy diagnosis (PTLD), 26 were healthy and 8 were patients. In light blue, of the 487 individuals who filled in the LSQ in the city of Jardinópolis (J), 423 were healthy and 64 were patients. Of the 404 individuals from the female prison in Ribeirão Preto (FPRP), shown in blue, 390 were healthy whereas 14 were patients. Finally, 895 inmates from the Center of Penitentiary Progression (CPP), shown in dark blue, were screened, leading to 31 new cases and 864 healthy individuals. All campaigns combined resulted in 139 new case detections and 1,703 healthy individuals, portrayed in red and orange, respectively.

Of the 1,842 participants, 1,089 were male and 753 were female. The mean age of 1,839 participants (3 were missing) was 36 years, with a standard deviation of 15 years; the minimum age was 2 and the maximum 94, with a median of 33 years, first quartile of 25 years, and third quartile of 44 years. Among the new cases detected, 122 diagnoses were borderline leprosy, 7 were tuberculoid, 5 were neural, 1 was indeterminate, and 4 were blank. The NCDR within the studied population was 7.5%.

Adapting LSQ

All data was provided in Excel spreadsheets. Across the five events where the LSQ was applied, two different versions were used, both slightly different from the registered version; they are shown in Supplementary Tables 1 and 2. Following specialist advice, the first version of the LSQ was adapted as follows: questions 1, 2 and 3 were kept in the same order, being exactly the same questions as in the registered version; question 4 was assigned an empty value, because a question about muscle cramps was not part of this version; questions 5, 4, 7, 6, 8 and 9 followed in this sequence, as they present the same statements as those of the registered version in this order; questions 10 and 11 were merged, because the registered version has a single question containing both statements; question 12 is the same as in the registered version; and questions 13 and 14 were merged, because one registered question contains both assertions. To merge questions, we used an OR (∨) operator. The second version of the LSQ only required reordering, since its questions are the same as those of the registered version but in a different order: questions 1, 2, 5, 4, 3, 6, 8, 7, 9, 10, 12, 13, 14 and 16; question 15 is not present in the registered version. The adaptations described above are summarized in Table 2. They were made directly in the latest available version of the Excel file, and a CSV version was exported.
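As an illustration of the adaptation step, the sketch below shows the OR-merge in pandas; the file name and column names are hypothetical, not the labels used in the original spreadsheets.

```python
import pandas as pd

# Minimal sketch of the version-1 adaptation; names are hypothetical.
df = pd.read_csv("lsq_version1.csv")

# OR-merge: the registered question is marked (1) if either of the two
# original statements was marked.
df["q10_registered"] = (df["q10"] | df["q11"]).astype(int)
df["q13_registered"] = (df["q13"] | df["q14"]).astype(int)

# Version 1 has no muscle-cramp question, so registered q4 stays empty.
df["q4_registered"] = pd.NA
```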

Table 2 The adaptations made in the different versions of the LSQ to match the registered version.

Implementation environment

Most of the code was implemented in Python 3.9 on the latest available version of JupyterLab in the Anaconda environment. The libraries employed in this study were Numpy 1.21, Pandas 1.4, Matplotlib 3.5, Sklearn 1.0, XGBoost 1.5, Imblearn 0.7, and Shap 0.7. To implement Boruta, we used the R language with the Boruta 7.0 package. A simplified version of the notebook will be made available on demand. Everything ran on a machine with the Windows 10 Pro 64-bit operating system, an AMD Ryzen 5 3350G processor, and 16 GB of RAM. The implementation of the code took around one month to complete, and the experiment took two days to run. Table 3 summarizes the experimental setup.

Table 3 Details of the implementation environment.

Missing data and exploratory data analysis

Due to the necessary adaptations, and because most LSQs came from version 1, two questions were almost entirely blank. As questions 4 and 14 presented 98.2% missing data, both variables had to be dropped. Consequently, muscle cramps were not evaluated as a symptom linked to leprosy, and the contact factor addressed by question 14 was set aside, leaving an opportunity for future research. Finally, the chi-square test was used to verify associations between each question and the diagnosis made by the specialists.
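A minimal sketch of this per-question association test, assuming a DataFrame df with the 0/1 answers and a hypothetical binary diagnosis column:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# "diagnosis" is 1 for a new case and 0 for a healthy individual.
for q in [c for c in df.columns if c.startswith("q")]:
    table = pd.crosstab(df[q], df["diagnosis"])   # 2x2 contingency table
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"{q}: chi2 = {chi2:.2f}, p = {p:.4f}")
```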

Data augmentation and train-test split

To enrich the dataset, a data augmentation process was implemented. It consisted of creating new variables by combining pairs of questions with an AND operator: if both questions qi and qj were marked (i.e., had the value 1), a new variable qiqj was created with value 1; otherwise, qiqj received the value 0. This was replicated for every pair of questions in the dataset, resulting in 78 variables in total.
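A minimal sketch of this augmentation, assuming the same DataFrame df as above:

```python
from itertools import combinations

# After dropping q4 and q14, 12 single questions remain; the C(12, 2) = 66
# pairwise AND combinations plus the 12 originals give the 78 variables.
questions = [c for c in df.columns if c.startswith("q")]
for qi, qj in combinations(questions, 2):
    df[f"{qi}{qj}"] = (df[qi] & df[qj]).astype(int)  # 1 only if both marked
```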

After this process, we stratified and split the dataset into training and test sets in the proportion of 80:20. This resulted in 1,362 healthy individuals and 111 new cases in the training set, and 341 healthy individuals and 28 new cases in the test set. All subsequent processing steps were computed on the training set and replicated on the test set, except for SMOTE, which was applied only to the training set.
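The split can be reproduced with scikit-learn; the random seed is an assumption for reproducibility:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]

# Stratification keeps the healthy/new-case proportion in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```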

Balancing the data and data-cleaning

The data was imbalanced, with a 12.25:1 ratio. To overcome this problem, the Synthetic Minority Oversampling Technique (SMOTE) was applied only to the training set, i.e., to the 80% split from the previous step. SMOTE was used at two different points in this work: first, before running Boruta to calculate feature importance, as described in the next section; and second, within a pipeline during hyperparameter optimization and model training, to avoid data leakage. After SMOTE, we ended up with 1,251 synthetic new-case samples in the training set. SMOTE was not applied to the 20% test set, whose data was left as collected in the real world.
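A sketch of the first SMOTE application with imblearn, continuing the variables above:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample the 111 minority samples up to the 1,362 majority samples,
# i.e. generate 1,251 synthetic new cases; the test set is left untouched.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train_bal))  # {0: 1362, 1: 1362}
```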

The data augmentation process could induce undesirable associations between variables, degrading the dataset fed to the classification algorithms in later steps. As all variables were binary, we applied the Matthews correlation coefficient (φ) to investigate these associations in the training set. To choose a coefficient threshold for dropping features, we trained the four classifiers without hyperparameter optimization, using SMOTE and Boruta, and tested thresholds from moderate to strong correlation45; the results can be checked in Supplementary Tables 4 and 5. Except for the Support Vector Machine, for which all AUC values were statistically equal, we observed a small drop in AUC for Logistic Regression, Random Forest, and XGBoost as the correlation threshold increased. Comparing 0.65 vs. 0.70, 0.65 vs. 0.75, and 0.70 vs. 0.75 for the four classifiers, the AUCs were statistically equal. Therefore, 0.75 was chosen as the threshold, keeping as many features as possible so that not too much information was lost while still improving the dataset for hyperparameter optimization. For every association with φ ≥ 0.75, at least one of the variables was dropped from the dataset.
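Since the variables are binary, φ equals the Pearson correlation, so the filter can be sketched as:

```python
import numpy as np

# Association matrix on the training set; for 0/1 variables .corr() is phi.
phi = X_train.corr().abs()
upper = phi.where(np.triu(np.ones(phi.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] >= 0.75).any()]

X_train = X_train.drop(columns=to_drop)
X_test = X_test.drop(columns=to_drop)  # same variables dropped from the test set
```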

Also, the data augmentation process may introduce noisy variables. To remove this noise, we used the Boruta method to verify feature importance. The training set first had to be balanced, which was done with SMOTE as described above; only after SMOTE was Boruta applied to the training set. The p-value threshold to keep a variable was set to 0.05, and only the variables accepted by the algorithm were maintained in the dataset. Details on how the Boruta algorithm works can be found in the Supplementary Material. Finally, the data was ready to feed the classifier models for hyperparameter optimization and training.
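The study ran Boruta in R; the sketch below uses the Python BorutaPy port as a stand-in, applied to the SMOTE-balanced, φ-filtered training set:

```python
from boruta import BorutaPy
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Balance the training set first, then run Boruta (alpha = 0.05).
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
boruta = BorutaPy(rf, n_estimators="auto", alpha=0.05, random_state=42)
boruta.fit(X_bal.values, y_bal.values)  # BorutaPy works on numpy arrays

accepted = X_train.columns[boruta.support_]  # keep only Accepted variables
X_train, X_test = X_train[accepted], X_test[accepted]
```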

Machine learning stage

The classifiers and hyperparameter optimization

Four classifiers with different paradigms were employed: the Support Vector Machine (SVM) represented kernel-based models; Logistic Regression (LR) represented linear classifiers; Random Forest (RF) embodied ensemble tree methods; and Extreme Gradient Boosting (XGBoost or XGB) stood for boosting methods. All classifiers were implemented within a pipeline to combine them with SMOTE during hyperparameter optimization and to avoid data leaking into the validation set.
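A sketch of such a pipeline with imblearn, shown for the SVM; the same pattern applies to LR, RF and XGBoost:

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC

# Inside the pipeline, SMOTE is refit on each cross-validation training fold
# only, so no synthetic samples leak into the validation folds.
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", SVC(probability=True)),
])
```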

Hyperparameter optimization is an important stage: better classification means more correct predictions, whereas poor predictions could mean many ill people being classified as healthy, which is unacceptable for the screening purpose of the present study.

Hyperparameters were optimized exhaustively, in pairs, with stratified cross-validation with 5 splits and 20 repetitions. The score used to select the best configuration was the area under the receiver operating characteristic curve (AUROC). The intervals and steps chosen for each hyperparameter of each classifier are shown in Table 4; the hyperparameters were optimized in the order depicted in the table, and identical superscript symbols mean that those hyperparameters were optimized at the same time. A description of each hyperparameter is given in Supplementary Table 3. The hyperparameter pairs for the four machine learning methods were: (1) SVM: C and kernel (had the best kernel been polynomial, we would have optimized the degree as well, which was not the case in the present work); (2) LR: C and penalty; (3) RF: number of estimators (n_estimators) and maximum tree depth (max_depth); minimum samples to split a node (min_samples_split) and maximum features per tree (max_features); minimum samples per leaf (min_samples_leaf) and maximum samples drawn during bootstrap (max_samples); and maximum number of leaf nodes (max_leaf_nodes); (4) XGBoost: learning rate (learning_rate) and number of estimators (n_estimators); maximum tree depth (max_depth) and minimum sum of instance weight in a child node (min_child_weight); subsample ratio of instances (subsample) and subsample ratio of columns (colsample_bytree); minimum loss reduction required to partition a node (gamma) and L2 regularization (lambda); and L1 regularization (alpha). To ensure that good values were chosen, we plotted the entire range (see Supplementary Fig. 3) and avoided picking values at the extremes of the interval for the first pair of hyperparameters being optimized; no further manual adjustments were made, to prevent overfitting46 and to avoid local optima47 during this validation step.
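A sketch of one optimization round for the first SVM pair (C and kernel); the grid values are illustrative placeholders for the Table 4 intervals:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=42)
grid = GridSearchCV(
    pipe,                                     # the SMOTE + SVC pipeline above
    param_grid={"clf__C": np.logspace(-3, 2, 25),
                "clf__kernel": ["linear", "rbf", "poly"]},
    scoring="roc_auc",                        # AUROC as the selection score
    cv=cv,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
# The best pair is fixed before the next pair of hyperparameters is searched.
print(grid.best_params_)
```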

Table 4 Entry parameters for hyperparameter optimization from exhaustive search.

Evaluation of classifier performance and explaining the classification

After hyperparameter optimization, the trained classifier models were applied to the test set, and their classifications were compared to the diagnoses given by the specialists. Classifier performance was measured with metrics derived from the confusion matrix: sensitivity, specificity, precision, and negative predictive value. The ROC curve and AUROC were also reported to show the predictive power of the models. We also translated the sensitivity into an NCDR, enabling comparison with other studies.
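A sketch of how these metrics follow from the confusion matrix, with model standing for any of the four trained pipelines:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_pred = model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

sensitivity = tp / (tp + fn)   # reported as the NCDR among screened subjects
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)
npv         = tn / (tn + fn)   # negative predictive value

auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```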

To better understand what happens inside each classifier, Shapley values were calculated. With these values, it was possible to draw important insights from the classification and to better understand how each model uses the top 10 most important features in the dataset to predict the outcome.
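A model-agnostic sketch with the shap library; the background sample size is an assumption to keep the computation tractable (TreeExplainer is an alternative for RF and XGBoost):

```python
import shap

background = shap.sample(X_train, 100)                 # background data
explainer = shap.KernelExplainer(model.predict_proba, background)
shap_values = explainer.shap_values(X_test)

# Beeswarm plot of the positive (new case) class, top 10 features only.
shap.summary_plot(shap_values[1], X_test, max_display=10)
```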

Role of the funding source

The funder of the study had no role in the study design, data collection, data analysis, data interpretation, or writing of the report. All authors had full access to all the data in the study and had final responsibility for the decision to submit for publication.

Results

Exploratory data analysis and preprocessing

A chi-square test was performed to compare the LSQ results between healthy individuals and Hansen's disease new case detections. The test shows that questions q8 (p = 1.0000), q10 (p = 0.1627) and q13 (p = 0.8880) did not present a statistically significant difference (Table 5). We nevertheless kept these three variables in the dataset to preserve as much information as possible at this stage of the analysis. Moreover, the nine questions with p-value < 0.05 strongly indicate that the LSQ is a useful patient screening tool, given the remarkable statistical difference between groups.

Table 5 Number of positive questions answered in the LSQ by the whole group of individuals, the healthy group and new cases.

The association matrix built with the φ coefficient can be seen in the heatmap in Supplementary Fig. 1. Dropping variables with φ ≥ 0.75 reduced the number of variables in the dataset from 78 to 54.

To reduce the computational cost of hyperparameter optimization, Boruta feature selection was implemented. Only the accepted variables were kept, resulting in a dataset with 38 variables; these results are shown in Fig. 3. All the single questions were accepted, showing a different angular coefficient with respect to feature importance compared to the combinations of questions. We leave this door open for future research.

Fig. 3

Results of the Boruta feature selection applied to the dataset. The blue boxplots show the importance of the shadow variables (see the Supplementary Material for more on Boruta and the shadow variables), the yellow ones the tentative variables, and the green ones the accepted variables. Only the variables marked as Accepted were kept in the dataset.

Classifiers and hyperparameter optimization

Table 6 depicts the best value chosen by grid search for each hyperparameter. The best hyperparameters for SVM were C = 0.0464 and the radial basis function (rbf) kernel. The best hyperparameters for LR were C = 0.00202 and the L2 penalty. For Random Forest, they were 450 estimators (trees) with a maximum depth of 5, a minimum samples split of 100, a maximum of 6 features, a maximum samples rate of 0.3, a minimum of 10 samples per leaf, and a maximum of 9 leaf nodes. For XGBoost, the best learning rate was 0.0001 with 100 estimators, a maximum tree depth of 6, a minimum child weight of 5, a subsample rate of 0.5, a column subsample rate per tree of 0.5, gamma of 0.2, lambda of 1.0, and alpha of 0.02.

Table 6 Results from hyperparameter optimization from exhaustive search.

Performance of the classifiers

The metrics used to measure performance were derived from the confusion matrix, together with the ROC curve and its respective area. In Table 7 we show sensitivity, specificity, precision, and negative predictive value (NPV) for each classifier. SVM achieved a sensitivity of 85.7%, a specificity of 69.2%, a precision of 18.6%, and an NPV of 98.3%. Logistic Regression achieved a sensitivity of 60.7%, a specificity of 80.7%, a precision of 20.5%, and an NPV of 96.2%. Random Forest reached a sensitivity of 75.0%, a specificity of 76.0%, a precision of 20.4%, and an NPV of 97.4%. Finally, XGBoost obtained a sensitivity of 67.9%, a specificity of 77.7%, a precision of 20.0%, and an NPV of 96.7%. As we sought a balance between sensitivity and specificity, the best classifier by these metrics was SVM, even though its precision was low; since the tool is intended for screening only, some false positives are preferable to the risk of more false negatives.

Table 7 Metrics obtained applying the trained models on the test set.

In Fig. 4 we show the ROC curve for each classifier, with the respective area reported in the legend. From these results, the classifier with the strongest predictive power is SVM, with the largest AUROC of 0.776, 1.7% larger than the second-placed XGBoost and 8.7% larger than the worst classifier, Logistic Regression.

Fig. 4

ROC curve for each classifier applied to the dataset. In red is the performance of the SVM, in blue that of the Logistic Regression (LR), in green that of the Random Forest (RF), and in yellow that of the XGBoost (XGB).

Interpretability of models with Shapley values

Figure 5 shows the Shapley values for each classifier. A high value in the colorbar means a marked answer to the question; a low value, an unmarked answer. Two aspects of these results are worth highlighting. First, the top 10 most important variables were basically the same for all classifiers: q1, q2, q3, q5, q6, q7, q8, q11 and q1q7 appeared in the top 10 of all four classifiers, while q9 appeared in two (SVM and XGB) and q3q5 also in two (LR and RF). Second, counterintuitively, for some questions, notably q8, q9 and q11, a positive response had a negative impact on the model: when individuals presented those symptoms, the model tended to classify them as healthy. This happens because these three questions concern more advanced symptoms of the disease.

Fig. 5

Shapley values for each classifier. On the left of each picture are the individual Shapley values for every participant in the study, where red means a marked question and blue an unmarked question. On the right are the means of the absolute Shapley values across participants for the respective variable. In (a) we show the values for SVM, in (b) for Logistic Regression, in (c) for Random Forest, and in (d) for XGBoost. The +0 that appears in the bar graphs is a rounding artifact of the Shap library; the actual values are those on the x-axis.

Discussion

In past years, several works applying machine learning to healthcare questionnaires with yes-no answers have been presented to the community48,49,50,51. Many studies exploited feature selection52,53,54, numerous classification works implemented SMOTE to achieve better predictions55,56,57, and others brought more interpretability to their models with Shapley values58,59,60,61. However, none of them addressed Hansen's disease and the Leprosy Suspicion Questionnaire.

Leprosy is an underdiagnosed disease. In 2022, the LSQ was recognized by the Brazilian Health Ministry as a successful experience in the leprosy field62, as a tool to improve diagnosis and health education. Furthermore, the Brazilian government adopted the LSQ as part of its active search effort for new leprosy cases63, and Brazilian Primary Care Units are prepared to apply it in their clinical routine as a checklist of signs and symptoms that helps recognize leprosy. In addition, several studies and municipal health promotion strategies addressing communities employ the LSQ as a tool for the active search for new leprosy cases19,20,21,44,64,65,66,67,68,69,70.

Our results point out the strength of MaLeSQs in applying machine learning algorithms to the analysis of LSQ responses for leprosy screening. The good balance of sensitivity and specificity achieved by the classifiers, whose sum is greater than 1.5 (ref. 71), is evidence that the test is useful, and an acceptable AUC value (0.7-0.8)72 supports the quality of the model. The low precision is tolerable: even though healthy people might be alarmed by a false positive, they will seek health assistance, where health workers can evaluate them and educate them about the disease. It is not necessary to fully understand the machine learning models to employ them; correctly applying the LSQ and instructing people on how to fill it in is enough.

The LSQ is easy to distribute, and people can fill it in by themselves or with the aid of a community health agent. Moreover, there is no need to perform calculations to decide whether an LSQ is positive or negative; the classifier does this by itself. The model encapsulates all the knowledge about the LSQ acquired during the campaigns and can easily be distributed via a web application, reaching distant places at relatively low cost, with no need to mobilize an expensive field staff working exclusively on patient screening.

Based on the values in Table 7, it is possible to calculate the relative risk (RR). A person classified as a New Case (LSQ+) by the SVM is 10.9 times more likely to have leprosy than one classified as a Healthy Individual (LSQ-). For Logistic Regression, an LSQ+ is 5.4 times more likely to have leprosy than an LSQ-; for Random Forest, an LSQ+ is 7.8 times more likely to be a New Case than an LSQ-; and an LSQ+ by XGBoost is 6.1 times more likely to be a new case. Therefore, an LSQ+ by any of the classifiers employed in this study carries a high risk (RR > 2.0) of being a new case.
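As a check of this arithmetic, treating a positive classification as the exposure, the RR follows directly from the precision (the positive predictive value) and the NPV reported in Table 7:

$$\mathrm{RR}=\frac{P(\text{leprosy}\mid \mathrm{LSQ+})}{P(\text{leprosy}\mid \mathrm{LSQ-})}=\frac{\text{precision}}{1-\mathrm{NPV}},$$

which for the SVM gives 0.186/(1 − 0.983) ≈ 10.9, and analogously for the other classifiers.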

During the campaigns in the CPP19, the FPRP20 and Jardinópolis21, a positive LSQ was the screening criterion, where LSQ+ meant at least one marked question. Applying the same criterion to the present dataset would yield a sensitivity of 92.9% and a specificity of 64.8%. This sum of 1.58 is above the usefulness threshold for a healthcare test, whereas with machine learning we achieved sums of 1.55 with SVM and 1.51 with Random Forest; the LSQ alone is thus already a powerful tool to screen leprosy patients. However, the 7.2-percentage-point decrease in sensitivity meant only 2 more subjects classified as false negatives, while the 4.4-percentage-point increase in specificity meant 15 more subjects classified as true negatives (Supplementary Fig. 2 shows the confusion matrices of both cases). Those 15 subjects classified as LSQ- would not be called for a consultation or further investigation by clinicians and specialists, saving money and many medical hours, considering the long time needed to diagnose leprosy.

A systematic review of the diagnostic accuracy of tests for leprosy73 included 78 studies evaluating the detection of IgM antibodies against phenolic glycolipid I by ELISA, qPCR, and conventional PCR. The sensitivities were 63.8% (95% CI 55.0-71.8), 78.5% (95% CI 61.6-89.2) and 75.3% (95% CI 67.9-81.5), respectively; the specificities were 91.0% (95% CI 86.9-93.9), 89.3% (95% CI 61.4-97.8) and 94.5% (95% CI 91.4-96.5), respectively. SVM achieved a higher sensitivity (85.7%) than all of those traditional tests, and RF presented an equivalent sensitivity (75.0%). Despite the lower specificities (69.2% for SVM and 76.0% for RF), the combination of the LSQ with machine learning algorithms requires neither blood samples, expensive equipment, nor highly trained personnel, demonstrating its high applicability in guiding leprosy diagnosis.

The work closest in goals to ours is AI4Leprosy38. Although it achieved good results, with an AUC of 98%, a sensitivity of 89% and a specificity of 100%, it used only 222 patients and focused mostly on skin lesions, using pictures taken under controlled conditions with DSLR cameras in a photographic studio. In this study, by contrast, we used the LSQ, a questionnaire with simple questions that can be filled in with the aid of a health professional or by the individuals themselves. Moreover, because AI4Leprosy was an image-based screening tool, it required neural networks (Inception-V4 and ResNet-50) combined with machine learning methods (elastic-net logistic regression, XGBoost and Random Forest) and cloud computing for training38, while our questionnaire-based approach was trained only with machine learning methods on a common desktop. Our tool was also able to differentiate healthy individuals from new cases even when they did not present any skin lesion, by focusing on the early neurological symptoms of leprosy74. MaLeSQs is thus an economical tool for screening in remote regions or through telemedicine75,76,77, a low-cost, accessible option for public healthcare systems covering large geographical regions. Note that MaLeSQs may also be integrated alongside an image analyzer78.

We used a high-prevalence study population (7.5%), which probably caused the high NCDRs of 85.7% for SVM, 60.7% for LR, 75% for RF and 67.9% for XGB, i.e., the sensitivity of each algorithm. A study in Ethiopia conducted full village surveys in selected villages, screening for skin lesions suspected of leprosy and including all household contacts, and achieved an NCDR of 9.3 per 10,000 population79. Another study in Cambodia, among household contacts and neighbor contacts who presented clinical signs of leprosy followed by a leprologist evaluation, achieved NCDRs of 25.1 per 1,000 and 8.7 per 1,000 population, respectively80. A community-wide screening in Malaysia, where medical officers performed active case detection looking for abnormal skin changes, achieved an NCDR of 722 per 10,000 population (n = 6/83)81. Therefore, further field studies should verify whether this high performance is sustained when these algorithms are used as a screening tool. In addition, geostatistical data might be used to select regions with a higher probability of finding new cases.

The Shapley values for feature importance brought great insights into the interpretation of our results. Although the ratio of dermatological-and-other-signs questions to neurological ones in the questionnaire was 5:7, the ratio of neurological to other questions within the top 10 features selected by importance varied from 7:3 (p < 0.0001) to 8:2 (p < 0.0001), a statistically significant difference. We can infer that neurological symptoms had greater importance than other symptoms for the classifiers when both types are evaluated together. Moreover, questions related to a high degree of physical incapacity or to more advanced disease within the spectrum (notably q8, q9 and q11) were more associated with negative Shapley values. Those symptoms are therefore not the best ones for early diagnosis82, indicating that the association of the LSQ with machine learning algorithms may be ideal for screening for initial symptoms.

For further studies, a wider application of the LSQ is indicated. Our dataset was limited to a Brazilian region where leprosy is endemic, so different outcomes may be expected in regions of low endemicity. Results may also vary in countries and regions with populations outside this age range, given that the prevalence of peripheral neuropathy is higher among the elderly83, and with different distributions across the leprosy spectrum, considering that people at different points of the spectrum present different signs and symptoms74. In future work, we intend to validate MaLeSQs on external datasets with different populations to ensure generalizability across regions. Unsupervised feature selection53 instead of Boruta might lead to a more balanced choice of questions in this part of the experiment. Frameworks that optimize SVM parameters and select optimal features, such as JASMA-SVM84, are interesting and innovative approaches to improve performance. It would also be compelling to compare the results obtained in this study with LSQs filled in by people with diabetes, carpal tunnel syndrome, and other diseases that may affect the peripheral nervous system.