Introduction

Parkinson’s disease (PD) is a neurodegenerative brain disorder caused by the death or impairment of specific midbrain neurons and is mainly categorized as a “movement disorder”1. It is primarily caused by the degeneration of dopamine-producing neurons in the substantia nigra (SN) of the brain. This degeneration significantly reduces dopamine levels, which disrupts motor control and leads to symptoms such as tremors, muscular rigidity, bradykinesia, postural instability, and difficulty walking2. Dopamine is a chemical messenger that helps produce smooth and coordinated muscle movement by transferring signals between the SN and another area of the brain called the corpus striatum3. In motor control, dopamine enhances communication between the SN and the striatum along the nigrostriatal pathway. This pathway regulates motor circuit activity, allowing for the smooth initiation of voluntary movements. Cognitive and behavioral problems, including dementia and depression, may develop as the disease progresses2.

Beyond motor symptoms, PD can also manifest with various non-motor issues. Rigidity, speech disorder, rest tremors, constipation, instability in gait, rapid eye movement (REM), postural instability, and bradykinesia are examples of motor symptoms. In contrast, non-motor symptoms include neurobehavioral and mood disorders, sleep disturbances, cognitive decline, and sensory issues4. These symptoms are primarily due to a lack of dopamine, a neurotransmitter in the brain that regulates coordination and motion. Understanding the significance of dopamine insufficiency is critical for creating successful therapies for PD. A vital treatment is dopamine replacement therapy, which supplies a precursor to dopamine and helps refill its levels in the brain, alleviating motor symptoms. However, a cure for the condition is still not available5. Scientists are conducting research to improve patient care, develop earlier diagnosis methods, and create more effective treatments for the illness6,7.

Early diagnosis of PD, ongoing monitoring of the condition’s severity (i.e., preserving the health of the brain’s neurons), and tracking the progress of treatments over time can help identify appropriate treatments to halt the progression of the disease. However, diagnosing PD can be challenging due to its complexity and the need for clinical expertise, which may result in misdiagnosis8. Therefore, novel, less expensive, simpler, and more reliable approaches to Parkinson’s diagnosis and treatment should be developed9,10. Researchers have employed several machine learning (ML) techniques to build intelligent systems that can accurately diagnose PD across various datasets. ML approaches automatically apply algorithms to datasets to retrieve valuable patterns. Typically, these approaches are used to train a computer-aided diagnosis system to classify previously unseen data instances. ML approaches are categorized into supervised learning (e.g., classification and regression), unsupervised learning (e.g., clustering), and reinforcement learning (e.g., Q-learning)11. In the context of PD diagnosis, ML approaches may assess patient data such as motor symptoms, imaging scans, and genetic data to recognize patterns over time and estimate disease progression7,12.

Furthermore, some researchers have used a family of ML methods called deep learning (DL) to address the challenges of ML by automating feature extraction. DL uses multilayer artificial neural networks (ANNs) and excels at extracting useful features from diverse data such as neuroimaging data, motor symptoms, and time series13,14. This capability is beneficial during PD diagnosis, where slight alterations in motor functions or brain activity might be challenging to recognize manually. Popular DL models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have demonstrated potential in processing multiple types of patient data for PD diagnosis15,16,17. By comparing the efficacy of several ML and DL algorithms in diagnosing PD, we intend to emphasize their potential to enhance early diagnosis, track disease progression, and assess the treatment process.

Thorough systematic literature reviews (SLRs)18,19 of ML-based PD diagnosis that emphasize improvements, research questions, future works, and PD detection directions are rare. This paper not only reviews the application of well-known ML algorithms in detecting Parkinson’s patients but also studies popular datasets and approaches and compares their limitations and specifications. The increasing usage of artificial intelligence (AI) in medicine and disease detection has shaped the scheme of our research on diagnosing PD. Ultimately, our research may assist in developing more precise and reliable diagnostic methods for PD, improving patients’ quality of life as well as health conditions. We address the following research questions to synthesize the relevant information and efficiently convey knowledge to the research community:

  • What types of data sets are used to diagnose PD?

  • Which category of data sets is used the most to diagnose PD?

  • What tools are used the most in assessing ML approaches in diagnosing PD?

  • What evaluation metrics are commonly used to assess the ML techniques in diagnosing PD?

  • What ML algorithms have been considered the most in diagnosing PD?

  • What validation methods are used in studies diagnosing PD with ML?

  • What are the major challenges, future trends, and open issues in diagnosing PD with ML?

Our investigation highlights the growing interest and rapid expansion of studies incorporating ML in diagnosing PD. The utilization of ML technology within healthcare systems, particularly in the context of neurodegenerative diseases such as PD, is becoming increasingly necessary as it continues to improve. In particular, this literature review offers valuable information for:

  • Healthcare and ML researchers: We offer a thorough review of the current state of ML approaches and methods for PD diagnosis, providing insights for those interested in further exploration in this field.

  • Neurologists and medical professionals: Understanding ML-based PD diagnosis methods, strategies, and tools could prove helpful for the diagnosis of neurodegenerative diseases.

The structure of the remaining sections of this SLR is outlined in Fig. 1, and the key abbreviations are defined in Table 1. The Related Work section discusses related reviews and motivations. The research method, including the selection procedure and research questions, is covered in the Research Methodology section. The Classification of the Selected Studies section provides a comprehensive analysis and classification of the selected papers, emphasizing their primary advantages, disadvantages, and evaluation factors. An examination of the findings is given in the Analysis of Results section, and potential trends and outstanding issues are given in the Open Issues, Challenges, and Future Trends section. Threats to validity and limitations are covered in the Threats to Validity and Limitations section. The conclusion is given in the final section.

Fig. 1 Structural map of this SLR.

Table 1 List of abbreviations

Related work

This part of the paper investigates important reviews of the existing datasets and ML approaches for diagnosing PD. The mentioned studies are evaluated in terms of main idea, taxonomy, paper selection procedure, database investigations, open issues, and covered years, and their summary is also provided. The outcomes are listed in Table 2.

Table 2 Summary of related works

Khachnaoui et al.20 assessed ML and DL-based computer-aided diagnosis methods for PD, along with introducing single photon emission computed tomography (SPECT) and positron emission tomography (PET) for detection. They discussed the pros and cons of hand-crafted ML techniques, concluding that DL approaches offered the most robust and dependable solution for feature extraction in the diagnostic domain. Additionally, Salari et al.21 conducted an SLR to evaluate the effectiveness of ML methods in detecting PD cases until the end of 2020. This review involved seven distinct phases, wherein they categorized various approaches and databases, providing statistical insights. Ultimately, their findings suggested that ML approaches offered practical utility in diagnosing PD.

In another study, Tanveer et al.22 offered a review of papers from 2013 to 2021 that focused on using deep neural networks (DNNs) and ANNs to detect PD. The authors analyzed various data modalities and model performances, finding that convolutional recurrent neural network (CRNN) models excelled in accuracy on time-series data. They also concluded that the performance of ANNs improved when clinical features were incorporated. Lastly, they outlined the pros and cons of these models and suggested topics for future research. Further, Sigcha et al.23 assessed 69 papers on using ML and DL to analyze motor and non-motor data from wearables. The selected papers were from 2012 to 2022, focusing on monitoring and diagnosing PD. They also provided some future trends and challenges.

Additionally, Skaramagkas et al.24 presented a systematic review of DL approaches that discriminate PD effectively based on motor symptoms. They investigated 87 papers on DL from 2016 to 2023, considering speech, upper limb, facial expression, and gait data. Moreover, Amato et al.25 reviewed acoustic features and ML methods for detecting PD. They analyzed 102 papers published from 2017 to 2022, focusing on statistically evaluating the applied techniques and algorithms. Khanna et al.26 investigated the application of neuroimaging and ML methodologies in the diagnosis of different disorders, such as PD, Alzheimer’s disease (AD), and schizophrenia. The distinctive aspect of this study was its incorporation of the most recently published scholarly publications.

Keserwani et al.27 examined various ML, meta-heuristic, and DL models for diagnosing PD. Additionally, they improved the accuracy of existing models by utilizing speech signal datasets. Furthermore, potential future directions were explored. Also, Islam et al.28 analyzed ML and DL methods for diagnosing PD, specifically looking at handwriting and wave databases. They explored different algorithms and the nuances of biomarkers to improve diagnosis. The findings supported the use of these techniques in diagnosing patients, though the study also outlined certain constraints and potential directions for future research. In another review, Sabherwal and Kaur29 assessed the effectiveness of ML and DL algorithms for PD detection. Various methods were analyzed, and their limitations and future trends were discussed. Papers in this area were selected from 2013 to 2023. Giannakopoulou et al.30 also reviewed ML algorithms trained on data collected from wearable sensors and IoT devices for predicting PD. The authors surveyed 112 papers in an SLR manner, highlighting the best methods and tools. Moreover, some open issues and future challenges were discussed.

Furthermore, Rana et al.31 offered a review to identify the common and accessible AI algorithms and ML techniques for detecting PD patients. They reviewed 112 papers and examined them in terms of data and methods. The study results indicated that ML and biomarker data had the highest efficiency in PD detection. They also discussed some future trends and challenges. Zhang32 categorized ML-based techniques of PD diagnosis into three groups by assessing 51 papers from 2006 to 2019, namely discrimination between PD and healthy control (HC), differential diagnosis, and early PD detection. The results demonstrated that using ML improves the accuracy of PD identification. Moreover, some challenges and future solutions were introduced. Also, Chandrabhatla et al.33 investigated the common ML models and computational techniques used for detecting PD from 1970 to 2020. This review was carried out using the US National Library of Medicine PubMed database and showed the significant advances in this area.

Table 2 shows that while the period covered by previous reviews is broad, they evaluated a limited number of studies. The works of Salari et al.21, Sigcha et al.23, Giannakopoulou et al.30, and Rana et al.31 cover fewer studies over a more extended period. Moreover, only three papers (Rana et al.31, Keserwani et al.27, and Islam et al.28) offer classifications. It should also be noted that several studies did not examine datasets at all, including those by Rana et al.31, Sabherwal and Kaur29, Khachnaoui et al.20, Khanna et al.26, Giannakopoulou et al.30, Zhang32, and Chandrabhatla et al.33. Among the related works, the review by Skaramagkas et al.24 is similar to ours. That SLR reviewed 80 publications between 2016 and 2023, but its authors did not present any taxonomy or study the applied tools. Islam et al.28 conducted another close effort; however, it introduced a taxonomy that focused solely on handwriting and voice data, examined fewer studies than ours from 2000 to March 2023, and did not present the employed tools. The closest in breadth, Rana et al.31, assessed a small number of studies, did not evaluate datasets, and did not comprehensively address open issues. In addition, it was not written in a systematic way.

Furthermore, the reviews by Amato et al.25, Chandrabhatla et al.33, and Khachnaoui et al.20 did not discuss open issues, did not present any taxonomy or classification, and were not written systematically. Given these gaps, a comprehensive SLR is needed that not only addresses these limitations but also clearly identifies open issues and future research directions. Our work addresses this need through a careful evaluation of 133 publications published between 2021 and April 2024, which strengthens its statistical soundness. The majority of prior reviews were not systematic and were missing some crucial pieces of information. They often failed to consider tools, present any taxonomy or classification, or discuss the merits, downsides, and open issues. To date, a comprehensive evaluation of the various datasets utilized for PD diagnosis with ML has not been published.

Research methodology

Researchers have introduced various techniques for diagnosing PD. This section describes the systematic phases of examining various ML approaches for diagnosing PD. As Figure 2 illustrates, in contrast to a non-structured review process, the SLR process reduces bias while identifying research directions and open issues by following well-defined phases in a methodical sequence for investigating the literature19,34,35. SLR methods rely on well-defined instructions for identifying and analyzing studies and for answering the defined questions. Therefore, this paper utilizes SLR guidelines to design a three-stage review procedure: planning, conducting, and documenting18.

Fig. 2 Overview of the research methodology.

Planning

The planning phase begins by defining the reason for the research and culminates in establishing the methodology for the review protocol in the following manner:

Stage 1: research motivation

An SLR is needed to classify and compare published studies on diagnosing PD by utilizing a characterization framework. While the causes of PD are uncertain, medication and surgery can relieve its symptoms. However, early diagnosis of the syndrome and its progression rates may be crucial in determining suitable treatments. Therefore, scientists have developed various techniques to address this issue and improve PD’s diagnosis accuracy. This has raised the motivation to conduct a systematic review on the application of different ML approaches for diagnosing PD, categorizing them taxonomically, and presenting a concise comparative analysis of the applied datasets and techniques, as well as their potential limitations and challenges.

According to our findings, few papers give an in-depth review of this subject. Since the evaluations of previous studies had limitations, provided no analysis on applied tools, and evaluated insufficient datasets, presenting an SLR is essential. By contrast, our SLR carefully assesses 133 current papers to offer a systematic categorization of datasets utilized in ML for PD diagnosis, addressing significant gaps in prior research. This detailed assessment helps to improve diagnostic methods, achieve our following goals, and guide future advancements in the field.

  • To include papers that have recently been published

  • To outline any potential future works

  • To investigate evaluation factors and applied tools

  • To present a comprehensive taxonomy, concise statistical information, and simulation tools and precisely define the paper selection procedure

Stage 2: research questions

We have developed research questions based on our motivation to conduct an impartial and scientific examination of PD datasets and the ML algorithms utilized in detecting the disease at its early stages. Answering the defined questions identifies available issues on this topic, which could also assist in brainstorming fresh ideas throughout this documentation phase. Table 3 highlights these research questions (RQs):

Table 3 Research questions

Stage 3: review protocols

We have presented and implemented a review protocol, which includes a set of questions, the paper selection method, and data extraction in Stages 1, 2, and 3 of the conducting phase. Following the guidelines in refs. 18 and 36, we sought feedback from a third-party expert with experience in conducting SLRs on ML and healthcare systems to evaluate the protocol before its implementation, and we incorporated the expert’s recommendations into the protocol. In addition, we conducted a pilot study (including about 20% of the papers) to reduce researcher bias and improve data extraction. During this stage, we refined the inclusion/exclusion criteria, search strategies, and the scope of the review. The protocol was also designed based on ref. 37 and our prior experience in conducting systematic reviews38,39,40,41,42,43,44,45,46,47,48,49,50,51.

Conducting

The conducting phase of the research process begins with choosing the papers and ends with the extraction of data. This section consists of two parts: study selection, which explains how papers were sought and selected, and data extraction and synthesis, which describes how the review was carried out.

Stage 1: study selection process

This section describes the process of identifying and choosing articles during the second phase of the SLR. As demonstrated in Fig. 3, we chose papers using a two-step procedure. One of the most crucial requirements of developing a research plan is finding all relevant publications without bias. To achieve this objective, standard search strings were identified and added to the search phrase. A review methodology was also designed to help identify relevant and unbiased research. During the first selection step, all search terms associated with Parkinson and ML algorithms were chosen to ensure that no papers were missed. Throughout this stage, the search terms were applied to the abstracts, titles, and keywords of seven academic online databases:

  • Initial selection: we conducted a thorough search throughout many reputable databases, namely Elsevier, IEEE, Taylor & Francis, Springer, ACM, Nature, and Wiley, using the search phrase supplied below, which included keywords, titles, and abstracts. Our search, specifically utilizing the journal citation reports (JCR) as a source, spans 2021 through April 2024. This timeframe was selected due to the significant variety and volume of publications in PD detection employing ML methods. 729 papers related to the collaborative field of ML and PD detection were found as a result of this search. These papers were published in various sources, including journals, conferences, and book chapters.

Fig. 3 Paper selection process.

Parkinson [AND]

(Supervised <OR> SVM <OR> “support vector machine” <OR> “linear discriminant analysis” <OR> “naive Bayes” <OR> “Machine Learning” <OR> LDA <OR> “Deep Learning” <OR> KNN <OR> “k-nearest neighbors” <OR> “Neural Network” <OR> “Neural Networks” <OR> “Decision Trees” <OR> “Random Forests” <OR> Gaussian <OR> “Latent Variable” <OR> Unsupervised <OR> “Data Clustering” <OR> “Dimension Reduction” <OR> Ensemble)

  • Final selection: the papers extracted in the previous step were examined, and the inclusion/exclusion criteria (Table 4) were applied. Survey papers, theses, non-English papers, books, non-peer-reviewed papers, conference papers, short papers, and book chapters were excluded. The selected papers were thoroughly read, and quality assessment was used to include only those papers that presented evaluating details and approaches. At the end of this stage, 133 relevant studies were chosen for qualitative evaluation.

    Table 4 Inclusion/exclusion criteria
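
The two-step selection described above can be illustrated with a minimal Python sketch. It assembles a generic boolean query from the listed search terms and applies a simplified version of the screening step. The record fields (`title`, `abstract`, `keywords`, `type`, `language`, `peer_reviewed`, `year`), the helper names, and the sample records are hypothetical; each real database uses its own query syntax, and the actual criteria are those of Table 4.

```python
# Terms taken from the search phrase above; everything else is illustrative.
ML_TERMS = [
    "Supervised", "SVM", "support vector machine", "linear discriminant analysis",
    "naive Bayes", "Machine Learning", "LDA", "Deep Learning", "KNN",
    "k-nearest neighbors", "Neural Network", "Neural Networks", "Decision Trees",
    "Random Forests", "Gaussian", "Latent Variable", "Unsupervised",
    "Data Clustering", "Dimension Reduction", "Ensemble",
]

# Publication types excluded in the final selection step (hypothetical labels).
EXCLUDED_TYPES = {"survey", "thesis", "book", "book chapter", "conference", "short paper"}


def build_query(disease="Parkinson", terms=ML_TERMS):
    """Return a generic boolean query: disease AND (term1 OR term2 ...)."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return f"{disease} AND ({' OR '.join(quoted)})"


def matches_search(record):
    """Initial selection: 'Parkinson' plus at least one ML term in the searched fields."""
    fields = [record["title"], record["abstract"], " ".join(record["keywords"])]
    text = " ".join(fields).lower()
    # Plain substring matching is a simplification of a real database search.
    return "parkinson" in text and any(t.lower() in text for t in ML_TERMS)


def passes_screening(record):
    """Final selection (simplified): English, peer-reviewed, allowed type, 2021-2024."""
    return (
        record["language"] == "en"
        and record["peer_reviewed"]
        and record["type"] not in EXCLUDED_TYPES
        and 2021 <= record["year"] <= 2024  # year-level approximation of the timeframe
    )


records = [  # made-up records, for illustration only
    {"title": "Deep Learning for Parkinson speech analysis", "abstract": "A CNN on voice data.",
     "keywords": ["voice"], "type": "journal", "language": "en", "peer_reviewed": True, "year": 2022},
    {"title": "A survey of Parkinson detection with Machine Learning", "abstract": "Review.",
     "keywords": ["review"], "type": "survey", "language": "en", "peer_reviewed": True, "year": 2023},
    {"title": "Gait analysis in older adults", "abstract": "Sensor study.",
     "keywords": ["gait"], "type": "journal", "language": "en", "peer_reviewed": True, "year": 2022},
]

query = build_query()
initial = [r for r in records if matches_search(r)]
final = [r for r in initial if passes_screening(r)]
print(query)
print(len(initial), "found,", len(final), "kept")
```

In this toy run, the survey paper matches the search phrase but is removed by the screening step, mirroring how the initial pool of 729 papers was narrowed to 133.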

Stages 2 and 3: data extraction and synthesis

We acquired data from the online academic databases listed above and organized it in a systematic format based on characterization aspects, following the guidelines above. A structured analysis was created by exploring the limitations and potential of the studied papers, as well as the perspectives on future studies, giving us an overview of the collective study findings. We extracted data from the examined papers, concentrating on dataset types, applied tools/algorithms, advantages, disadvantages, paper main ideas, and evaluation metrics in Section “Classification of the selected studies”. Then, based on the data extracted in Section “Classification of the selected studies”, the results of this study were analyzed, and the findings were discussed and addressed alongside the research questions in Section “Analysis of results”. Furthermore, the review of the papers in Section “Classification of the selected studies” and the data extracted from the studied works in Section “Analysis of results” highlight key challenges, open issues, and future trends in applying ML for PD diagnosis, which are comprehensively detailed in Section “Open issues, challenges, and future trends”.

Documenting

In the documenting phase (see Fig. 2), observations are recorded, and the discussion of threats to validity and limitations is provided in Section “Threats to validity and limitations”. The results are evaluated, visualized, and presented in Section “Analysis of results”.

Classification of the selected studies

This paper presents a classification designed to achieve two major objectives. First, it facilitates an SLR of the extensive body of research on PD diagnosis using ML approaches. Second, organizing the studies into five data-driven categories (illustrated in Fig. 4) enhances readers’ understanding of the main research areas in the field. The proposed classification highlights the important role of datasets in diagnostic results. For instance, the adoption of multimodal datasets demonstrates growing interest in innovative approaches that leverage diverse data sources to improve the accuracy of PD diagnosis. Although alternative classifications exist, our classification, to the best of our knowledge, has not been adopted in earlier reviews. By concentrating on datasets, this work offers a unique perspective that emphasizes both the pivotal role of data in ML performance and the existing gaps in data diversity and availability.

Fig. 4 The classification of PD analysis.

This section also investigates 133 papers to determine their PD datasets, diagnosis approaches, objectives, and applied techniques. In addition, we provide the advantages and disadvantages of these approaches by using the assessment metrics defined in Table 5. Moreover, Fig. 4 illustrates the proposed classification for PD. While presenting a classification for PD is not a trivial task, it allows readers to easily access dataset resources and the related papers, providing a holistic understanding of the field and helping to identify research gaps. However, each researcher may adopt a different approach to categorization due to their unique perspective.

Table 5 Definition of evaluation metrics
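
To make the metrics reported throughout the reviewed papers concrete, the following minimal sketch computes several of them from a binary confusion matrix. It assumes the conventional definitions of accuracy, precision, sensitivity (recall), specificity, F1-score, and Matthews correlation coefficient (MCC), with PD treated as the positive class; the function names and the toy labels are illustrative only, and AUC is omitted because it requires prediction scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary task (positive = PD by assumption)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(y_true, y_pred, positive=1):
    """Compute common diagnostic metrics from true and predicted labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)  # also called recall
    specificity = safe_div(tn, tn + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


# Toy labels (1 = PD, 0 = healthy control), made up for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
m = binary_metrics(y_true, y_pred)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Reporting sensitivity, specificity, and MCC alongside accuracy matters for PD datasets in which patients and healthy controls are unevenly represented, since accuracy alone can look high when a model mostly predicts the majority class.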

Furthermore, the selected papers apply ML approaches to speed up the diagnosis of PD using proper datasets. These datasets are classified into five groups: acoustic data/features, medical imaging, movement data, biomarkers, and multimodal datasets.

Figure 4 presents the categorization used in this review, which is based on the datasets utilized for diagnosing PD. Five major categories were identified, each reflecting the unique characteristics and data extraction methods of their respective datasets.

Acoustic data/features

Acoustic data is crucial for diagnosing PD, as vocal symptoms can be detected early, often before noticeable motor symptoms appear52. Therefore, diagnosing PD through this information can significantly enhance the quality of life for the affected individuals2. Speech-related symptoms of PD can be classified into two categories: dysphonia and dysarthria53. Identifying dysphonic indicators in the early stages can help delay the worsening of the disease. Furthermore, recording voices is a relatively low-cost approach, whereas physical examination can be time-consuming54. Therefore, having sufficient data for training ML models could significantly expedite PD prediction and improve patient well-being. In Table 6, we present data from databases related to acoustic-based papers, including the number of patients and healthy subjects, the gender of patients, and details such as extracted features, audio file format, and the number of samples. Table 7 compares the reviewed studies, highlighting their advantages, disadvantages, main ideas, applied tools, and algorithms. The evaluation parameters employed in acoustic-based papers are analyzed in Table 8. In Section “Review of acoustic-based approaches”, acoustic-based approaches are reviewed. Finally, Section “Qualitative analysis of acoustic-based approaches” presents a qualitative analysis of acoustic-based approaches, examining their strengths, weaknesses, opportunities, and threats.

Table 6 Acoustic datasets
Table 7 A comparison of acoustic data/features papers
Table 8 Acoustic evaluation metrics

Review of acoustic-based approaches

Yao et al.55 proposed a model based on deep convolutional neural networks (DCNN) along with an IP-based whale optimization algorithm (WOA). This study aimed to diagnose pathological speech in PD and cleft lip and palate patients. According to the results, the proposed model provided high precision and accuracy. Also, Khaskhoussy and Ayed56 proposed a method to detect PD by analyzing speech data, using a support vector machine (SVM) and a CNN to classify the speech data. The results showed that this method has high accuracy and specificity. Meanwhile, Celik et al.57 developed the SkipConNet + RF diagnostic model for PD detection via speech signals. The model integrated CNN and RF algorithms: SkipConNet extracted vital speech signal features, and the RF classified them. According to the study, the presented model surpassed other DL and ML methods in accuracy.

Following a similar investigative field, Ali et al.58 introduced an ensemble method named ensemble model with optimal features and sample-dependent base classifiers (EOFSC) for identifying individuals with PD. The technique addressed issues of generalization and low accuracy by analyzing voices. Also, Masud et al.59 developed the crow search and deep learning (CROWD) model for diagnosing and classifying PD. The model employed the adaptive crow search algorithm (ACSA) to select the compressed feature vector and a DL-based autoencoder to generate the compressed feature vector. The outcomes also indicated that the proposed model achieved high sensitivity, specificity, and accuracy. Yücelbaş60 introduced a model to diagnose PD in the early stages using acoustic signals. Based on the results, the information gain algorithm-based k-nearest neighbors (IGKNN) model had high accuracy and recall.

Regarding faster detection of PD with ML approaches, Wang et al.61 introduced a system for diagnosing PD using speech data and DL methods. The results showed that this system was highly accurate. Moreover, Dhar54 designed a two-phase reduction system with the objective of diagnosing PD in its early stages. The first reduction phase comprised ML-based feature selection, whereas the second phase employed unsupervised techniques. The comparisons and assessment findings demonstrated high performance in terms of accuracy and AUC. Li et al.62 introduced a method called two-step transfer learning (TSTL) to distinguish Parkinson’s patients from healthy individuals based on speech patterns. Their method was evaluated with a new and publicly available dataset, and the results showed that it had high accuracy, sensitivity, and specificity.

Moreover, Vital et al.63 presented a model based on neural networks to detect individuals with PD by analyzing their voices. They compared the method with various ML algorithms, including naive Bayes (NB), random forest (RF), and AdaBoost. Based on the results, the accuracy and efficiency of the model were confirmed. In a related study, Jyotiyana et al.64 presented an approach for PD diagnosis using DL techniques. In this study, audio data were used to evaluate the proposed approach, and high accuracy was achieved. Masood et al.65 developed a two-level ensemble-based characteristic selection model aimed at early detection of PD using voice data analysis. In addition, this model was highly accurate. Similarly, García et al.66 offered a high-accuracy two-level ensemble-based attribute selection model aimed at early detection of PD using voice data analysis. In a similar vein, Hireš et al.67 suggested a method based on a CNN ensemble for diagnosing PD using voice data. Additionally, the results showed that this method provides high ROC (AUC), specificity, and accuracy.

Ma et al.68 presented a deep dual-side learning ensemble model based on the deep sample learning (DSL) algorithm, combined with deep feature learning, for recognizing PD via speech. An embedded deep-stacked group sparse autoencoder (EGSAE) extracted high-quality deep features. The results demonstrated the method’s high accuracy. Nevertheless, this approach required further validation and rebuilding of deep samples using a deeper neural network. Quan et al.69 developed a model to diagnose PD via speech signals and an end-to-end learning architecture. The proposed method employed a combination of time-distributed two-dimensional CNN (2D-CNN) and a one-dimensional CNN (1D-CNN) to extract dynamic features from speech signal time series and remove interdependence. Moreover, the method demonstrated high accuracy based on the obtained results.

In a similar vein, Chen et al.70 suggested an IP-based chimpanzee optimization algorithm (IPChOA) to enhance DCNN. Furthermore, the effectiveness of this method was assessed using speech signals from both PD and cleft lip and palate cases. The results showed that this method reached a high level of accuracy. Meanwhile, Khaskhoussy and Ayed71 presented an approach for diagnosing PD using ML techniques and speech processing, while also identifying early changes associated with the condition. They also used features like the mel-frequency cepstral coefficients-Gaussian mixture model (MFCC-GMM). This study demonstrated high accuracy. In addition, Biswas et al.72 introduced an approach for early diagnosis of PD using an ensemble-based ML model called the ensembled expert system for diagnosis of PD (EESDPD). The suggested method had high accuracy, recall, and F1-score.

Wei Liu et al.73 developed a model based on speech features and an ANN to diagnose PD. The results showed that the proposed model has high accuracy, sensitivity, and AUC. Also, Yuan et al.74 investigated the detection of PD utilizing speech signals and ML approaches. Using the ReLU activation function, the authors developed a DNN architecture with multiple hidden layers. In addition, the minimum redundancy-maximum relevance (mRMR) method was used to identify key features. The outcomes showed that the suggested method obtained a high F1-score, accuracy, and MCC. Moreover, Kamalakannan et al.75 evaluated different classification methods for diagnosing PD. J48, SVM, and multilayer perceptron neural network (MPNN) methods achieved the highest accuracy.
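To make the mRMR idea concrete, the following is a minimal sketch of greedy feature selection that rewards relevance to the class label and penalizes redundancy with features already chosen. The criterion is usually stated in terms of mutual information; this sketch substitutes absolute Pearson correlation so that it runs with the standard library only, and it is not the implementation used in the cited study. The function names (pearson, mrmr_select) and the toy data are assumptions made for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def mrmr_select(features, labels, k):
    """Greedily pick k feature indices. `features` is a list of columns.

    Score = |corr(feature, labels)| - mean |corr(feature, already selected)|.
    Correlation stands in for mutual information here (an assumption of this sketch).
    """
    selected = []
    remaining = list(range(len(features)))

    def score(j):
        relevance = abs(pearson(features[j], labels))
        if not selected:
            return relevance
        redundancy = sum(
            abs(pearson(features[j], features[s])) for s in selected
        ) / len(selected)
        return relevance - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: three candidate voice features for eight recordings.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
f0 = [0, 1, 0, 1, 2, 3, 2, 3]
f1 = [0, 2, 0, 2, 4, 6, 4, 6]  # exact rescaling of f0, so fully redundant with it
f2 = [1, 0, 1, 0, 3, 2, 3, 2]  # equally relevant, but less correlated with f0
chosen = mrmr_select([f0, f1, f2], labels, k=2)
```

In this toy case all three features are equally relevant to the label, so the redundancy term decides the second pick: the selection keeps f2 together with only one of the two interchangeable features f0 and f1.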

Saleh et al.76 presented a method for identifying PD based on audio data as well as ML and ANN approaches. Additionally, this method achieved a high level of accuracy. Also, Devarajan et al.77 examined the application of ML methods to enhance the quality of diagnoses in the healthcare field, with a specific focus on PD. In addition, they have devised ML ensembles to identify PD, emphasizing the utilization of nonclinical patient data for early detection.

Guatelli et al.78 offered a technique for diagnosing PD based on neural networks utilizing the extreme learning machine (ELM) and acoustic signal spectrograms. The results demonstrated the practical application potential of this method, particularly considering the lower cost and shorter time required for training compared to traditional CNN-based methods. Additionally, the suggested method had a high level of accuracy. Also, Hireš et al.79 measured the efficiency of two different ML models, namely CNN and XGBoost, for computerized PD detection. The authors evaluated their models in terms of accuracy, specificity, sensitivity, and AUC on four datasets. The results revealed that acceptable performance on one dataset does not necessarily carry over to the others.
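Because the reviewed studies report overlapping but different metrics, it helps to recall how the common ones derive from a single confusion matrix. The sketch below uses the standard definitions of accuracy, sensitivity, specificity, precision, F1-score, and MCC; the function name binary_metrics and the counts are hypothetical and do not come from any reviewed paper.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Common diagnostic metrics from confusion-matrix counts.

    Assumes at least one actual positive, one actual negative,
    and one predicted positive, so no denominator is zero.
    """
    sensitivity = tp / (tp + fn)  # recall on PD cases
    specificity = tn / (tn + fp)  # recall on healthy controls
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts: 40 PD cases detected, 10 missed,
# 45 controls correctly rejected, 5 false alarms.
m = binary_metrics(tp=40, fp=5, tn=45, fn=10)
```

With these counts, accuracy is 0.85 while sensitivity is 0.80 and specificity is 0.90, which illustrates why a single headline accuracy figure can hide how a model treats each class.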

Pah et al.53 evaluated the efficacy of ML approaches in PD identification. They also investigated other voice disorders, such as laryngitis and dysphonia. Moreover, Eguchi et al.80 proposed a methodology that employed a transformer model to distinguish between patients afflicted with PD and those with spinocerebellar degeneration using speech data. The accuracy and AUC of the proposed model were comparatively high. In addition, Dhanalakshmi et al.2 provided a technique that uses speech features and ML techniques to detect early-stage PD. They focused on optimizing classification performance and addressing imbalanced datasets. The results of the proposed method indicated that it has high accuracy and AUC (ROC).

Iyer et al.52 proposed a method leveraging data obtained from recorded sounds and applying ML approaches to distinguish Parkinson’s patients from healthy individuals. Also, the proposed method achieved a high AUC (ROC). Ali et al.81 developed a two-stage diagnostic system based on ML and DL to improve PD diagnosis using speech data. In addition, the proposed system provided acceptable accuracy.
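Since AUC (ROC) recurs as an evaluation metric throughout these studies, a brief sketch of what it measures may be useful. AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name roc_auc and the scores below are illustrative assumptions, not data from any reviewed paper.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for three PD recordings (label 1)
# and three control recordings (label 0).
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a small illustration; rank-based formulations give the same value more efficiently on large datasets.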

Qualitative analysis of acoustic-based approaches

An analytical review of acoustic-based approaches has been conducted, highlighting the key strengths, weaknesses, opportunities, and threats, which are discussed in detail below:

  • Strengths: Our analysis of acoustic-based approaches reveals three considerable advantages. Firstly, they are low-cost. Secondly, they are non-invasive, so they pose no threat to patients. Lastly, the solutions they provide are scalable and can be utilized in real-world settings. Moreover, a particularly important feature is that vocal symptoms may appear years before other symptoms, which makes early detection possible. Accessibility is also outstanding, since these approaches can be delivered as mobile applications, making them widely available and suitable for real-world usage.

  • Weaknesses: On the other hand, the diagnosis models may not work effectively in noisy environments, leading to misdiagnosis. The lack of linguistically and culturally diverse datasets may produce bias in results. On top of this, models in this approach did not combine vocal data with other forms of data, such as movement and medical imaging.

  • Opportunities: Addressing these weaknesses can improve model reliability. Employing noise reduction and advanced techniques in speech analysis is essential. Moreover, using diverse databases that reflect a variety of ethnicities and languages may improve the model’s capabilities. Combining acoustic data with other forms of data that are widely utilized to detect PD may lead to better results.

  • Threats: Several vital concerns should be addressed in future work. As the volume of data increases, maintaining standardization becomes more challenging, which may directly impact the real-world applicability of the models. Additionally, reliance solely on acoustic data could lead to misclassification.

Medical imaging

Neurological disorders directly affect the brain; therefore, a visual brain image can be suitable for detecting a disease82. PET and SPECT are two approaches that can detect PD; however, they are less commonly utilized due to their cost and invasiveness83. Nevertheless, alternative methods, such as dopamine transporter (DAT) scans using 123I-ioflupane and magnetic resonance imaging (MRI), can be employed alongside ML for detecting PD84. Table 9 presents the databases utilized in medical imaging, including details such as the name of the data source, type of images, number of samples, and additional information like the participants’ age and data format. Table 10 compares the papers based on their main ideas, employed tools, and algorithms, as well as their advantages and disadvantages. Additionally, Table 11 provides the critical parameters used in the evaluation process. Section “Review of medical imaging-based approaches” reviews medical imaging-based approaches. In the end, a qualitative analysis of approaches, including strengths, weaknesses, opportunities, and threats, is performed in Section “Qualitative analysis of medical imaging-based approaches”.

Table 9 Medical imaging datasets
Table 10 A comparison of medical imaging papers
Table 11 Medical imaging evaluation metrics

Review of medical imaging-based approaches

Zhao et al.85 presented a CNN-based method utilizing diffusion tensor imaging to assess the diagnostic performance of a hybrid architecture for detecting PD across multiple brain regions. The study also employed a greedy algorithm to combine various areas for final prediction. The proposed method demonstrated a high AUC according to the results. Likewise, Shibata et al.86 suggested a model to identify mild cognitive impairment (MCI) in PD patients using quantitative susceptibility mapping (QSM) images. The study utilized ML algorithms, including light GB, extreme GB, and RF. The results indicated that RF achieved a satisfactory level of performance and accuracy.

Gaurav et al.87 introduced a CNN-based framework for automatically assessing and segmenting neuromelanin in the SN of early-stage Parkinson’s patients. According to the results, the proposed framework achieved high accuracy and repeatability. Furthermore, the method could process large datasets in a significantly shorter time. Also, Nakano et al.88 investigated the effects of motor and non-motor symptoms on health-related quality of life (HRQoL) in Parkinson’s patients and used MRI to identify the associated brain networks. Next, Dünnwald et al.89 suggested a CNN model for automating the extraction of biomarker data from multi-rater segmentation and multi-scale localization to detect individuals with PD. They validated their approach by evaluating the model using metrics such as the DSC and Euclidean distance.
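The two validation metrics named above can be illustrated briefly. The sketch below computes the DSC (Dice similarity coefficient) between a predicted and a reference binary mask, and the Euclidean distance between a predicted and a reference landmark. The masks, coordinates, and the function name dice_coefficient are hypothetical and are not taken from the cited study.

```python
from math import dist  # available in Python 3.8+


def dice_coefficient(mask_a, mask_b):
    """DSC = 2 * |A and B| / (|A| + |B|) for two flat 0/1 masks of equal length."""
    size_a, size_b = sum(mask_a), sum(mask_b)
    if size_a + size_b == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    return 2 * overlap / (size_a + size_b)


# Hypothetical flattened segmentation masks (1 = voxel belongs to the structure).
predicted = [0, 1, 1, 1, 0, 0]
reference = [0, 0, 1, 1, 1, 0]
dsc = dice_coefficient(predicted, reference)

# Hypothetical predicted vs. reference landmark positions in voxel coordinates.
localization_error = dist((10.0, 12.0, 7.0), (13.0, 16.0, 7.0))
```

Here the two masks share two of their three voxels each, giving a DSC of 2/3, and the landmark is off by 5 voxels. DSC rewards overlap, whereas the distance penalizes localization error, so the two metrics are complementary.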

Adams et al.90 presented a technique to predict the motor performance of individuals with PD. They employed CNN to interpret data from a combination of DAT SPECT imaging and clinical measures, including motor segment assessment. The results demonstrated that this DL-based combination improved the prediction of movement performance in Parkinson’s patients. Also, Shin et al.91 presented an algorithm that employs a DL-based CNN to interpret nigrosome 1 susceptibility map-weighted imaging (SMwI) to assess nigrostriatal degeneration in idiopathic PD (IPD). The results demonstrated that their proposed method provides rapid and precise results for IPD diagnosis. Similarly, Xu et al.92 presented a technique using a topological ML approach and resting-state functional MRI (rs-fMRI) data to develop a biomarker for early PD detection and treatment evaluation. Based on the findings, the suggested method reached a high degree of accuracy.

Noella and Priyadarshini93 introduced a DCNN-based model to identify PD and AD using fluorodeoxyglucose PET brain scans. The results showed that the model had high specificity, sensitivity, and accuracy. Furthermore, Camacho et al.94 developed an explainable model to identify Parkinson’s patients based on DL methods utilizing T1-weighted MRI data. The results of the presented model showed high precision and accuracy. In the same vein, Vyas et al.95 introduced two distinct neural network models, a 2D CNN and a 3D CNN, to detect PD using MRI scans. In addition, both models demonstrated acceptable effectiveness and performance.

Yasaka et al.96 proposed a CNN model to assess the potential of detecting PD individuals using parameter weighting and the number of streamlines. Results showed moderate performance in terms of AUC. However, the external validation of the presented model’s performance had not been conducted. Additionally, Dotinga et al.97 developed an SVM-based model to distinguish PD patients from healthy people using I-123 FP-CIT images. This model showed high accuracy, sensitivity, and specificity.

Piccardo et al.98 used a 3D CNN-based method combined with the analysis of brain [18F]DOPA PET/CT scans to diagnose PD. This method showed acceptable accuracy and robustness. Moreover, Sun et al.99 introduced a radiomic DL model. This study used [18F]FDG PET imaging to diagnose PD, and the presented model showed significant accuracy. Nakajima et al.84 proposed an ML approach to detect dementia and PD based on 123I-ioflupane images. According to the results, the proposed approach had a high AUC (ROC).

Huang et al.100 proposed a method for improving the diagnosis of PD through emotional facial expressions and DL techniques. The presented model was evaluated using four datasets and achieved high accuracy. Also, Abdullah et al.101 introduced a framework based on TL to diagnose PD through handwriting analysis. Features gathered from the model were optimized using a GA. The evaluations showed high accuracy and effectiveness of the framework. In the same light, Pang et al.102 employed ML to assess a model utilizing multi-level indicators of resting-state functional magnetic resonance imaging (rsfMRI) for detecting different motor subtypes of PD patients. The findings showed that the proposed model achieved a notable AUC value.

Balnarsaiah et al.103 demonstrated a method for diagnosing PD based on DL approaches applied to MRI data. The Residual network (ResNeXt) architecture was utilized in this study to classify brain MRI images to identify Parkinson’s patients. As a result, the suggested method demonstrated significant accuracy in this research. Also, Xinchun Cui et al.104 presented a method for classifying PD using MRI T2 slices. By combining DL with a multi-branch feature processing module and multi-scale attention guidance, the authors created a method to extract features to enhance classification performance. Based on the outcomes, the proposed method obtained a high percentage of sensitivity, F1-score, accuracy, and specificity, demonstrating its diagnostic efficacy.

Wang et al.105 offered an approach based on DL to detect PD automatically with QSM and T1-weighted information. The approach comprised CNN and squeeze and excitation (SE)-ResNeXt50 models that analyze image data. Evaluation results demonstrated the mentioned model’s performance in terms of AUC. Also, Keles et al.106 offered a DNN model for identifying PD patients from SPECT images. They used 1231 images to validate their models. Outcomes showed the quality of the model concerning specificity, precision, accuracy, F1-score, and sensitivity. Besides, Khachnaoui et al.107 presented a computer-aided system to diagnose PD via pre-trained CNN models, bilinear pooling, and TL. These models were pre-trained on ImageNet. The results showed the model’s performance in terms of accuracy.

Zhang et al.108 presented a method for predicting PD using graph neural networks (GNNs) applied to MRI data. This study addressed two main issues: the efficiency of constructing graphs from MRI data and overfitting on small datasets. Also, Ding et al.83 developed a modified transcranial sonography (TCS) technique employing the DCNN model to predict PD. Further, this model demonstrated higher accuracy, sensitivity, PPV, and F1-score. Moreover, Veetil et al.109 proposed a method using CNN models and T1-weighted MRI data to diagnose PD. In addition, simulations were used to investigate the problem of data leakage, and high accuracy was also achieved. Tran et al.110 developed DL models for diagnosing PD based on retinal fundus images. They achieved acceptable NPV and sensitivity.

Qualitative analysis of medical imaging-based approaches

A focused, critical review of the papers on medical imaging-based approaches identified key strengths, weaknesses, opportunities, and threats, detailed below:

  • Strengths: Reviewing the medical imaging-based approaches yields two clear findings. First, image analysis provides a good understanding of brain visualization and its relation to PD. Second, CNN is one of the most successful methods for PD diagnosis.

  • Weaknesses: However, this approach has three major drawbacks: it is expensive due to the need for expertise and clinical tests such as MRI, the techniques employed are primarily invasive, and it requires high computational resources.

  • Opportunities: Resolving these issues may increase the reliability of medical imaging-based approaches. Examining less invasive or non-invasive techniques could broaden the range of applications. Also, more cost-effective approaches, such as fMRI and DAT scans, can enhance accessibility. Another option is to combine imaging with genetics or other data to increase the accuracy of diagnosis. Also, using ML as an interpreter for data instead of data analysts may reduce costs.

  • Threats: Nevertheless, bias may arise if diversity is not considered when gathering imaging data, so researchers should include global populations in their test samples. Gathering such information also poses logistical and ethical challenges, which need to be addressed.

Movement data

Movement datasets are divided into three categories: gait, tremor, and movement111. Such information is mainly utilized to determine the severity of the condition. First, monitoring gait data is particularly valuable because it can be gathered through wearable devices, which are mostly low-cost112. Also, continuous data can help to find appropriate therapy for the patient and examine its effectiveness. However, it cannot be a good approach for diagnosing the disease. Second, tremors are uncontrollable body shaking and are often misdiagnosed; applying ML to suitable data can reduce diagnostic errors113. The movement datasets utilized in the examined papers are detailed in Table 12, which includes the names and references of the sources, the categories of movement data, descriptions of the features, and information regarding the age and gender of the participants. We tabulate the main ideas, applied algorithms, tools, advantages, and disadvantages in Table 13. Table 14 also includes assessment parameters for the reviewed studies, such as accuracy, sensitivity, specificity, and F1-score. In Section “Review of movement data-based approaches”, movement data-based approaches are reviewed. Additionally, Section “Qualitative analysis of movement data-based approaches” presents a qualitative analysis of movement data-based approaches, including their strengths, weaknesses, opportunities, and threats.

Table 12 Movement data datasets
Table 13 A comparison of movement data papers
Table 14 Movement data evaluation metrics

Review of movement data-based approaches

Vidya and P8 offered a method aimed at detecting and classifying the severity of PD. The proposed approach uses a multi-class support vector machine (MCSVM) along with gait data analysis. This method showed high accuracy and sensitivity. In addition, Nilashi et al.4 suggested a method that combined the deep belief network (DBN) and adaptive neuro-fuzzy inference system (ANFIS) to improve the accuracy of forecasting the unified PD rating scale (UPDRS) and diagnosing PD. The results demonstrated a significant reduction in time complexity together with enhanced prediction accuracy, thereby supporting the early diagnosis of PD through precise and efficient UPDRS prediction. Also, Hua Sun et al.114 presented an approach for forecasting FOG in PD by integrating deep features acquired via the ResNeXt network with manually selected gait features. The results demonstrated that the approach exhibited high specificity, sensitivity, and accuracy.

Ullrich et al.115 developed an algorithmic approach utilizing inertial measurement units (IMU) to evaluate and diagnose Parkinson’s patients’ gait tests. This study aimed to reduce patient interaction with the recording system and reduce the amount of manual data annotation performed by researchers. In addition, this algorithm achieved a high F1-score, recall, and precision. Additionally, Rezaee et al.116 proposed an ML-based approach, incorporating deep transfer learning (DTL) using electromyographic signals for PD diagnosis. Moreover, the presented method achieved high accuracy and sensitivity.

Zhao et al.117 introduced a method to detect the severity level of PD through gait data and the ensemble K-nearest neighbor (EnKNN) algorithm. The suggested method effectively handles the imbalanced data distribution from Parkinson’s patients. Additionally, the proposed EnKNN exhibited favorable performance and accuracy compared to other methods, as demonstrated by the results. Besides, Borzì et al.118 used a multi-headed CNN for detecting freezing of gait (FOG) in PD. The method utilized inertial sensor data. In addition, this study shortened processing times, minimized memory usage, and achieved high accuracy. Likewise, Shcherbak et al.119 introduced an approach for diagnosing early-stage PD (phases 1 and 2) using wearable sensors, movement data, and ML. Overall, distinguishing between healthy individuals and stage 2 patients resulted in improved outcomes, including higher F1-micro scores and precision.

Pedrero-Sánchez et al.120 suggested a multi-branch CNN-based method to classify PD and Alzheimer’s patients from healthy subjects using functional mobility test data. The results showed that the proposed method provided higher accuracy than parameter-based methods. Moreover, Kovalenko et al.113 presented an approach to detect PD in essential tremors utilizing ML techniques by analyzing video data. Further, this approach obtained a high F1-score.

Gazda et al.121 presented a model that diagnosed PD by evaluating people’s handwriting using CNN. Experimental results confirmed the high accuracy of the mentioned model. However, the authors did not investigate their approach with larger datasets or employ any other network architecture. Similarly, Ibrahim et al.122 offered a neural network model capable of predicting PD motions across multiple stages and minimizing delay using tremor data. Based on the evaluation results, the model was highly accurate. Similarly, Kaur et al.123 utilized the data on walking patterns to classify Parkinson’s and multiple sclerosis patients. Additionally, they evaluated different DL and ML methods, while CNN showed higher accuracy.

Lin et al.124 proposed a neural network-based method that utilizes movement data for the early detection of PD and classification of its severity. This model demonstrated high accuracy. Also, Exley et al.125 presented a method for measuring PD symptoms using ML and quiet standing data. In this method, the AUC (ROC) was high. Furthermore, Li et al.126 presented a DL-based framework to identify PD in rats with brain abnormalities. The authors collected 3D movement data and employed DL for classification. The evaluated results indicated the high accuracy of the model.

Cesarelli et al.127 investigated the predictive capability of upper limb features using ML algorithms to differentiate between normal people and those with PD. This study achieved high accuracy. Besides, Alissa et al.128 presented a CNN-based system for detecting PD based on drawing data. As a result of the proposed method, high accuracy was demonstrated. Additionally, Wang et al.129 developed an ML-based hybrid signal-processing approach for the detection and severity estimation of PD. Five different ML methods were evaluated in this study. The accuracy evaluation results showed the high performance of the SVM model.

Ma et al.130 provided an interpretable architecture using DL models and stepping-shoe pressure sensors for the early detection of PD. Results showed that the proposed architecture achieved high identification rates and accuracy. Additionally, Ferreira et al.131 developed an approach aimed at improving PD diagnosis and stage identification. This study evaluated five ML methods and analyzed the spatial-temporal characteristics of gait. The presented approach showed acceptable accuracy and AUC (ROC).

Valla et al.132 extracted tremor-related features from the Archimedean spiral drawing test to improve PD diagnosis through ML methods. The authors employed filter methods like Fisher’s score and wrapper methods like RFE for relevant feature selection. The method demonstrated high accuracy, specificity, and sensitivity in the results. Furthermore, Tavares et al.133 introduced an unsupervised algorithmic pipeline called uTUG, which employed IMUs for motor assessment and PD diagnosis. Their method significantly enhanced precision, sensitivity, and F1-score in detecting timed up-and-go (TUG) assessments using ML algorithms.

Soumen Roy et al.134 introduced an ML model that utilized keystroke dynamics (KD) features during text typing on a conventional keyboard. The model employed a bootstrap-based homogeneous ensemble classification architecture along with ML techniques to detect PD in its de novo and early stages. Among the ML methods examined, XGBoost exhibited superior performance. Moreover, the proposed model exhibited notable sensitivity and specificity. Also, Huan Zhao et al.135 proposed a method to diagnose PD by analyzing distinct gait pattern characteristics. They identified features like asymmetry index, mean, and coefficient of variation from Parkinson’s patients’ gait patterns. In addition, an ensemble DT improved the diagnostic process and accuracy. In the same light, Mirelman et al.136 introduced a method that uses the data from wearable sensors and natural language approaches to identify the gait and mobility criteria of different stages of PD. This study aimed to specify optimal sensor locations for each disease stage, and the results showed high sensitivity, specificity, and AUC for the proposed method.
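As a rough illustration of the kind of gait descriptors mentioned above, the sketch below computes the mean, a coefficient of variation, and an asymmetry index from left and right stride times. Several definitions of the asymmetry index are in use; the one here (absolute left-right difference divided by the left-right average, in percent) is one common choice and is an assumption of this sketch rather than the formula of the cited study. The stride values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean, in percent."""
    return stdev(values) / mean(values) * 100.0


def asymmetry_index(left, right):
    """|mean(left) - mean(right)| relative to their average, in percent."""
    left_mean, right_mean = mean(left), mean(right)
    return abs(left_mean - right_mean) / (0.5 * (left_mean + right_mean)) * 100.0


# Hypothetical stride times in seconds for the left and right foot.
left_stride = [1.10, 1.12, 1.08, 1.10]
right_stride = [1.00, 1.02, 0.98, 1.00]

features = {
    "mean_left": mean(left_stride),
    "cv_left": coefficient_of_variation(left_stride),
    "asymmetry": asymmetry_index(left_stride, right_stride),
}
```

A feature dictionary of this kind could then be passed to any of the classifiers discussed in this section; larger stride-to-stride variability and left-right asymmetry are the quantities these descriptors are designed to expose.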

Kumar et al.137 conducted a study to identify the most effective task in a handwriting dataset and establish a reliable diagnosis method. The authors employed two variants of RNN, a DL technique, based on bi-directional long short-term memory (BLSTM) and LSTM. In addition, the kinematic properties obtained through various ML approaches were examined. Based on the outcomes, the proposed procedure was highly accurate. Besides, Zhao et al.138 introduced a neural network approach to detect PD by classifying individuals’ handwriting. The comparison results revealed that the presented model performs better than its counterparts. Nevertheless, the dataset could be expanded to improve performance.

Talitckii et al.139 presented a method for identifying the most effective exercises for diagnosing PD using ML approaches and wearable sensors. Three of the 15 common exercises with the highest discrimination power achieved a high AUC (ROC) score. In addition, this method could improve PD diagnosis, reduce hospitalization costs, and reduce the time required for diagnosis. Moreover, Chen et al.140 implemented an optimizable model using CNN architecture to detect PD accurately from daily walking and adapt according to the most indicative spatiotemporal motor characteristics. The data was gathered from 100 subjects while walking 10 meters, monitored by five sensors attached to their bodies. Moreover, the results indicated high accuracy and AUC for the model.

Sotirakis et al.112 offered an approach to monitor the progression of motor symptoms in Parkinson’s patients using ML methods. The data used was gathered from wearable sensors. This study obtained an acceptable RMSE. Besides, Varghese et al.141 investigated different ML methods for PD diagnosis and differential diagnosis. Furthermore, they presented a movement dataset from wearable technologies to help develop accurate diagnostic tools. This study achieved acceptable accuracy. Also, Yang et al.142 introduced a method based on multi-level ensemble learning using keystroke dynamics for PD diagnosis. The presented method showed high accuracy. Cuk et al.143 offered an approach for early detection of PD using LSTM neural networks and gait data. The results showed that this approach was highly accurate.

Qualitative analysis of movement data-based approaches

In this section, we critically analyze the selected papers on movement-based approaches. Their strengths, weaknesses, opportunities, and threats are detailed below:

  • Strengths: The qualitative assessment of the selected papers in movement-based approaches led us to conclude that continuous patient monitoring aids in real-time assessment of PD and directly captures hallmark symptoms. The key advantage of movement-based approaches is online data gathering, which is done by applying wearable sensors.

  • Weaknesses: On the contrary, although highly accurate sensors may improve understanding of PD, they are expensive and may not be available to everyone. Capturing data in controlled environments may increase bias and cause other symptoms to be overlooked.

  • Opportunities: One of the key areas researchers should focus on is developing more affordable and high-tech wearable sensors to ensure their availability for everyone. Moreover, combining these data with vocal and biomarker data will help to get reliable and more accurate results. These real-time datasets should also be used extensively for PD progression monitoring and patient management.

  • Threats: However, there are concerns about privacy when patient data are gathered continuously and shared. Also, low-quality sensors directly impact model performance, and the results may be undermined.

Biomarkers

Recent reports on PD suggest that a combination of genetic factors and abnormal brain activity can contribute to the development of the disease6. Analyzing genes and identifying key genetic and brain malfunctions accelerates PD detection, facilitating early diagnosis in potential patients144. This approach may enable the development of targeted therapies that address the specific genetic and brain-related factors associated with PD. Table 15 provides a summary of the biomarker datasets utilized in the reviewed articles, detailing the types of biomarkers, extracted features, demographic information, analytical methods, and sampling types. Table 16 presents the main ideas of the papers, applied algorithms, tools, advantages, and disadvantages. Additionally, Table 17 compares several evaluation metrics, such as accuracy, specificity, and sensitivity.

Table 15 Biomarker datasets
Table 16 A comparison of biomarkers papers
Table 17 Biomarker’s evaluation metrics

Section “Review of biomarker-based approaches” reviews biomarker-based approaches. In the end, a qualitative analysis of approaches, including strengths, weaknesses, opportunities, and threats, is performed in Section “Qualitative analysis of biomarker-based approaches”.

Review of biomarker-based approaches

Arora et al.1 compared different ML techniques to detect PD through amino acid composition and hydrophobicity. In the proposed approach, the recall and F1-score were high. Likewise, Xie et al.6 proposed a model that combined ML techniques, such as RF models and ANN, to diagnose PD. They also investigated the role of immune cell infiltration in PD. Moreover, Göker et al.145 developed a DL-based method for detecting PD in its early stages using EEG signals. To create the automatic model, Welch spectral analysis was combined with BLSTM. Based on the outcomes, this method attained high scores for evaluation criteria such as precision, specificity, MCC, accuracy, sensitivity, and F1-score.

Lal et al.146 developed and evaluated an architectural pipeline for PD diagnosis using resting-state EEG. This paper suggested utilizing ML models with a focus on the KNN classifier. In addition, their method achieved high accuracy. Also, Soojin Lee et al.147 presented a model based on a CRNN composed of RNN, CNN, and gated recurrent units (GRUs) utilizing resting-state EEG. Additionally, CRNN displayed high recall, precision, and accuracy.

Chang et al.148 developed an attention-based sparse graph convolutional neural network (ASGCNN) approach for early PD detection using electroencephalography (EEG) signals. This study uncovered statistically significant disparities between patients with PD and healthy individuals. The proposed method obtained high levels of recall, accuracy, precision, and F1-score, according to the findings of the study. Additionally, Wang et al.149 suggested a urine sample-based method for detecting PD, aiming to identify specific metabolites as biomarkers for PD and to build a predictive model using ensemble ML techniques. This study identified eight metabolites that could distinguish between PD patients and healthy people using a combination of metabolomics and ML assessments. Based on the results, this method had a high AUC and accuracy.

Bhandari et al.144 presented an approach for diagnosing PD using blood-based gene data. This study employed ridge regression and LASSO for feature selection, alongside various ML methods—particularly LR and SVM—for data classification. In addition, the SHAP method was used to identify the essential genes responsible for diagnosing PD. According to the results, the proposed method provided high accuracy. Yang and Huang150 conducted a study to evaluate the efficacy of CNN and SVM in categorizing individuals with PD based on resting-state EEG data. The results demonstrated that the CNN approach outperformed SVM by effectively identifying important features, shortening the training time, and achieving higher levels of accuracy.

Shabanpour et al.151 developed a multivariate and data-driven model utilizing DCNN to analyze EEG data and identify spatial oscillatory patterns associated with PD. The model was intended to improve understanding of the brain physiology in PD while creating clinically interpretable topographical maps. In the same vein, Oliveira Coelho et al.152 proposed a diagnostic model for PD using Hjorth features derived from EEG signals. Patients with PD exposed to auditory stimuli had their data analyzed using SVM, KNN, and RF. Results demonstrated the model’s high accuracy in distinguishing Parkinson’s patients from healthy individuals, particularly when SVM was utilized.
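For readers unfamiliar with Hjorth features, the sketch below computes the three standard Hjorth parameters (activity, mobility, and complexity) for a single channel, using first differences as a discrete stand-in for the derivative. It is a minimal illustration under that assumption and not the feature-extraction code of the cited study; the synthetic sine wave is a hypothetical placeholder for an EEG segment.

```python
from math import pi, sin, sqrt
from statistics import pvariance


def hjorth_parameters(signal):
    """Return (activity, mobility, complexity) of a 1D signal.

    Requires at least three samples and a non-constant signal.
    Derivatives are approximated by first differences, so mobility
    is expressed per sample rather than per second.
    """
    d1 = [b - a for a, b in zip(signal, signal[1:])]
    d2 = [b - a for a, b in zip(d1, d1[1:])]
    var0, var1, var2 = pvariance(signal), pvariance(d1), pvariance(d2)
    activity = var0                             # signal power
    mobility = sqrt(var1 / var0)                # proxy for mean frequency
    complexity = sqrt(var2 / var1) / mobility   # deviation from a pure sine
    return activity, mobility, complexity


# Synthetic stand-in for an EEG channel: a 5 Hz sine sampled at 100 Hz for 2 s.
x = [sin(2 * pi * 5 * t / 100) for t in range(200)]
activity, mobility, complexity = hjorth_parameters(x)
```

For a pure sine wave the complexity is close to 1 and the activity is close to 0.5 (the variance of a unit-amplitude sine), which makes this a convenient sanity check before applying the function to real recordings.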

Hosny et al.153 suggested a DL model founded on CNN-GA-KNN that improved the localization of the subthalamic nucleus (STN) using local field potentials (LFP) in patients with PD. The suggested model used a CNN to extract features and the GA to select features. In addition, KNN was used for classification. Furthermore, the results showed that this model was highly accurate. Additionally, Martinez-Eguiluz et al.154 developed ML models for the early diagnosis of PD using non-motor symptoms, such as autonomic dysfunction and depression. They evaluated the models using two databases, namely PPMI and Biocruces. SVM and MLP demonstrated the most promising results among all the algorithms.

Xu et al.155 proposed a DL-based system that investigated the metabolites and small molecules in saliva to detect PD at early stages (Hoehn-Yahr stage 1-2.5). The algorithm used 312 samples for validation, and the results demonstrated its performance in terms of AUC, sensitivity, and specificity. Also, Zhang et al.156 introduced two methods for classifying clinical sleep EEG data, namely wavelet packet transform with deep residual shrinkage network (WPT-DRSN) and tunable Q-factor wavelet transform with deep residual shrinkage network (TQWT-DRSN). These methods integrated time-frequency analysis and DL, demonstrating promising outcomes in classifying non-stationary signals. In addition, REM sleep behavior disorder (RBD) was investigated in conjunction with PD, and the proposed methods demonstrated high accuracy in early detection of PD and disease tracking.

Dar et al.157 introduced an architecture called 1D-CRNN-ELM, which combined a CRNN and an ELM to detect six fundamental emotions in individuals with PD. The suggested architecture achieved high accuracy in classifying emotions and showed its effectiveness in EEG-based emotion recognition. Also, Nour et al.158 suggested an approach that classified PD using ensemble learning on EEG signals. The models were assessed using several evaluation metrics, such as kappa score, ROC curve, and accuracy. Moreover, Chu et al.159 offered an advanced framework for EEG microstates utilizing DNN to identify patients with PD in the initial stages. Brain regions were also investigated for any probable relationship with PD. The assessments showed that the model could increase accuracy.

Lin et al.160 developed a method for assessing and estimating the severity of impulse control disorder (ICD) comorbidity in patients with PD using ML. EEG measurements were obtained with an inexpensive headset, which enabled the device to be used in routine environments. Moreover, Wu et al.161 investigated potential biomarkers and therapeutic target genes for PD using different ML algorithms, including LASSO and RF, and confirmed the findings using experimental approaches. In another study, Uehara et al.162 introduced an approach for diagnosing PD using sebum RNA profiles, which they analyzed using ML methods. This approach showed high AUC (ROC). Additionally, Aljalal et al.163 demonstrated the efficiency of various entropy measures and discrete wavelet transform (DWT) combined with ML methods for diagnosing PD using EEG data. The suggested method showed high accuracy.

Using longitudinal data from two cohorts, PDBP and PPMI, Dadu et al.164 developed ML-based models to detect distinct subgroups of PD and predict disease progression. Based on the study findings, AUC (ROC) was high, indicating a framework for predicting PD progression up to five years before diagnosis. Besides, Park et al.165 presented a cost-effective method for predicting PD risk and enabling early detection through ML techniques and longitudinal health screening data. Cholesterol levels, blood pressure, and hemoglobin levels were identified as the most critical predictors. Moreover, the proposed method achieved a high AUC (ROC) for the neural network model. Further, Harvey et al.166 introduced an ML-based approach to forecast cognitive outcomes in PD patients, which obtained high sensitivity. Karabayir et al.167 also provided a 1D-CNN-based method for predicting prodromal PD up to five years before clinical diagnosis by using 10-second ECG data. According to the findings, the method showed high specificity. Moreover, Leal et al.168 developed a model to predict which PD patients are at risk of developing dyskinesia using clinical and behavioral data. The presented model, which used the RF classifier, showed high AUC (ROC).

Qualitative analysis of biomarker-based approaches

We analyzed biomarker-based approaches thoroughly and summarize their strengths, weaknesses, opportunities, and threats in the context of PD diagnosis using ML:

  • Strengths: The qualitative review led to two outcomes. First, having a strong biological view of PD may lead to early detection of the disease. Second, these approaches may be utilized for personal treatment and monitoring of disease progression.

  • Weaknesses: Because biomarker data collection is often invasive, it may reduce patients’ willingness to undergo tests. Variability among populations, for example in genes, may hinder the generalization of models. These methods are also expensive, as they require many experts and resources.

  • Opportunities: Resolving these concerns may involve combining biomarker data with imaging and other clinical data, potentially leading to higher accuracy. Employing advanced technologies and methods such as genomics and proteomics may accelerate biomarker identification. Moreover, scientists may use non-invasive approaches such as saliva and urine biomarkers to increase patients’ willingness to participate.

  • Threats: However, the quality of the collected data varies due to differences in data collection and storage methods. Also, there are ethical barriers to regulating and deploying biomarker data.

Multimodal

Multimodal datasets provide a more complete picture for PD diagnosis. These datasets contain biomarkers, medical imaging, movement patterns, and acoustic sounds. Improving the accuracy of early PD diagnosis through refined ML models requires the integration of various data input sources169, as a holistic perspective is essential. Table 18 presents the details of the datasets, including data collection periods, the number of participants, database names, multimodal data types, and supplementary information. Table 19 covers the multimodal dataset-based articles, including their main ideas, advantages and limitations, applied tools, and algorithms. Furthermore, the evaluation metrics are outlined in Table 20. In Section “Multimodal-based approaches”, multimodal-based approaches are reviewed. Section “Qualitative analysis of multimodal-based approaches” provides a qualitative analysis of these approaches in terms of strengths, weaknesses, opportunities, and threats.

Table 18 Multimodal datasets
Table 19 A comparison of multimodal datasets
Table 20 Multimodal evaluation metrics

Multimodal-based approaches

Salmanpour et al.170 identified optimal feature combinations to evaluate and predict subtypes of PD using feature selection techniques and longitudinal datasets. In addition, various feature selection algorithms (FSAs), clustering, feature extraction algorithms (FEAs), and classification algorithms were utilized. Also, based on the findings, merging non-imaging data with SPECT-based radiomic features and the optimal use of hybrid ML systems (HMLs) enhanced the identification and prediction of PD subtypes in the fourth year. Similarly, Severson et al.171 presented a statistical progression model using longitudinal data that accounted for intra-individual variations, medication effects, and inter-individual differences to better understand the heterogeneous symptoms and progression of PD. They also used a contrastive latent variable model and a customized input-output hidden Markov model as part of their method.

Junaid et al.169 proposed an interpretable ML framework that utilizes multimodal data, including medication history, patient characteristics, and motor and non-motor data, to detect and predict early signs of PD. According to the results, the techniques were effective and accurate. Also, Khera and Kumar172 proposed a hybrid strategy to classify the severity of PD based on ML approaches. According to the evaluation results, the model could detect PD accurately. Pahuja and Prasad15 presented modal-level and feature-level frameworks based on DL architectures to improve the diagnosis of PD. This study used multimodal features, including biological (CSF) and neuroimaging features. The results demonstrated that the provided frameworks had acceptable accuracy.

Richa Indu et al.173 introduced a modified KNN algorithm based on handwriting, gait, and voice parameters for PD diagnosis. The authors enhanced this algorithm by incorporating the δ-neighborhood for predicting the class of test samples and the concept of weights. The findings showed that the proposed approach was accurate. In another study, Sarica et al.174 suggested an ML technique using an explainable boosting machine (EBM) to classify SWEDD, PD, and HC. Additionally, they used imaging and clinical data in order to train their model. Despite the relatively small dataset, the findings demonstrated that the model performed remarkably well, achieving a high AUC-ROC score.

Parisi et al.175 proposed two activation functions, namely quantum rectified linear unit (QReLU) and modified-QReLU (m-QReLU), aimed at enhancing the performance of CNN in tasks such as medical image classification, PD diagnosis, and COVID-19 detection. The results indicated that the proposed approach exhibited high accuracy and reliability. Also, Hajianfar et al.176 offered a hybrid ML system to detect two essential genes, namely leucine-rich repeat kinase 2 (LRRK2) and glucocerebrosidase (GBA), using the PPMI database to diagnose PD. Also, several feature selection and feature extraction algorithms were used to reduce the number of variables and tackle overfitting. The evaluation results demonstrated high accuracy in performance.

Kanagaraj et al.177 offered a method for enhancing the accuracy of PD diagnosis by utilizing PPMI data and an ant colony optimization approach. The proposed method produced more accurate predictions with fewer features, enhancing computational efficiency while maintaining high accuracy. Also, Aggarwal et al.178 proposed a method using 1-D CNN and data augmentation to classify non-PD, PD, and scans without evidence of dopamine deficit (SWEDD) and avoid misdiagnosis. Their approach achieved high performance in precision, F1-score, recall, and accuracy. In another study, Templeton et al.179 applied ML to classify PD and its stages through tablet-based neurocognitive assessments. The model applied the CART algorithm to multimodal data, including speech, memory, and evaluations of motor function, and achieved high accuracy. Makarious et al.180 presented a model for diagnosing and managing PD before patients recognize the signs and symptoms. This method was developed to identify PD by combining genetic, clinical, and demographic data with ML methods, such as AdaBoostClassifier and GenoML. According to the results, this method had a high AUC (ROC).

Qualitative analysis of multimodal-based approaches

Through critical review and analysis of the papers using multimodal-based approaches, we listed some of the most important strengths, weaknesses, opportunities, and threats of this category:

  • Strengths: The primary advantage is that both motor and non-motor symptoms are considered together, which supports the diagnosis. Also, in this category, all forms of data, such as acoustic, imaging, movement, and biomarker data, are combined to increase accuracy.

  • Weaknesses: However, these approaches require significant financial resources, as gathering clinical data involves various tests, and acquiring movement datasets may necessitate high-tech devices. They also demand substantial computational resources and expertise, which makes the entire process expensive.

  • Opportunities: Addressing the weaknesses mentioned above may pave the way for future research. Developing new fusion techniques for multimodal datasets may optimize the process. Also, multimodal analysis may reveal new biomarker patterns that were not obvious through single-modal analysis.

  • Threats: Because of model complexity and limited data, the risk of overfitting is high and should always be considered.
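The fusion techniques mentioned under Opportunities can take many forms. As a minimal sketch of one simple option, decision-level (late) fusion combines the outputs of separately trained single-modality models by a weighted average. The modality names, probabilities, and weights below are hypothetical, and this is not the method of any specific reviewed study.

```python
def late_fusion(modality_probs, weights=None):
    """Weighted average of per-modality P(PD) estimates.

    modality_probs: dict mapping modality name -> probability in [0, 1].
    weights: optional dict mapping modality name -> non-negative weight;
             modalities default to equal weight when omitted.
    Only the modalities present in modality_probs are fused, so a patient
    with a missing modality can still be scored.
    """
    names = sorted(modality_probs)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    return sum(weights[name] * modality_probs[name] for name in names) / total


# Hypothetical outputs of three independently trained single-modality models
patient = {"voice": 0.80, "gait": 0.60, "imaging": 0.90}
fused_equal = late_fusion(patient)
fused_weighted = late_fusion(patient, {"voice": 1.0, "gait": 1.0, "imaging": 2.0})
```

Late fusion keeps each model small, which may limit the overfitting risk noted under Threats, but it cannot learn interactions between modalities. Feature-level fusion can, at the cost of a larger model.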

Analysis of results

The results of our systematic review procedure are discussed in this section. Section “Overview of the selected studies” presents an overview of the selected studies, while Section “Objectives” discusses the advantages, limitations, and differences among the various PD datasets and ML techniques. Analyzing classifications used in the dataset and helping to draw a future path for researchers are the ultimate goals of this study.

Overview of the selected studies

The goal of this study is to investigate state-of-the-art PD datasets currently used in diagnosis with the aid of ML methods. To achieve this, we have considered the following complementary questions (CQs):

  • CQ1: Which groups are actively involved in diagnosing PD with ML applications?

  • CQ2: Which publishing channels distributed the most papers?

  • CQ3: How are the publications and studies distributed on diagnosing PD with ML approaches per publisher?

  • CQ4: How are the publications and studies distributed on diagnosing PD with ML approaches per year?

  • CQ5: How do the studies address the Parkinsonian syndrome classification years before diagnosing PD and performing complex classification tasks?

These questions aim to provide a clearer understanding of the current landscape of PD diagnosis using ML and identify future trends and potential research directions.

  • CQ1: Which groups are actively involved in diagnosing PD with ML applications?

After synthesizing and selecting the papers, we extracted the authors’ affiliations. Table 21 presents a comprehensive list of universities and institutes that have contributed at least twice in this field. Researchers from the University of British Columbia in Canada; Skolkovo Institute of Science and Technology in Russia; the University of Surabaya in Indonesia; RMIT University in Australia; FAU in Germany; and Stanford University in the United States published a significant number of research papers on PD diagnosis through ML approaches.

  • CQ2: Which publishing channels distributed the most papers?

    Table 22 shows that most JCR-indexed journal papers on PD diagnosis using ML approaches are published in BSPC, SR, NCA, MTAP, IEEE TONSRE, IEEE JOBHI, CIBM, and NPD. Table 22 provides a list of JCR-indexed journals that have published at least two related papers, including the publisher’s name and abbreviation, as well as the journal’s impact factor.

    Table 21 Active groups and their research focus
    Table 22 Distribution of papers by publication channel
  • CQ3: How are the publications and studies distributed on diagnosing PD with ML approaches per publisher?

    Figure 5 presents the annual distribution of the reviewed papers by publisher. Figure 5A shows that Elsevier is the leading publisher, accounting for over one-third of the reviewed papers. In second place is Springer with 31%, followed by IEEE with 17% and Nature with 11%. Wiley contributed 4%, and Taylor & Francis and ACM each have a 1% share of the total publications. Figure 5B–E shows the trends and number of studies on diagnosing PD with ML for each year of the timeline, broken down by publisher.

    Fig. 5: Annual distribution of the studied papers by publisher.

    Panels A–E present the relative share of different publishers for each year. A Year 2021-2024, B Year 2021, C Year 2022, D Year 2023, E Year 2024. The pie charts illustrate the percentage of papers published by each publisher in the corresponding year.

    The charts show that Wiley published four papers in 2021 but only two in 2023. Elsevier contributed more heavily throughout these years, except in 2024, when it had published only two papers by April. Springer maintained a steady pace, contributing at least ten papers annually, except in 2024, with only five papers published by April. IEEE’s contribution was notable in 2023, with 12 publications. Nature published only one paper in 2021 but maintained a steady contribution of at least four papers annually during the rest of the timeframe. ACM and Taylor & Francis have the smallest share, with just one publication each.

  • CQ4: How are the publications and studies distributed on diagnosing PD with ML applications per year?

Figure 5 illustrates a steady increase in publications, rising from 37 in 2021 to 42 in 2022 and reaching its highest point of 43 in 2023. Although only 11 papers had been published by April 2024, the overall trend between 2021 and 2024 points to growing research activity in this field.

  • CQ5: How do the studies address the Parkinsonian syndrome classification years before diagnosing PD and performing complex classification tasks?

According to Fig. 6, of the 133 studies reviewed, only 20% focused on the early diagnosis of PD, and none specifically addressed the classification of Parkinsonian syndromes. This highlights a significant gap in research within this field. While many studies focused on binary classification tasks, such as distinguishing between PD and HC, these approaches may not adequately capture the complexity and challenges involved in clinical diagnosis. Such complex cases require longitudinal datasets and models that are sensitive to subtle signal changes. Future research should focus on Parkinsonian syndrome classification years before PD is diagnosed to increase the real-world effectiveness of ML in PD diagnosis.

Fig. 6

Distribution of articles on early detection of PD.

However, several studies2,4,15,54,59,65,71,72,77,87,92,124,130,134,143,148,154,155,156,164,165,167,169,170,174,179,180 also focus on more complex tasks, such as diagnosing the early stages of PD using non-motor symptoms. These symptoms include olfactory impairment, depression, and rapid eye movement sleep behavior disorder (RBD), and the studies also draw on data obtained from fMRI, saliva, or blood tests. Additionally, research has aimed to distinguish PD from similar conditions, such as SWEDD. These approaches highlight the potential of ML to tackle various diagnostic challenges beyond simple binary classification.

Table 23 shows that the acoustic data and biomarker categories contain the most papers on the early diagnosis of PD, while the medical imaging category has the fewest articles.

Table 23 Distribution of studies on the early detection of PD across categories

Objectives

In this section, we answer the research questions defined in Section “Planning” using the results of our statistical and analytical investigation.

RQ1: What types of datasets are used to diagnose PD?

In our detailed review in Section “Classification of the selected studies”, we grouped the datasets used in the papers into five categories, namely movement information, medical imaging, acoustic information, biomarkers, and multimodal, as shown in Fig. 4.

RQ2: Which category of datasets is used the most to diagnose PD?

Based on our detailed review in the fourth section and the answer to RQ1, this paper categorizes datasets into five groups. According to Fig. 7, about 25% of the studies utilized movement data, while acoustic data was slightly less common, appearing in 23% of the cases. Medical imaging and biomarkers jointly ranked third, at about two tenths each. Multimodal data, at 11%, is a newer approach that is attracting growing attention from scientists.

Fig. 7

Percentage of papers included in this study for each classification.

RQ3: What tools are used the most in assessing ML approaches in diagnosing PD?

Figure 8 summarizes the tools reported in Tables 6, 9, 12, 15 and 18 and shows the wide range of software that researchers have used in their work. Python accounts for the largest proportion, at 42% of usage, followed by MATLAB at 23%. R was used in 8% of the studies, while FSL ranked lowest with 3%. These data provide insight into the preferred computational tools in ML for assessing PD.

Fig. 8

The proportion of tools utilized in the analyzed papers.

RQ4: What metrics are significantly used to assess the ML techniques in diagnosing PD?

The application of ML to PD diagnosis has been assessed using various techniques and measures. The measures include accuracy, TPR (sensitivity), precision, TNR (specificity), F1-score, AUC (ROC), and MCC. Tables 7, 10, 13, 16 and 19 detail the assessments applied in the reviewed papers. In addition, Tables 6, 9, 12, 15 and 18 detail the benefits and downsides of each categorization approach. For a more detailed view, our statistical analyses are visually represented in Figs. 9 and 10, which show the overall percentage distribution of the evaluation metrics and the distribution by category, respectively. Figure 9 indicates that accuracy is the most frequently reported metric, representing 20% of the cases, followed by sensitivity at 18%. Precision, F1-score, and specificity are also common, with 12% each. Balanced accuracy is at the lowest end with only 1%.

Fig. 9

Percentage of evaluated parameters in studied papers.

Fig. 10

The percentage of evaluation parameters in each category.

According to Fig. 10, studies involving acoustic data mostly focused on accuracy (21.13%), with a similarly strong emphasis on sensitivity (20.42%). A smaller share of 9.86% considered AUC (ROC), and none of the papers evaluated balanced accuracy. For movement data, accuracy accounted for about one-fifth of the reported metrics. In medical imaging, accuracy peaked at 18.05% and sensitivity reached 17.29%, while AUC and balanced accuracy were considered in 9.02% and 0.75% of the cases, respectively. In biomarker research, accuracy drew the most attention at 21.74%, and AUC received considerably more attention (18.26%) than in, for example, the acoustic and medical imaging categories. In the multimodal category, accuracy accounted for 21.05%. Balanced accuracy accounted for just 3.51%, although this was the highest share among all categories, which suggests a relatively more balanced focus across evaluation metrics. Across all categories, accuracy consistently emerged as the dominant metric in PD detection using ML approaches, although other metrics, such as balanced accuracy, F1-score, and AUC, might better represent model performance.

Focusing on only a few evaluation criteria for ML models in PD diagnosis may lead to an inadequate evaluation of a model’s actual performance. Figure 9 indicates that accuracy is the most frequently used criterion, which warrants further investigation. When class imbalance in a dataset is not addressed, accuracy becomes a less effective performance metric. In medical data, the number of positive cases (patients) is typically much lower than that of negative cases (non-patients), so accuracy does not effectively reflect a model’s ability to identify the minority class correctly. Balanced accuracy, which appeared in only 1% of the reviewed articles, is a more suitable measure for evaluating models on an imbalanced dataset. Similarly, F1-score and AUC (ROC) are widely used but remain less common than accuracy. Many reviewed papers in our study used techniques such as oversampling, class weighting, and robust algorithms to handle imbalanced data. Although accuracy and F1-score are often preferred due to their ease of use, it is imperative to employ more robust measures such as balanced accuracy and AUC (ROC) to assess model performance fully.
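The effect of class imbalance on accuracy can be illustrated with a small worked example. The sketch below computes several of the metrics discussed above from confusion-matrix counts, using their standard definitions. The cohort of 10 patients and 90 controls and both classifiers are hypothetical and chosen only to make the contrast visible.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary metrics from confusion-matrix counts.

    Assumes at least one positive and one negative case in the data.
    """
    sensitivity = tp / (tp + fn)            # TPR / recall
    specificity = tn / (tn + fp)            # TNR
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1_denominator = 2 * tp + fp + fn
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * tp / f1_denominator if f1_denominator else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


# Hypothetical cohort: 10 PD patients and 90 healthy controls.
# Classifier A labels everyone as healthy.
always_negative = binary_metrics(tp=0, fn=10, tn=90, fp=0)
# Classifier B finds 8 of the 10 patients at the cost of 9 false alarms.
screening = binary_metrics(tp=8, fn=2, tn=81, fp=9)
```

In this example, classifier A reaches an accuracy of 0.90 without detecting a single patient, slightly above the 0.89 of classifier B. Balanced accuracy reverses the ranking (0.50 versus 0.85), and F1-score and MCC are both zero for classifier A.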

RQ5: What ML algorithms have been considered the most in diagnosing PD?

Figure 11 shows the ML approaches most commonly applied to the diagnosis of PD. In this paper, we provide information based on the following categories: DL algorithm, regression algorithm, ensemble algorithm, reduction algorithm, clustering algorithm, ANN, optimization algorithm, feature selection algorithm, signal processing algorithm, and others. Figure 11 demonstrates that the most frequently used algorithms among all published papers were DL algorithms, which had a 20% share, followed by ensemble algorithms with 16%. SVMs, with 12%, ranked third in usage. ANN, optimization, reduction, and regression algorithms each had usage rates ranging from 7% to 9%. In contrast, the least utilized approaches were signal processing techniques, clustering algorithms, and feature selection methods, with usage rates of 1%, 1%, and 2%, respectively.

Fig. 11

Percentage of algorithms used in reviewed studies.

In addition, algorithm usage across the different data categories for PD diagnosis is shown in Fig. 12A–E. In the acoustic category (Fig. 12A), DL was the most frequently used algorithm, appearing 16 times across the studies. Ensemble methods, SVM, ANN, and optimization algorithms followed, with usage rates of 12%, 11%, 10%, and 9%, respectively. In medical imaging (Fig. 12B), DL algorithms had the highest usage with 21 instances, while reduction algorithms had the lowest. In the movement category (Fig. 12C), DL and ensemble algorithms were each used 17 times, whereas feature selection was used the least. In the biomarker category (Fig. 12D), ensemble algorithms had the highest usage with 17 instances, followed by DL with 13, SVM with 11, and reduction algorithms with 10. In the multimodal category (Fig. 12E), ensemble algorithms ranked first, followed closely by regression algorithms.

Fig. 12: The percentage of algorithms used in each category.

Panels A–E depict the proportion of algorithms employed within the five categories of the proposed taxonomy: A Acoustic, B Medical, C Movement, D Biomarker, and E Multimodal. Each pie chart reflects the proportion of algorithms utilized within the corresponding category, based on the reviewed studies.

RQ6: What validation methods are used in studies diagnosing PD with ML?

Figure 13 shows that the reviewed studies commonly employ evaluation methods such as cross-validation and train-test split. About 49% of the studies used cross-validation to evaluate model performance, 29% used a train-test split, and only 19% used independent test sets to test the generalizability of the models. The limited use of independent test sets is a significant limitation in evaluating generalizability. Furthermore, 3% of the studies did not directly mention the assessment method, raising concerns about the reliability of their results. As a result of this lack of transparency, model performance in real-life conditions may be overestimated. More generally, data leakage is a common pitfall in ML studies. It occurs when the model unintentionally learns from information in the test dataset. To ensure the reliability of the results, appropriate data-splitting methods and careful evaluation strategies should be employed.

Fig. 13

Validation methods used in the reviewed studies.
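One common source of the data leakage discussed above arises when a dataset holds several recordings per participant and the recordings are split at random, so that the same person appears in both the training and the test set. The following is a minimal sketch of a subject-wise split that avoids this. The dataset layout, the identifiers, and the function name are assumptions made for illustration.

```python
import random


def subject_wise_split(subject_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no subject appears in both sets.

    subject_ids: one subject identifier per sample (recording).
    Returns (train_indices, test_indices).
    """
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx


# Hypothetical dataset: 10 subjects with 3 recordings each
subject_ids = [f"S{n:02d}" for n in range(10) for _ in range(3)]
train_idx, test_idx = subject_wise_split(subject_ids)
train_subjects = {subject_ids[i] for i in train_idx}
test_subjects = {subject_ids[i] for i in test_idx}
```

The same principle applies to cross-validation folds and to preprocessing: any step fitted to data, such as normalization or feature selection, should be fitted on the training portion only and then applied to the test portion.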

Open issues, challenges, and future trends

RQ7: What are the major challenges, future trends, and open issues in diagnosing PD with ML?

We propose future research areas categorized into three groups: open issues, future trends, and challenges (see Fig. 14). This categorization is motivated by the growing need to identify clinical and practical approaches for diagnosing PD using ML, as well as by the review and analysis of data from the selected papers. Each group is detailed below. The primary challenge in diagnosing PD via ML approaches is the absence of a predefined method, making it essential to employ ML approaches to evaluate clinical and non-clinical data from potential patients, healthy individuals, and those already diagnosed with PD. Therefore, managing data imbalance, regulating and normalizing data, and overcoming ethical and legal obstacles remain the primary issues in this subject. Additionally, several problems in PD diagnosis remain unresolved, including the lack of advanced methods, the limited use of wearable technology, non-generalizable datasets, and insufficient language diversity in voice datasets. Furthermore, prospects for future research include the use of genetic data, 3D dopamine imaging, and multimodal analysis. This introduction is followed by a detailed discussion of the challenges, potential developments, and unresolved concerns.

Fig. 14

An overview of open issues, challenges, and future trends in diagnosing PD with ML.

Open issues

  • Generalizable Datasets: Many studies have pointed out the lack of appropriate datasets for their evaluations8,57,72,77,84,86,91,97,102,113,115,117,122,124,127,128,129,131,134,135,139,147,148,150,151,153,154,174,175. In ML, one of the most fundamental principles is that the more data available for training, the more accurate the model’s results tend to be. In addition, the authors did not assess their ML methods on any other datasets. Without evidence of how the models perform on other data, the findings may be misleading, even if the models performed successfully on the datasets used. Consequently, larger and more diverse cohorts should be included in research to ensure that the results are both reliable and generalizable to a broader population. The low sample sizes and lack of generalizability remain an ongoing, unresolved issue.

  • Model Generalization: The capacity of models to generalize across diverse demographic and clinical contexts is a common concern. Indeed, the lack of thorough investigation using varied datasets limits our understanding of how well these models perform under diverse conditions. To address this, researchers are looking at ways to strengthen models through cross-validation and by validating them on different external cohorts106,115,117,130,147,148,150,151. Consequently, this ongoing issue may play a significant role in future research.

  • Lack of advanced analysis: The need for more advanced analytical tools to manage complicated data aspects, particularly non-linear correlations and interactions among variables, is often raised. This open issue is commonly addressed by advocating the use of advanced ML and DL frameworks, which would allow for a better grasp of the complexity involved113,115,117,119,123,124,127,128,129,130,131,133,134,135,140,172.

  • Limited language diversity: Most of the presented models that analyze vocal datasets73,74,77,138 have only been assessed in a single language, which may produce inaccurate findings, since the models may not perform well in other languages. Indeed, the set of speech sounds differs across languages; as a consequence, the methods and results are only valid for a restricted group of individuals. In addition, the ethnicity and race of the participants have not been explored as a parameter in almost all of the datasets. These two factors have a direct bearing on the conclusions that may be drawn about a PD diagnosis, since members of the same population might exhibit symptoms of this disease in a similar manner. This open issue should be resolved to obtain more reliable results.

  • Personalized treatment: The development of predictive models using ML has the potential to significantly enhance the early identification and treatment of PD, eventually leading to improved patient outcomes. Among the significant challenges that still need to be solved is the development of individualized prediction models that account for specific characteristics of an individual, such as their age, gender, medical history, genetics, and lifestyle variables. This could lead to more accurate predictions and improved treatment programs tailored to the needs of those who are at risk of developing PD. Specifically, studies74,128,151 raised questions in this regard.

  • Wearable technologies: ML can evaluate patient body vibrations and wearable sensor data to better diagnose PD and monitor disease progression. Another unresolved challenge is using wearable devices that continually monitor motor and non-motor symptoms in PD patients and provide real-time data to adjust therapy dynamically. Wearable sensors, such as smart watches, smart bands, and other available wearable sensors9,181,182,183, can be utilized as a straightforward and accessible diagnostic tool for PD. Nevertheless, substantial work remains in this area. Wearable devices for continuous monitoring pose challenges such as ensuring data accuracy, enhancing patient comfort to improve compliance, securing sensitive health data, extending battery life, and effectively integrating collected data into clinical practice. Addressing these challenges is essential for using wearables to monitor PD and other neurological disorders, enhancing treatment options and patient outcomes. A device should be comfortable, washable, and able to detect disease stages and drug effects. For example, a wearable wristband can collect data continuously over time and identify the various symptoms of PD74,122,131,135.

  • Multimodal analysis: Multimodal ML algorithms are one way to use data from multiple sources to diagnose PD more accurately. The method incorporates data from multiple sources, such as wearable sensors, medical imaging, and clinical data, and applies algorithms optimized for each source. Although there are several studies on various biological signals to evaluate mobility impairments in PD patients, most studies analyze only a single modality. Literature indicates that the multimodal method is more precise than utilizing each modality independently. For example, in184 and185, the authors assessed existing PD staging systems and determined that incorporating additional modalities and functionalities beyond a single modality, such as motor symptoms, is essential. This approach would enable a more objective scoring system, leading to a more comprehensive evaluation of patients’ symptoms and facilitating personalized treatment for each individual186. In general, existing signal processing and classification techniques incorporating data from many sensors have not been adequately evaluated. Although significant progress has been documented in multiple studies, there is currently no multimodal fusion system that can accurately forecast illness severity and track disease progression8,62,92,102,119,131,140,174. This open issue may play a vital role in future research.
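
The generalization and language-diversity issues above can be made concrete with a small evaluation sketch. The Python snippet below is only an illustration under stated assumptions: the function names `leave_one_group_out` and `accuracy_by_group`, the language codes, and the toy labels are hypothetical and are not taken from any of the reviewed studies. It shows one possible way to hold out each language (or cohort) in turn and to report accuracy per group rather than as a single pooled figure.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx), holding out each group once."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    for held_out in sorted(by_group):
        test_idx = by_group[held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


def accuracy_by_group(y_true, y_pred, groups):
    """Return accuracy per group instead of one pooled number."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] += 1
        hits[g] += int(t == p)
    return {g: hits[g] / totals[g] for g in totals}


# Hypothetical toy example: six recordings in three languages (not real data).
groups = ["en", "en", "en", "es", "es", "it"]
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0]

splits = list(leave_one_group_out(groups))
per_group = accuracy_by_group(y_true, y_pred, groups)
pooled = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(per_group)
print(round(pooled, 3))
```

In this toy case the pooled accuracy hides the fact that performance is perfect on one language and poor on the others, which is the kind of gap that single-language evaluation cannot reveal.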

Challenges

  • Data availability: The limited availability of comprehensive datasets of good quality hinders research endeavors. Obtaining longitudinal data, diverse patient groups, and multimodal datasets that include clinical, genetic, and imaging data is a significant difficulty for many research studies. The majority of currently available data are either proprietary or difficult to access, belonging to hospitals or enterprises, and not openly accessible. This challenge affects the robustness, generalizability, and depth of research findings. Without diverse and extensive datasets, researchers cannot validate their models across different conditions, leading to less reliable and applicable results.

  • Regularization and normalization: The regularization and normalization processes are crucial DL tools for PD identification. These strategies ensure the model learns valuable medical data patterns, improves generalizability, and prevents overfitting. This improves diagnostic accuracy and reliability. Even if these methods are not adopted widely by the DL research community, they have the potential to become a fruitful field of study in the following years.

  • Handling data imbalance: An issue that often arises is the presence of imbalanced datasets, in which certain classes or outcomes are underrepresented. As a consequence of this imbalance, models may be biased toward the majority class. This, in turn, may affect the accuracy of predictions produced for underrepresented groups within the population. The inability of a model to accurately represent minority groups might result in diagnostic tools that are less practical and clinical insights that are biased130,154,175.

  • Computational resource limitation: Advanced ML and DL methods often require substantial computational resources, which may not be available in all research settings, particularly in smaller clinics or institutions in developing regions. This restricts the ability of some researchers to employ the most advanced methodologies, potentially leading to slower progress or less innovative outcomes.

  • Ethical and regulatory hurdles: In the healthcare industry, obtaining authentic patient data while safeguarding privacy is a significant challenge. This difficulty is compounded by the lack of balanced neurological disease datasets, which can adversely affect the efficiency of ML models. Furthermore, there are many obstacles to overcome when negotiating the ethical and regulatory environment, particularly when working with sensitive health data and using AI-driven technologies in clinical settings. The time required to obtain necessary approvals and ensure compliance with ethical standards and regulations can delay research progress and the implementation of findings in clinical practice.

  • Clinical adoption and usability: Designing tools and models that are technically sound, user-friendly, and practically applicable in clinical workflows is a common obstacle. Even with strong technical performance, the lack of usability and integration into clinical practice can limit the adoption and impact of research innovations.

  • Real-testbed evaluation: Clinical testing and evaluation of ML models is essential. The move from controlled research to real-world clinical practice involves significant problems. To ensure these models are productive and safe in real-world settings, patient variety, data heterogeneity, and ethical implications must be considered. Data scientists and healthcare practitioners must collaborate to create clinically relevant and accurate models. Healthcare experts can illuminate PD diagnosis, patient care, and the nuances of the disease. Early identification and management of PD can improve the lives of patients with this severe disorder.

  • Robust validation methods: To address the identified methodological challenges, future studies should emphasize robust validation methods, such as external validation using completely independent datasets and k-fold cross-validation. Furthermore, clear documentation of the data preprocessing steps and the dataset splitting methods is essential to reduce the risk of data leakage. The use of standard datasets to compare model performance, along with advanced techniques such as nested cross-validation, can enhance the validity of results. By concentrating on these methods, researchers can contribute more effectively, reproducibly, and transparently to this field.
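
To illustrate the validation and data-leakage points above, the following is a minimal Python sketch, not a method taken from the reviewed papers. It assumes synthetic two-feature data and a simple nearest-centroid classifier as a stand-in for any ML model; all function names (`stratified_kfold`, `fit_scaler`, `cross_validate`, and so on) are hypothetical. The key point it tries to show is that the normalization statistics are estimated from the training fold only and then applied to the held-out fold, so no information from the test samples leaks into preprocessing.

```python
import random
import statistics


def stratified_kfold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions preserved."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def fit_scaler(rows):
    """Estimate per-feature mean and standard deviation (z-score parameters)."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # avoid division by zero
    return means, stds


def transform(rows, scaler):
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def fit_centroids(rows, labels):
    """Nearest-centroid model: one mean vector per class."""
    centroids = {}
    for cls in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [statistics.fmean(c) for c in zip(*members)]
    return centroids


def predict(rows, centroids):
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda c: dist2(r, centroids[c])) for r in rows]


def cross_validate(X, y, k=5, seed=0):
    """Return one accuracy per fold; the scaler is fitted on training data only."""
    scores = []
    for train, test in stratified_kfold(y, k, seed):
        scaler = fit_scaler([X[i] for i in train])  # training fold only
        X_train = transform([X[i] for i in train], scaler)
        X_test = transform([X[i] for i in test], scaler)
        model = fit_centroids(X_train, [y[i] for i in train])
        pred = predict(X_test, model)
        scores.append(sum(p == y[i] for p, i in zip(pred, test)) / len(test))
    return scores


# Synthetic features for illustration only (not patient data).
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(20)]
y = [0] * 20 + [1] * 20

scores = cross_validate(X, y, k=5)
print(scores)
```

Nested cross-validation would wrap a second, inner loop of the same kind around the training fold to select hyperparameters, and external validation would repeat the final evaluation on a completely independent dataset.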

Future trends

  • Multimodal brain MRI: By revealing the structural and functional changes in the brain associated with PD, multimodal brain MRI can enhance the accuracy of PD diagnosis. In addition to aiding in early diagnosis and identifying prodromal stages of PD, advanced MRI techniques, when analyzed with ML, can improve diagnostic precision by integrating additional data layers from various MRI modalities187. Moreover, integrating MRI data with additional biomarkers, such as blood or cerebrospinal fluid indicators, through the use of ML models has the potential to dramatically improve the early identification and differential diagnosis of PD. The combination of multimodal MRI and ML offers new avenues for improving PD diagnosis.

  • 3D Dopaminergic imaging: Future research in PD should focus on developing 3D Dopaminergic imaging to map brain dopamine activity more precisely. This method has the potential to increase the accuracy of early diagnosis, develop individualized treatment methods, and evaluate the effectiveness of treatment procedures in real-time. While accurate visualization of dopaminergic neurotransmitter oscillations is expected to enhance the diagnosis of PD progression, it may also aid in the development of more effective medicines. To gain a better understanding of PD diagnosis and progression, three-dimensional maps of dopaminergic neurotransmitter activity in the brain are produced using advanced imaging methods87,104.

  • Genetic and biomarker discovery: Future research on PD should focus on genetic profiling and biomarker discovery to enhance diagnostic precision and personalized treatments. Researchers can identify unique genetic markers and better understand their influence on disease progression and treatment response by analyzing genetic data across diverse populations and correlating it with clinical outcomes. Through the use of this integrated method, new treatment targets may be discovered, enabling earlier diagnosis and more successful, individualized therapy of PD. To facilitate an earlier and more accurate diagnosis, it is necessary to carry out genetic profiling to find biomarkers predictive of the progression of PD54,86,96,102,107,144.

  • Drug effect: The use of ML in research on the influence of medications on neuromelanin in PD has the potential to dramatically improve the accuracy and speed of these studies. ML algorithms can analyze complex imaging data to find subtle changes in neuromelanin content and distribution in the SN as a result of therapies. These algorithms may be trained on datasets that are generated by modern 3D imaging techniques, which enables more accurate segmentation and analysis. In addition, ML models may be evaluated over a wide variety of external cohorts to evaluate their robustness. This helps ML models to accurately predict and generalize the effects of drugs on neuromelanin. This approach has the potential to develop therapeutic solutions that are more precisely targeted and a more comprehensive understanding of the progression of PD87.

  • Symptom variability: The diagnosis of PD and the subsequent surveillance of its progression may prove challenging due to the extensive range of individual symptoms. As a result of the fact that symptoms might differ in terms of their presentation, frequency, and intensity, developing an applicable diagnostic paradigm remains a challenging endeavor. This issue may be resolved, and PD diagnosis and therapy could be significantly enhanced by integrating a variety of data types into ML models. These data types may include imaging data, audio recordings, and clinical evaluations. The evaluation of all the patient symptoms can lead to the development of more personalized and effective medicines.

  • Advancing early detection and classification: Future studies should prioritize the early detection of PD before severe symptoms appear, as well as the differentiation of various Parkinson’s syndromes. Researchers can improve generalization across different populations by employing longitudinal data and multimodal datasets, along with advanced ML approaches, such as transfer learning and ensemble methods. Additionally, exploring tasks beyond simple binary classification—such as tracking disease progression and classifying subsets—holds significant potential for enhancing clinical applications.

  • Adoption of evaluation metrics: Robust selection of evaluation criteria is essential for the development of PD diagnosis models. Although accuracy is widely used as a metric, its limitations on imbalanced datasets highlight the need for alternative evaluation measures. Studies should focus on metrics such as F1-score, balanced accuracy, AUC (ROC), and MCC, which provide a more comprehensive assessment of model performance and account for class imbalance. Furthermore, researchers should justify their selection of specific criteria to enhance transparency and improve research reproducibility. This approach leads to standardization of reporting criteria and facilitates comparability of results among various studies.
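
As a brief illustration of why accuracy alone can mislead on imbalanced data, the sketch below computes accuracy, balanced accuracy, F1-score, and MCC from a binary confusion matrix using their standard definitions. The function name `binary_metrics` and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here for simplicity. AUC (ROC) is omitted because it requires continuous scores rather than hard predictions.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute common binary metrics; label 1 is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical imbalanced example: 10 positive and 90 negative cases.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 2 + [0] * 8 + [1] * 2 + [0] * 88  # TP=2, FN=8, FP=2, TN=88

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case accuracy is 0.90 even though only 2 of the 10 positive cases are detected, whereas balanced accuracy (about 0.59), F1-score (about 0.29), and MCC (about 0.27) expose the weak performance on the minority class.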

Threats to validity and limitations

This study aims to offer a systematic and comprehensive review that compares and classifies various ML methods for diagnosing PD. Although the results of an SLR are typically reliable in several respects188, limitations in such papers are inevitable34. Consequently, the most substantial limitations of this study are highlighted below.

  • Papers were selected from multiple well-known databases, including Elsevier, IEEE, Taylor & Francis, Springer, ACM, Nature, and Wiley, but we cannot certify that all related papers were selected. As a result, as described in Stage 3.2 of the conducting phase, it is possible that some articles were overlooked throughout the paper selection process.

  • We classified the selected papers into five groups: acoustic data, biomarkers, movement data, medical imaging, and multimodal dataset. Alternative categorizations are possible, though.

  • This SLR is organized around six fundamental questions, while there may be other important inquiries to consider.

  • Despite the extensive literature on using ML approaches for diagnosing PD, this SLR focused on papers listed in the JCR, disregarding reputable conference papers. Furthermore, articles published nationally, book chapters, short papers, conference papers, editorial papers, and works written in languages other than English were not considered.

  • This SLR can be considered highly credible due to the defined review protocol, adherence to a systematic process, and collaboration with multiple researchers in this research.

Conclusion

A comprehensive SLR on diagnosing PD based on ML is conducted in this study. This paper examines the use of well-known ML algorithms, as well as the employed datasets, repositories, applied tools, evaluation factors, validation methods, and relevant algorithms, comparing their advantages and disadvantages. In the first phase, 729 papers were collected based on a research query from 2021 to April 2024. The top 133 papers were chosen for the investigation based on methodology and inclusion/exclusion criteria. Elsevier and Springer together account for two-thirds of the publications in this field, with the former contributing 35% and the latter 31%. Nature contributed only 11% of the papers. IEEE accounted for about one-sixth of all publications. On the other hand, Taylor & Francis and the ACM each recorded the smallest share, at barely one percent. Our study addressed a significant gap in research on early detection of PD, as only 20% of the reviewed articles focused on this aspect. We proposed a categorization in which the selected papers were organized into five unique groups: acoustic data (23%), biomarkers (20%), medical imaging (20%), movement data (26%), and multimodal (11%). Statistical analysis reveals that 20% of the papers aimed to improve accuracy, while 18% focused on enhancing sensitivity. Python and MATLAB were the preferred tools, with 42% of publications using Python and 23% using MATLAB. The most commonly used algorithms were DL-based algorithms, which were utilized in 20% of the selected papers. Based on the reviewed papers, nearly half of the studies (about 49%) utilized cross-validation methods to assess model performance, while 29% depended on the train-test split technique. The study emphasized open challenges and future trends in identifying PD using ML techniques.
Open issues in this field include generalizable datasets, model generalization, lack of advanced analysis, limited language diversity in vocal datasets, personalized treatment, wearable technologies, and multimodal analysis. Additionally, significant challenges in this field include data availability, regularization and normalization, handling data imbalance, computational resource limitation, ethical and regulatory hurdles, clinical adoption and usability, and real-testbed evaluation. Future trends may focus on multimodal brain MRI, 3D dopaminergic imaging, genetic and biomarker discovery, drug effect, and symptom variability.