Abstract
We aimed to develop a convolutional neural network (CNN) model with multi-task learning to predict the onset of levodopa-induced dyskinesia (LID) in patients with Parkinson’s disease (PD) using baseline [18F]FP-CIT PET images. In this retrospective, single-center study, 402 patients were classified based on whether they developed LID within 5 years after starting levodopa (within 5 years: n = 134; beyond 5 years or none: n = 268). The proposed CNN model achieved a mean AUROC ± SD of 0.666 ± 0.036. Model-derived probabilities were also incorporated into a Cox regression model, yielding a mean concordance index (C-index ± SD) of 0.643 ± 0.046, significantly outperforming the model based on specific/nonspecific binding ratios of striatal subregions (C-index = 0.392 ± 0.036) in four of five test configurations. These results suggest that model-extracted features from [18F]FP-CIT PET carry prognostic value for LID, although further performance improvements are needed for clinical application.
Introduction
The incidence of Parkinson’s disease (PD), the second most prevalent neurodegenerative disease following Alzheimer’s disease, is rapidly increasing worldwide1. The increasing global burden of PD in the absence of disease-modifying therapy mandates reliance on symptomatic treatment, including dopaminergic medication, and surgical intervention in advanced stages1,2,3. Levodopa is the gold standard treatment for motor-symptom alleviation, although prolonged levodopa therapy induces serious, debilitating motor complications, such as wearing off and levodopa-induced dyskinesia (LID)2,4. LID presents as dose-related hyperkinetic involuntary movements secondary to dopaminergic treatment, reduces the quality of life in PD5, and afflicts approximately 40% of patients within 4–6 years of pharmacotherapy6. Depending on interindividual variations, including risk factors such as younger age at PD onset, higher levodopa dosage, favorable medication response, female sex, greater motor and nonmotor burdens, genetic risk score, and striatal dopamine-depletion pattern, LID presents at different timepoints during the disease course7,8,9,10. Besides clinical characteristics that are easily obtained through history-taking and neurological exams, striatal dopamine depletion can be quantified as reduced radiotracer uptake on presynaptic dopamine imaging with single photon emission computerized tomography (SPECT) and positron emission tomography (PET)11,12.
Machine learning (ML) and deep learning (DL) techniques have been applied to SPECT and PET images of PD. ML and convolutional neural networks (CNNs) have achieved a pooled area under the receiver operating characteristic curve (AUROC) of 0.96 for identifying PD from normal controls and 0.93 for distinguishing PD from atypical parkinsonism using presynaptic dopamine PET imaging13. Further advancements in multilabel classification networks have enabled the classification of PD, multiple system atrophy, progressive supranuclear palsy, and normal controls14. Using both clinical and imaging features of dopamine transporter (DAT) SPECT in large public longitudinal datasets of PD, such as the Parkinson’s Progression Markers Initiative (PPMI) database, has enabled the prediction of motor prognosis and suggested new PD subtypes15,16,17,18. Additionally, although few studies have explored the use of DAT imaging in multi-task learning (MTL), MTL has been applied in various medical imaging tasks, including the diagnosis and segmentation of COVID-19 from chest radiography or CT scans, and tumor classification and segmentation in 3D-automated breast ultrasound images19,20,21. Compared to single-task learning, MTL allows DL models to learn more generalized features by smoothing out task-specific noise and to concentrate on important semantic features shared by multiple tasks22.
Studies using conventional statistical methods showed that, compared to patients without LID, those with LID exhibit more pronounced uptake reduction in the sensorimotor striatum or putamen, with greater right/left asymmetry in caudate uptake7,11,12,23,24,25,26. To predict LID using the PPMI database, ML algorithms were applied to demographics, motor and nonmotor symptom severity scores, and striatal uptake values from DAT SPECT27,28. However, DL models for predicting LID occurrence using presynaptic dopamine imaging in an end-to-end manner are scarce. In end-to-end learning, DL models automatically extract image features from the image file and target those relevant to the primary outcome without requiring imaging experts or experience, which makes them more applicable in actual clinical settings. Rather than manually calculating uptake values of striatal subregions or interhemispheric asymmetry, these models learn from low- to high-level features of the entire image.
In this study, we developed an MTL CNN model, combining classification and reconstruction tasks, using baseline presynaptic dopamine PET images ([18F]FP-CIT PET) to classify patients with PD into those who developed LID within 5 years of starting levodopa treatment (wLID group) and those who did not (woLID group). Predicted probabilities from the CNN model, denoted as DeepScores, were employed in the Cox regression analysis to predict the disease duration without LID. Explainable artificial intelligence (AI) techniques were applied for visualization and quantification of image and clinical feature importance.
Results
Patient characteristics
Of the 3102 patients who visited the movement disorder clinic at Asan Medical Center during the study period, 402 (wLID: 134 [33.3%]; woLID: 268 [66.7%]) met the inclusion criteria (Fig. 1) and were enrolled. Table 1 presents their initial clinicodemographic features and specific/nonspecific binding ratios (SNBRs) of PET images. Compared to the woLID group, the wLID group had a younger age at onset, higher Hoehn and Yahr (H&Y) scales, and higher Unified Parkinson’s Disease Rating Scale (UPDRS) Part 3 scores for finger taps, leg agility, and rigidity. No significant intergroup differences were observed in the SNBRs of the ventral striatum (VS), anterior caudate (AC), and posterior caudate (PC), whereas the SNBRs of the anterior and posterior putamen (AP and PP, respectively) were lower in the wLID group.
Evaluation of DL and ML models
Supplementary Table S1 and Fig. 2 present the performance of CNN and ML models. A total of three CNN models (image-only single-task, image-only MTL, and image-clinical variable MTL) and three ML models (logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost)) were compared using five hold-out test sets derived from five different training/test data configurations. Image-only models were based solely on PET images, whereas image-clinical variable models also incorporated clinical features listed in Table 1. Details of the clinical variables used for model training are provided in the Methods section. The image-only MTL model generated a higher mean AUROC (mAUROC) (standard deviation, SD) of 0.666 (0.036) than the image-only single-task model at 0.643 (0.022), although the difference was not statistically significant (p = 0.12). Adding clinical variables to the image-only MTL model, resulting in the image-clinical variable MTL model, further increased the mAUROC (SD) to 0.694 (0.034), but the difference remained insignificant (p = 0.12). In contrast, with the addition of clinical variables, all ML models showed improved mAUROCs (LR: p = 0.02, RF: p = 0.003, XGBoost: p = 0.02), suggesting that the clinical information compensated for the possible insufficiency of SNBRs in predicting wLID. The mAUROCs did not significantly differ across ML models for both image-only and image-clinical variable inputs.
Center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; diamond-shaped points, outliers. CNN MTL convolutional neural network multi-task learning, LR logistic regression, RF random forest, XGBoost extreme gradient boosting, AUROC area under the receiver operating characteristic curve. P-values: <0.05 (*), <0.01 (**).
Among all models trained with image-only data, the image-only MTL model achieved the highest mAUROC (image-only MTL vs LR: p = 0.18, image-only MTL vs RF: p = 0.08, image-only MTL vs XGBoost: p = 0.03). For models trained with image and clinical variables, the RF achieved the highest mAUROC, without significantly differing from the image-clinical variable MTL model (RF vs image-clinical MTL: p = 0.51, RF vs LR: p = 0.15, RF vs XGBoost: p = 0.78). Additionally, the mAUROC of the image-only MTL model did not significantly differ from any of the three ML models trained with both image and clinical variables (image-only MTL vs LR: p = 0.34, image-only MTL vs RF: p = 0.19, image-only MTL vs XGBoost: p = 0.37).
Explainable artificial intelligence
Figure 3a shows the activation map of a PET image from a patient correctly classified into the wLID group by the image-only MTL model. To visualize the distribution of activation values across all images, we generated violin plots for intergroup comparison of the mean activation values of ROIs in the wLID and woLID groups (Fig. 3b, c). In the structural atlas, activation values were generally high in the AC, with the most significant intergroup differences observed in the AP and PP. In the connectivity atlas, all three regions of interest (ROIs) showed distinct intergroup activation distributions, with the greatest difference in the sensorimotor striatum.
a Example of an activation map overlaid on the corresponding PET image. b Violin plot of the mean activation values of the five regions of interest (ROIs) based on the structural atlas (center line, median; lines above and below the center line, upper and lower quartiles). c Violin plot of the mean activation values of the three ROIs based on the connectivity atlas. Vent. ventral, Ant. anterior, Pos. posterior. The ROIs are shown on a PET image next to the violin plots. P-values: <0.05 (*), <0.01 (**), <0.001 (***).
Figure 4 presents the SHapley Additive exPlanations (SHAP)29 values for a test set from the RF model trained with image only (Fig. 4a) and with addition of clinical variables (Fig. 4b), which was selected for achieving the best AUROCs of 0.682 and 0.792, respectively, among all ML models. Consistent with the activation map trends, SNBRs of AP and PP ranked high in importance, with higher SNBRs constituting strong indicators for a slower onset of LID. Among clinical variables, key contributors to LID presentation included younger age at onset, high UPDRS Part 3 finger-tapping score, and high H&Y scale.
DeepScore as a Time-to-Event Predictor
Supplementary Fig. S1 and Table S2 present Kaplan–Meier curves and results of the log-rank test, comparing the wLID and woLID groups classified by the image-only and image-clinical variable MTL models. Table 2 presents the results of Cox regression analysis. The mean (SD) concordance indices (C-indices) across the five test sets were 0.623 (0.041) for the Cox model with DeepScores from the image-only single-task model, 0.643 (0.046) for the image-only MTL model, and 0.652 (0.045) for the image-clinical variable MTL model. The SNBR-only Cox model showed a mean (SD) C-index of 0.392 (0.036), while the model incorporating SNBRs with clinical variables yielded a lower C-index of 0.348 (0.053), which may be attributed to overfitting. Although the three CNN models showed no significant differences in C-indices, they outperformed the Cox models fitted with SNBRs or with SNBRs plus clinical variables in at least four out of five test sets.
The mean (standard deviation) of the C-indices of the five test sets are provided. The differences in the C-indices are calculated by subtracting the C-index of the model in the second row from that of the model in the first row.
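To make the C-index values above easier to interpret, the following is a minimal sketch of Harrell's concordance index in plain Python. It is an illustration only and not the implementation used in this study; the function name and the handling of tied event times (such pairs are simply skipped) are assumptions. A value of 0.5 corresponds to random ordering, so values below 0.5 indicate a risk ordering that is worse than chance on the evaluated set.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (illustrative sketch).

    times:       follow-up time per patient
    events:      1 if the event (e.g. LID) was observed, 0 if censored
    risk_scores: higher score means the event is expected earlier
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b]:
            continue  # assumption: pairs with tied times are skipped
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied scores count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

In this sketch, a DeepScore-like probability could be passed directly as `risk_scores`, since a higher predicted probability of early LID should correspond to a shorter LID-free duration.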
Medication Effect on LID
The wLID group had significantly higher levodopa-equivalent daily dose (LEDD) than the woLID group up to the third year, with no significant differences in the fourth and fifth years (first year: p < 0.001; second year: p < 0.001; third year: p = 0.02; fourth year: p = 0.38; fifth year: p > 0.99), based on the Mann–Whitney U test after Bonferroni correction for multiple comparisons (p-values multiplied by 5). In contrast, the woLID group had significantly higher cumulative levodopa-equivalent dose (LED) in the fifth year compared to the wLID group, with no significant differences in the earlier years (first year: p = 0.59; second to fourth years: p > 0.99; fifth year: p = 0.01), using the same test and correction (Fig. 5).
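The Bonferroni adjustment described above (p-values multiplied by 5, one per year) can be sketched as follows. The function name and the raw p-values in the usage note are hypothetical; capping adjusted values at 1 is an assumption, although it would be consistent with the reported "p > 0.99" entries.

```python
def bonferroni(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

For example, with five hypothetical raw p-values `[0.002, 0.5, 0.1, 0.04, 0.9]`, the second and fifth adjusted values would be capped at 1, while the fourth would no longer fall below 0.05 after adjustment.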
Given differences in LEDD and cumulative LED between the wLID and woLID groups, we evaluated their impact on LID occurrence using Cox regression models (Supplementary Table S3). Building on the previous survival analyses with the same SNBR and clinical variables, we added LEDD, LEDD × log(time), cumulative LED, and cumulative LED × log(time). Among all variables, only LEDD was a significant predictor of LID (hazard ratio, HR: 1.0052, 95% CI: 1.0008–1.0097; p = 0.02). Because LEDD and cumulative LED are conceptually related, despite their low variance inflation factors (1.6 and 1.68, respectively), we fitted two separate models: one with LEDD and its interaction term, and another with cumulative LED and its interaction term. In the latter, neither cumulative LED nor its interaction term was significantly associated with LID. In contrast, higher LEDD was linked to increased LID risk (HR 1.0084, 95% CI: 1.004–1.0127; p < 0.001), with a decreasing effect over time (LEDD × log(time): HR 0.999, 95% CI: 0.9984–0.9996; p = 0.002).
DeepScores from the image-only single-task, image-only MTL, and image-clinical variable MTL models were each used in separate Cox regression models, alongside clinical variables, LEDD, and LEDD × log(time). While SNBRs of the five ROIs showed no significant association with LID occurrence under identical experimental conditions, all three DeepScores significantly increased LID risk (image-only single-task: HR 604.2, 95% CI: 15–24327.9; p < 0.001; image-only MTL: HR 32.7, 95% CI: 1.5–691.2; p = 0.03; image-clinical variable MTL: HR 614.3, 95% CI: 75.5–5000.1; p < 0.001). Cumulative LED and its interaction term were excluded, as prior analyses showed no significant association. These models were fitted using DeepScores from all five test sets predicted by their respective models. To validate consistency, we also fitted Cox models for each test set individually, which showed similar trends with varying p-values (Supplementary Table S4).
Discussion
In this study, we applied a CNN model with MTL on [18F]FP-CIT PET images to predict LID onset in PD within 5 years of starting levodopa. The model-predicted probabilities, named DeepScores, were used for binary classification of patients into wLID and woLID groups, and subsequently, in a Cox regression model to predict LID-free disease duration. Training a CNN model with MTL of classification and reconstruction in parallel, where the shared encoder and the separate task heads were trained simultaneously, along with integration of clinical features, led to increased mean AUROCs and C-indices. However, the degree of improvement varied among different training and test set configurations and was not sufficient to reach statistical significance.
In the binary classification setting, the image-only CNN models, whether trained in a single-task manner for classification only (image-only single-task) or in an MTL framework combining classification and reconstruction (image-only MTL), achieved mAUROCs comparable to that of the CNN model taking both image and clinical variables as inputs (image-clinical variable MTL). In contrast, all three ML models showed a significant increase in mAUROCs after addition of clinical variables to SNBRs. This finding suggests that the image-only CNN models were able to extract imaging features that, in the case of the ML models, required compensation through clinical variables. Furthermore, the activation maps of our image-only MTL model visualized the salient regions for the prediction of the wLID group. The ROIs with the largest differences in average activation values between the wLID and woLID groups were the AP and PP in the structural atlas and the sensorimotor striatum in the connectivity atlas. These results are consistent with the SHAP values obtained from the RF model, where the AP and PP SNBRs were ranked highly. Patients with early-onset LID tend to have more reduced DAT availability in the putamen and the sensorimotor striatum, as shown in previous studies11,30. Baseline DAT imaging in de novo, drug-naive patients with PD has been studied in similar settings using [123I]FP-CIT SPECT from the PPMI public dataset24 and [18F]FP-CIT PET from a local dataset in South Korea12,25,26. In the PPMI dataset, patients who developed LID had lower putaminal specific binding ratios (SBRs) at baseline and a higher rate of decrease in putaminal SBRs over 2–4 years24. Similarly, lower DAT availability in the putamen or sensorimotor striatum was associated with LID in the local dataset12,25,26.
Importantly, these previous works on DAT imaging used DAT levels to compare patients with and without LID, whereas we used the original PET images, rather than SNBRs, to distinguish the two groups and showed that our model was trained in a direction consistent with prior findings.
Our three CNN models were trained for binary classification of patients with and without LID using a 5-year threshold, but LID occurrence can also be considered a time-to-event problem in survival analysis. Building on a previous work that used retinal photographs to predict coronary artery calcium scores for cardiovascular risk stratification31, we applied our CNN models’ predicted probabilities of being classified into the wLID group, DeepScores, for survival analysis using Cox regression models. Although Cox regression models trained with SNBRs, with or without clinical variables, tended to overfit to the training sets, leading to low C-indices in the test sets, those trained with DeepScores showed solid performance with significant improvements in C-indices in at least four out of the five test sets. Using another Cox regression model with time-varying covariates and time-varying coefficients, we also demonstrated that DeepScores remained a significant risk factor for LID even after adjusting for LEDD and its interaction term, both derived from prescription records up to the most recent one before LID occurrence. In contrast, none of the SNBRs showed a significant association. Cox regression analyses in previous studies on LID prediction relied on multiple imaging and clinical variables7,23,25,26, whereas we used the CNN model’s predicted probabilities to demonstrate that the model automatically learns time-to-event information during binary classification. Furthermore, similar to the substantial variability observed in AUROCs and C-indices across different test sets, the p-values of LEDD, its interaction term, and DeepScore from Cox models also varied considerably among them. In one of the five test sets, none of these variables showed a significant association with LID. In another test set, the p-values of LEDD and its interaction term fluctuated depending on which of the three CNN models the DeepScore was derived from.
This finding suggests that other factors not included in this study, or currently unknown factors, may have contributed to early LID occurrence in the patients in these test sets, highlighting the need for further research to explore new LID risk factors.
Our study has several key strengths. First, we used PET images and readily obtainable clinical variables from the initial assessment as model inputs. Although LEDD and duration of levodopa treatment are the strongest predictors of LID7,8,28, we showed that baseline features alone without dopaminergic medication history can be used to predict LID onset. Although our model’s AUROCs were lower than those of a previous study that used ML models with the public PPMI dataset, that study employed clinical information from the last visit before LID onset or the latest available record for patients without LID28. Importantly, their input features included LEDD and duration of levodopa treatment. Another prior study using ML models designed for survival analysis predicted LID using non-imaging clinical features across multiple cohorts27. Although their reported AUROCs were comparable to ours, their models required a wide variety of detailed clinical inputs, including UPDRS Parts 1, 2, and 3 scores, autonomic function tests, activities of daily living, and genetic mutations, which may not be easily obtainable in typical clinical settings27.
Second, our CNN models were trained in an end-to-end manner, whereas the ML models required SNBR calculation for each ROI. While some ML models showed comparable performance to ours, using SNBRs as image features has drawbacks such as inaccuracies arising from coregistration failure between ROI templates and native images. As PET or T1-weighted magnetic resonance imaging (T1 MRI) scans of patients with PD often deviate from those of normal controls, such as in cases of cortical atrophy, coregistering these images to brain templates built from normal controls can result in low coregistration quality. While brain templates specifically developed from images of patients with PD are available32, they may still struggle to represent the diverse neuropathological findings within this population. Manual labeling of striatal regions by human experts can provide accurate SNBRs, but labeling is often very costly and time-consuming. In comparison, our CNN models can be more readily applied in clinical settings without the need for prior image coregistration and SNBR calculation.
Third, the probability scores generated by the CNN models, referred to as DeepScores, may serve as a supplemental risk factor for predicting LID susceptibility. Although the advantage of DeepScores over SNBRs was not evident in the binary classification setting, DeepScores significantly improved survival prediction in most test sets. The fact that DeepScores were significantly associated with LID after adjusting for both medication and clinical factors, including major known risk factors such as age at symptom onset, age at PD diagnosis, and motor-symptom severity7,8,9,10, implies that DeepScore may represent a potential independent risk factor for LID. However, for DeepScore to be used for predicting LID onset, further efforts are needed to improve model performance, including investigation of additional imaging and clinical biomarkers related to LID.
This study has several important limitations. First, it was conducted retrospectively through medical record reviews rather than prospective monitoring of LID onset. As patients visited the movement clinic every 3–6 months, the exact timing of LID onset, which may have occurred between visits, could not be precisely determined. Additionally, since most visits occurred during regular outpatient hours, LID episodes in the early morning or late at night may have been missed. Given the retrospective design, we defined LID onset based on direct clinical observation rather than patient reports, which could be confounded by other movement symptoms such as tremor or dystonia. However, this approach may have led to misclassification, with some patients who experienced LID outside clinic visits potentially labeled as woLID despite developing LID within five years. As a result, both training and test datasets may have been biased toward the woLID group. Future studies should consider alternative data collection strategies, including more frequent follow-ups, wearable devices for dyskinesia detection, or in-home video monitoring. Sensitivity analysis using multiple LID onset date estimations within visit intervals could have helped assess the impact of timing discrepancies. However, in our relatively small dataset and binary classification setting, fewer than 2% of patients changed group classification under different estimation criteria, suggesting a minimal impact on the overall study results.
Second, although PET images were acquired as part of the initial assessment, some patients were already taking levodopa at the time of the scan. This occurred when movement specialists, confident in the PD diagnosis, prescribed levodopa at the initial visit, and PET scans could not be scheduled on the same day. To account for this, we limited the maximum medication period to 3 months.
Third, a 5-year threshold was used for binary classification of patients. Although the time interval between the initiation of levodopa and LID occurrence was initially calculated in days, which could have supported the development of a DL regression model, we determined that these recorded intervals might not be accurate for several reasons, as noted in the first limitation. Therefore, we chose to reformulate the task as a binary classification problem. We experimentally selected an appropriate threshold for categorization through ablation studies with 3-year, 5-year, and 8-year cutoffs. The 3-year and 8-year thresholds resulted in more severe class imbalances, leading models to predominantly predict the majority class rather than learning relevant features to differentiate between the wLID and woLID groups. In the absence of an established threshold for classifying patients based on LID onset, we selected the 5-year threshold, which yielded a 1:2 ratio between the wLID and woLID groups, consistent with proportions reported in a previous study using the PPMI database7. For future work, we plan to develop deep survival models which predict time-to-event rather than classification probabilities. Given that our CNN models optimized for binary classification already capture features related to survival, deep survival models could lead to more accurate predictions.
Fourth, our dataset consisted of 402 patients, which is relatively small for developing DL models. To enhance model robustness, we employed a five-fold cross-validation (5CV) strategy on each of five data configurations, which required 25 training runs (5 folds × 5 configurations) per model. This study design ensured that every image was included in a hold-out test set exactly once, minimizing bias in performance evaluation. Despite this effort, AUROCs varied across the five test sets (Fig. 2), highlighting the need for larger sample sizes to achieve more stable results. Similarly, in the survival analyses, the SDs of the C-indices were wide, and the 95% CIs of C-index differences between models varied by test set. Due to the limited dataset size, we were restricted to five data configurations, yielding five AUROCs or five C-indices per model. Consequently, paired t-tests comparing these model metrics were based on these five values, which may have limited the statistical power of the tests. Also, this is a single-center study without external test sets. Collecting a sufficient number of patients from other centers was challenging because patients had to be regularly seen by movement specialists for periods greater than 5 years starting from the initial visit. We recognize the importance of external validation for model generalizability and plan to collaborate with multiple centers to gather PET images for this purpose.
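The limited power of a paired t-test on five values can be illustrated with a short sketch using only the Python standard library. The function name is hypothetical and this is not the analysis code of this study. With five paired values there are four degrees of freedom, so the absolute t statistic must exceed approximately 2.776 to reach two-sided significance at the 0.05 level.

```python
import math
import statistics


def paired_t_statistic(metric_a, metric_b):
    """Return (t statistic, degrees of freedom) for paired samples.

    metric_a, metric_b: per-test-set metrics (e.g. AUROCs) of two models,
    listed in the same test-set order. Raises ZeroDivisionError if all
    paired differences are identical.
    """
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1
```

Because the threshold is this high with four degrees of freedom, a consistent but modest difference between two models can easily fail to reach significance when the per-configuration metrics vary as much as they do here.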
In conclusion, we developed a CNN model with MTL for binary classification of patients with PD into those who did or did not develop LID within 5 years of levodopa treatment, using baseline [18F]FP-CIT PET images. Moving one step further from statistical comparison of SNBRs between patients vulnerable or resistant to LID, we evaluated the power of DAT PET imaging to predict the onset of LID. Although the model performance is not yet sufficient for direct clinical application, this work represents one of the first attempts to evaluate the significance of DAT PET imaging as a predictor of motor prognosis in patients with PD using DL. Future efforts will aim to integrate additional clinical information and other brain imaging modalities to improve prediction accuracy.
Methods
Participants
Patients with parkinsonism who visited our movement-disorder clinic from January 2005 to March 2022 were retrospectively enrolled. The inclusion criteria were as follows: (1) PD diagnosis, based on the United Kingdom PD Society Brain Bank criteria, by two movement specialists (SJC and SJ)33; (2) PD-related [18F]FP-CIT PET findings visually confirmed by nuclear medicine specialists34; (3) both PET and MRI scans performed at our center; (4) less than 5-year interval from motor-symptom onset to PET scan acquisition; (5) either levodopa-naive or levodopa initiated within 3 months; (6) regular follow-up (at 3–6 months) at our clinic; and (7) LID presentation or absence during 5-year regular follow-up. LID was defined as dyskinesia of the face, head, extremities or trunk, identified by movement specialists during regular follow-up. The exclusion criteria were as follows: (1) > 5-year interval between PET and MRI scan acquisition dates; (2) significant PET image artifacts; (3) ischemic striatal lesions on PET or MRI; (4) PET scans > 1.5 mm slice thickness; (5) unsatisfactory PET or MRI image-preprocessing output; (6) use of a different PET scanner; and (7) motor-symptom onset age ≤ 40. Participants were assigned to two groups: those with LID within 5 years after levodopa initiation (wLID) and those without LID for > 5 years after levodopa initiation (woLID; those whose LID onset occurred beyond 5 years or who were LID-free at > 5-year follow-up). We selected a 5-year threshold, which yielded a manageable data balance of 1:2 between the wLID and woLID groups for CNN model training.
Clinical features
At the visit preceding levodopa initiation, we collected baseline demographic information and motor status, including sex, age at motor-symptom onset, H&Y scale, and specific UPDRS Part 3 items (tremor at rest, upper-extremity rigidity, finger tapping, and leg agility). Missing values for the H&Y scale and UPDRS Part 3 items, each accounting for less than 10%, were imputed using the median value of the respective variable. Using previous conversion rules35, LEDD and cumulative LED were calculated from prescription records, and annual dosages were compared between the wLID and woLID groups. For the wLID group, patients who developed LID within a given year were excluded from the LEDD and cumulative LED calculations for the following year. For example, when calculating second-year values, patients who developed LID during the first year were excluded.
PET and MRI acquisition
N-(3-fluoropropyl)-2β-carboxymethoxy-3β-(4-iodophenyl) nortropane ([18F]FP-CIT) was synthesized using a published protocol36; 180 min after an intravenous injection of 185 MBq [18F]FP-CIT, PET scans were performed for 10 min with a Biograph TruePoint 40 scanner (Siemens, Knoxville, TN, USA), which provides an in-plane spatial resolution of 2.0 mm full-width at half maximum at the center of the field of view. Before PET imaging, a low-dose brain computed tomography scan was conducted (120 kVp, 20 mAs, with 1.5 mm slice thickness) to facilitate image fusion and attenuation correction. PET scans were acquired in three-dimensional mode and reconstructed using the TrueX algorithm, with all-pass filters applied to matrices of 336 × 336.
Brain MRI T1 images were obtained in the axial orientation, with parameters reported as median (interquartile ranges, IQR) values to account for variations due to the retrospective nature of the study: TR 9.9 ms (9.8, 450.2), TE 4.6 ms (4.6, 10.0), flip angle 9.0° (8.0, 70.0), x, y-voxel spacing 0.5 × 0.5 mm (0.4, 0.5), slice thickness 5.0 mm (3.0, 5.0), and spacing between slices 7.0 mm (3.0, 7.0). Seven scanners with magnetic field strengths of 1.5 T and 3.0 T from three vendors were used: Achieva 1.5 T and 3.0 T, Ingenia 3.0 T (Philips), Magnetom Avanto 1.5 T, Skyra 3.0 T (Siemens), Signa Architect 3.0 T, and Signa CV/i 1.5 T (General Electric). As T1 images were used exclusively for registering PET images to the Montreal Neurological Institute (MNI) template, with no analysis conducted on the T1 images themselves, no restrictions were placed on the MRI acquisition parameters.
PET image preprocessing and quantification
PET and MRI T1 images were converted from DICOM to NIfTI format and skull-stripped using SynthStrip37 and HD-BET38, respectively. After bias-field correction with light regularization and a 60-mm full-width at half maximum (FWHM) cutoff, T1 images were coregistered with the corresponding PET images and spatially normalized to the MNI template using SPM12 (Statistical Parametric Mapping, the Wellcome Trust Centre for Neuroimaging) running on MATLAB R2022b. The inverse deformation map obtained from this normalization was applied to the following ROIs in the MNI space: VS, AC, PC, AP, and PP from the Oxford-GSK-Imanova structural atlas39, with the anterior commissure dividing the anterior and posterior regions; limbic, sensorimotor, and executive striatum from the Oxford-GSK-Imanova striatal connectivity atlas40; and bilateral calcarine cortices from the Automated Anatomical Labelling Atlas 3 (ref. 41). The ROIs, mapped to the native PET space, were used to calculate the SNBR as follows: SNBR = (mean uptake value (UV) of bilateral ROIs − mean UV of bilateral calcarine cortices) / (mean UV of bilateral calcarine cortices).
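As an illustration of the SNBR definition above, the following minimal Python sketch computes the ratio from a PET volume and two boolean masks. The function name, the array layout, and the use of a single bilateral mask per region (rather than averaging left and right means separately) are assumptions made for this example, not details taken from the study code.

```python
import numpy as np

def snbr(pet, roi_mask, reference_mask):
    """Specific/nonspecific binding ratio of one ROI against a reference region.

    pet            : 3D array of uptake values in native PET space
    roi_mask       : boolean mask of the bilateral striatal ROI (assumed layout)
    reference_mask : boolean mask of the bilateral calcarine cortices
    """
    roi_mean = pet[roi_mask].mean()
    ref_mean = pet[reference_mask].mean()
    return (roi_mean - ref_mean) / ref_mean

# Toy volume: reference uptake of 1.0 and ROI uptake of 3.0
pet = np.ones((4, 4, 4))
roi = np.zeros(pet.shape, dtype=bool)
ref = np.zeros(pet.shape, dtype=bool)
roi[:2] = True
ref[2:] = True
pet[roi] = 3.0
toy_snbr = snbr(pet, roi, ref)  # (3.0 - 1.0) / 1.0
```

In this toy case the ROI takes up three times as much tracer as the reference region, so the ratio equals 2.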
For DL model training and testing, PET images were intensity normalized by dividing all voxel values by the mean UV of the bilateral calcarine cortices of each image. To concentrate on the striatum and reduce computational resources, we cropped the PET images to a uniform size (96 × 64 × 64), centered on the nonzero region of the Oxford-GSK-Imanova structural atlas.
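The normalization and cropping step can be sketched as follows. This is a hedged example: the function name, the use of the bounding-box center of the nonzero atlas voxels as the crop center, the axis order of the 96 × 64 × 64 crop, and zero-padding where the crop would leave the volume are all assumptions for illustration.

```python
import numpy as np

def normalize_and_crop(pet, atlas, calcarine_mask, size=(96, 64, 64)):
    """Scale a PET volume by its mean calcarine uptake and crop around the atlas."""
    # Intensity normalization: divide every voxel by the mean reference uptake
    scaled = pet / pet[calcarine_mask].mean()

    # Center of the bounding box of the nonzero atlas region (assumed definition)
    idx = np.argwhere(atlas > 0)
    center = (idx.min(axis=0) + idx.max(axis=0) + 1) // 2

    out = np.zeros(size, dtype=scaled.dtype)
    src, dst = [], []
    for c, s, n in zip(center, size, scaled.shape):
        start = int(c) - s // 2
        lo, hi = max(start, 0), min(start + s, n)
        src.append(slice(lo, hi))                  # region read from the volume
        dst.append(slice(lo - start, hi - start))  # where it lands in the crop
    out[tuple(dst)] = scaled[tuple(src)]
    return out

# Synthetic stand-ins for a PET volume, a striatal atlas, and a calcarine mask
rng = np.random.default_rng(0)
pet = rng.random((120, 100, 90)) + 1.0
atlas = np.zeros(pet.shape, dtype=int)
atlas[50:70, 40:60, 30:50] = 1
calcarine = np.zeros(pet.shape, dtype=bool)
calcarine[10:20, 10:20, 10:20] = True
cropped = normalize_and_crop(pet, atlas, calcarine)
```

Cropping around the atlas rather than around the image center keeps the striatum in the same position in every input, regardless of head placement in the scanner.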
Implementation of the DL model
The training dataset configurations and model architecture are depicted in Figs. 6 and 7, respectively. PET images were randomly divided into five subsets, stratified by wLID, sex, and age at motor-symptom onset. With each subset serving in turn as a hold-out test set, the remaining subsets were subdivided into five folds for cross-validation. The models trained on these five folds were then ensembled by averaging their predicted probabilities of wLID to generate the DeepScore, which was used for binary classification and for survival analysis. Therefore, all images in the five hold-out test sets had corresponding DeepScores and binary classification results. Because every image in the dataset was included in a hold-out test set exactly once, this approach maximized data utilization and avoided bias from selecting a specific data configuration.
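The nested data configuration described above can be sketched with scikit-learn's StratifiedKFold. The helper names, the composite stratification label, and the binarization of age at onset are assumptions for this example; the study's exact stratification scheme is not specified here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def nested_splits(strata, n_outer=5, n_inner=5, seed=0):
    """Yield (test_idx, [(train_idx, val_idx), ...]) for each hold-out test set."""
    strata = np.asarray(strata)
    outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=seed)
    for dev_idx, test_idx in outer.split(np.zeros(len(strata)), strata):
        inner = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=seed)
        # Map positions within the development set back to dataset indices
        folds = [(dev_idx[tr], dev_idx[va])
                 for tr, va in inner.split(dev_idx, strata[dev_idx])]
        yield test_idx, folds

def deep_score(fold_probabilities):
    """Average the per-fold wLID probabilities (shape: folds x patients)."""
    return np.mean(fold_probabilities, axis=0)

# Hypothetical composite label: wLID status, sex, and a binarized age at onset
rng = np.random.default_rng(0)
wlid, sex, young = (rng.integers(0, 2, 400) for _ in range(3))
strata = wlid * 4 + sex * 2 + young

for test_idx, folds in nested_splits(strata):
    # One model per inner fold would be trained here, then applied to test_idx
    pass
```

Each patient appears in exactly one hold-out test set, and the five fold models for that test set never see it during training or validation.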
Fig. 7: Model architecture. The Shared Encoder output serves as input for both the Reconstruction Decoder and Classification Heads. The Reconstruction Decoder rebuilds the cropped PET image, while Classification Heads I and II predict the probability of wLID. Numbers next to the blocks indicate the number of channels × voxels (y-axis × x-axis × z-axis).
We developed a three-dimensional (3D) CNN model that comprised a Shared Encoder with four residual blocks, a Reconstruction Decoder, and two Classification Heads. The model used MTL of classification and reconstruction: reconstructing the original input image concurrently with binary classification was intended to enhance the generalization of the feature representation used for classification. Classification Head I predicts the probability of classification into the wLID group using image-only features from the bottleneck layer, whereas Classification Head II additionally incorporates clinical features. We used three structural combinations: (1) Shared Encoder + Classification Head I (image-only single-task), (2) Shared Encoder + Reconstruction Decoder + Classification Head I (image-only MTL), and (3) Shared Encoder + Classification Head II (image-clinical variable MTL). Models (1) and (2) were trained from scratch, whereas model (3) used the pretrained weights from model (2) for the Shared Encoder, which were fixed during training. Implementation details are provided in Supplementary Table S5.
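For orientation, a joint objective of the kind commonly used when a classification head and a reconstruction decoder share an encoder can be written as below. The specific loss terms and their weighting in this study are given in Supplementary Table S5 and are not reproduced here; the symbols, including the weight λ, are assumptions for illustration only.

```latex
\mathcal{L}_{\mathrm{MTL}}
  = \mathcal{L}_{\mathrm{cls}}\left(y, \hat{p}\right)
  + \lambda \, \mathcal{L}_{\mathrm{rec}}\left(x, \hat{x}\right)
```

Here y is the wLID label, \hat{p} the predicted probability from Classification Head I, x the cropped input PET image, and \hat{x} the output of the Reconstruction Decoder. Setting λ = 0 would correspond to the image-only single-task configuration.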
In Fig. 7, the 18 clinical variables included sex, age at onset, binary indicator for age at motor-symptom onset < 60 years, time from motor symptom onset to PET acquisition (in 3-month intervals), H&Y scale, UPDRS Part 3 scores for tremor at rest (right/left arm and leg), rigidity (right/left arm), finger taps (right/left), and leg agility (right/left), and sum of the scores of tremor at rest, rigidity, and bradykinesia (finger taps, leg agility). All clinical variables, except sex (encoded as 0 or 1), were min-max normalized to a range of 0–2 rather than 0–1, based on ablation study results showing improved model performance with this range. This decision also considered the value distribution of the Shared Encoder’s output vector (i.e., input vector to the classification head), which was to be concatenated with the clinical variable vector to generate the final classification predictions. The same normalization method was applied when building ML models.
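The 0–2 min-max scaling can be sketched as follows. Taking the minimum and maximum from the training data only, and the function name, are assumptions for this example.

```python
import numpy as np

def min_max_scale(train, values, upper=2.0):
    """Scale values to [0, upper] using the column-wise range of the training data."""
    train = np.asarray(train, dtype=float)
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (np.asarray(values, dtype=float) - lo) / span * upper

# Two hypothetical variables (e.g., H&Y scale and one UPDRS item) for three patients
train = np.array([[1.0, 0.0], [2.0, 2.0], [3.0, 4.0]])
scaled = min_max_scale(train, train)
```

With upper=2.0, the smallest training value of each variable maps to 0 and the largest to 2; sex would be left as a 0/1 indicator and not passed through this function.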
ML models
Utilizing the same data configuration as in CNN models, LR, RF, and XGBoost models were used to predict the probability of wLID, with 5CV for hyperparameter tuning via grid search; the fitted model was evaluated on the corresponding test set. The models were trained in two settings: (1) using only SNBRs of the five ROIs, and (2) using both SNBRs and clinical features. For the ML models, the input clinical variables included sex, age at onset, time from motor-symptom onset to PET acquisition (in 3-month intervals), H&Y scale, and UPDRS Part 3 scores for tremor at rest (right + left arm, right + left leg), rigidity (right + left arm), finger taps (right + left), and leg agility (right + left). Compared to using individual scores, summing these scores enhanced model interpretability and generated higher mAUROC values. The models were implemented using Scikit-learn v1.0.2, SciPy v1.7.3, and XGBoost v1.6.2.
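A minimal sketch of the tuning procedure for the LR model is shown below, using scikit-learn's GridSearchCV. The synthetic data, the split sizes, and the hyperparameter grid are placeholders and do not reflect the grids used in the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for five SNBR features and a binary wLID label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)
X_dev, y_dev = X[:160], y[:160]      # development data (tuned with 5-fold CV)
X_test, y_test = X[160:], y[160:]    # hold-out test set

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # placeholder grid
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_dev, y_dev)  # refits the best setting on all development data

# Evaluate the refitted model once on the hold-out test set
test_auroc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
```

The same pattern applies to RF and XGBoost by swapping the estimator and the parameter grid.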
Explainable artificial intelligence
To visualize the image and clinical features used for discrimination, we employed activation maps for the image-only MTL model, and SHAP values for the ML models. We generated activation maps from the second-to-last convolution layer of the Shared Encoder by averaging features channel-wise, applying a sigmoid function, and interpolating to the input-image size. For each hold-out test set, the activation maps from five folds were averaged using the respective models trained for each test set. Subsequently, the mean activation values of the ROIs from the Oxford-GSK-Imanova structural atlas39 and the Oxford-GSK-Imanova striatal connectivity atlas40 were calculated by averaging voxel-wise activation values within each ROI for PET images that were correctly classified by the model. SHAP values were calculated for each test set, and the rankings of each feature were averaged across the five test sets to determine overall feature importance. SHAP package v0.42.1 (https://github.com/shap/shap/releases/tag/v0.42.1) was used, with the Explainer function for LR and the TreeExplainer function for RF and XGBoost.
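The three activation-map steps (channel-wise averaging, sigmoid, interpolation to the input size) can be sketched with NumPy and SciPy. The feature-map shape and the use of trilinear interpolation are assumptions for this example; the interpolation method used in the study is not stated here.

```python
import numpy as np
from scipy.ndimage import zoom

def activation_map(features, out_shape):
    """Turn a (channels, d1, d2, d3) feature tensor into a map of shape out_shape."""
    mean_map = features.mean(axis=0)            # average over channels
    squashed = 1.0 / (1.0 + np.exp(-mean_map))  # sigmoid, values in (0, 1)
    factors = [o / s for o, s in zip(out_shape, squashed.shape)]
    # order=1 is linear interpolation along each axis (assumed method)
    return zoom(squashed, factors, order=1, mode="nearest")

# Hypothetical feature tensor: 4 channels on a 12 x 8 x 8 grid
features = np.random.default_rng(0).normal(size=(4, 12, 8, 8))
amap = activation_map(features, (96, 64, 64))
```

Averaging such maps across the five fold models and then within each ROI mask gives one mean activation value per ROI and image.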
Survival analysis
Besides classifying patients into the wLID and woLID groups, we evaluated the DL model’s ability to predict LID-free survival to the last available follow-up dates. To compare survival distributions, Kaplan–Meier curves and log-rank tests with Bonferroni correction were performed on the five test sets. Using DeepScores from the three CNN models, SNBRs, and clinical variables of the ML models, we calculated the C-indices of the Cox regression models. The proportional hazards assumption was checked using Schoenfeld residuals, considering both p-values and residual plots. Intervariable multicollinearity was assessed using a variance inflation factor threshold of 10, and Pearson correlation coefficient thresholds of 0.8 and 0.9 for clinical variables and SNBRs, respectively. The 95% CIs for the differences in the C-indices of the Cox models were obtained via 1000 bootstrap iterations on the test sets to compare model performances, as previously described31,42. As in the DL and ML models, all analyses used the same data configuration for training and testing Cox models.
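The C-index comparison can be illustrated with a small NumPy sketch of Harrell's concordance index and a percentile bootstrap for the difference between two risk scores. The study's analyses were run in R; this Python version, its function names, its handling of tied times (ignored), and the percentile interval are assumptions for illustration.

```python
import numpy as np

def c_index(time, event, risk):
    """Harrell's C: share of comparable pairs in which higher risk fails earlier."""
    time, event, risk = map(np.asarray, (time, event, risk))
    concordant, comparable = 0.0, 0
    for i in np.flatnonzero(event):
        later = time > time[i]  # patients still LID-free after patient i's event
        comparable += later.sum()
        concordant += (risk[i] > risk[later]).sum()
        concordant += 0.5 * (risk[i] == risk[later]).sum()  # ties in risk count half
    return concordant / comparable

def bootstrap_c_index_diff(time, event, risk_a, risk_b, n_boot=1000, seed=0):
    """Percentile 95% CI for C(risk_a) - C(risk_b) over resampled patients."""
    time, event, risk_a, risk_b = map(np.asarray, (time, event, risk_a, risk_b))
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(time), len(time))
        diffs.append(c_index(time[idx], event[idx], risk_a[idx])
                     - c_index(time[idx], event[idx], risk_b[idx]))
    return np.percentile(diffs, [2.5, 97.5])

# Synthetic example: score A carries signal, score B is noise
rng = np.random.default_rng(1)
score_a = rng.normal(size=100)
score_b = rng.normal(size=100)
time = np.exp(-score_a + rng.normal(size=100))
event = rng.random(100) < 0.7
ci_low, ci_high = bootstrap_c_index_diff(time, event, score_a, score_b, n_boot=200)
```

An interval that excludes zero indicates that the two scores differ in discriminative ability on that test set.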
For additional survival analysis, we evaluated the effect of medication on LID occurrence by fitting Cox regression models using all five test sets combined. These models included the same clinical variables as above, and either SNBRs or DeepScores added as covariates. Since LEDD and cumulative LED did not meet the proportional hazards assumption and varied over time, we applied a Cox model with time-varying covariates and time-varying coefficients. This model incorporated two interaction terms, LEDD × log(time) and cumulative LED × log(time), based on the patterns observed in Schoenfeld residual plots. All analyses were conducted in R (version 4.3.1) using the survival package v3.5-5.
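One way to write a Cox model with these two interaction terms is shown below. The notation is an assumption for illustration: z collects the time-fixed covariates (clinical variables plus either SNBRs or the DeepScore), x_1(t) is LEDD, and x_2(t) is cumulative LED at time t. Whether main effects for both medication variables were retained alongside the interactions is also assumed.

```latex
h\bigl(t \mid \mathbf{z}, x_1(t), x_2(t)\bigr)
  = h_0(t)\,\exp\!\Bigl(
      \boldsymbol{\beta}^{\top}\mathbf{z}
      + \bigl(\gamma_1 + \delta_1 \log t\bigr)\,x_1(t)
      + \bigl(\gamma_2 + \delta_2 \log t\bigr)\,x_2(t)
    \Bigr)
```

In this form the log hazard ratio per unit of LEDD is \gamma_1 + \delta_1 \log t, so a nonzero \delta_1 means that the effect of LEDD changes over follow-up rather than remaining proportional.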
Model evaluation and statistical analysis
Baseline demographics, clinical features, and SNBRs are presented as frequency (proportion) for categorical variables and as mean (SD) or median (IQR) for continuous variables, depending on whether they were normally distributed. The normality of variables was assessed using the Shapiro–Wilk test and by visually inspecting data distribution through QQ plots and histograms. Chi-square and Fisher’s exact tests were used for categorical variables, whereas Welch’s t-test, Student’s t-test, and the Mann–Whitney U test were applied for continuous variables as appropriate. We evaluated the performance of the CNN and ML models on the five test sets using accuracy, sensitivity, specificity, F1 score, and AUROC; the mean (SD) of each metric was calculated across the five test sets, and mAUROC was used to compare model performances. A paired t-test compared the mAUROC values between models, and an independent t-test compared the mean activation values of ROIs between model-predicted wLID and woLID groups. The paired t-test was chosen over bootstrapping to follow previous work43 and to avoid the inference time required for bootstrapping. A two-sided significance level of 0.05 was used for all tests.
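The threshold-based metrics and the paired comparison across test sets can be sketched as follows. The function name is hypothetical, and the AUROC values in the example are made-up placeholders, not results from this study.

```python
import numpy as np
from scipy.stats import ttest_rel

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and F1 from binary labels (1 = wLID)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    # Assumes both classes are present and at least one positive prediction or label
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

metrics = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])

# Placeholder AUROCs of two models on the same five test sets (illustrative only)
auroc_model_a = np.array([0.70, 0.65, 0.68, 0.62, 0.66])
auroc_model_b = np.array([0.64, 0.63, 0.61, 0.60, 0.65])
t_stat, p_value = ttest_rel(auroc_model_a, auroc_model_b)
```

The test is paired because both models are evaluated on identical test sets, so each pair of AUROC values shares the same patients.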
Ethics approval
This cross-sectional study was approved by the institutional review board at the Asan Medical Center (IRB 2022-0614). The study protocol conforms to the ethical guidelines of the 1975 Declaration of Helsinki. Written informed consent from participants was waived by the ethics committee at the Asan Medical Center because of the retrospective nature of the study.
Use of large language models
We received assistance from the large language model, ChatGPT-4 (https://chat.openai.com/), developed by OpenAI (https://openai.com/), solely for grammatical revisions and text clarification. No new information or content was generated by ChatGPT.
Data availability
The data utilized in this study are not publicly accessible due to patient privacy concerns. However, requests to access the data may be considered upon contact with the corresponding authors.
Code availability
Source codes for preprocessing and model implementation will be openly accessible to the public at https://github.com/mi2rl/LIDprediction.
References
1. GBD 2016 Parkinson’s Disease Collaborators. Global, regional, and national burden of Parkinson’s disease, 1990–2016: a systematic analysis for the Global Burden of Disease Study 2016. Lancet Neurol. 17, 939–953 (2018).
2. Tanner, C. M. & Ostrem, J. L. Parkinson’s Disease. N. Engl. J. Med 391, 442–452 (2024).
3. Bloem, B. R., Okun, M. S. & Klein, C. Parkinson’s disease. Lancet 397, 2284–2303 (2021).
4. Calabresi, P., Di Filippo, M., Ghiglieri, V., Tambasco, N. & Picconi, B. Levodopa-induced dyskinesias in patients with Parkinson’s disease: filling the bench-to-bedside gap. Lancet Neurol. 9, 1106–1117 (2010).
5. Pechevis, M. et al. Effects of dyskinesias in Parkinson’s disease on quality of life and health-related costs: a prospective European study. Eur. J. Neurol. 12, 956–963 (2005).
6. Ahlskog, J. E. & Muenter, M. D. Frequency of levodopa-related dyskinesias and motor fluctuations as estimated from the cumulative literature. Mov. Disord. 16, 448–458 (2001).
7. Eusebi, P. et al. Risk factors of levodopa-induced dyskinesia in Parkinson’s disease: results from the PPMI cohort. NPJ Parkinsons Dis. 4, 33 (2018).
8. Kelly, M. J. et al. Predictors of motor complications in early Parkinson’s disease: A prospective cohort study. Mov. Disord. 34, 1174–1183 (2019).
9. Warren Olanow, C. et al. Factors predictive of the development of Levodopa-induced dyskinesia and wearing-off in Parkinson’s disease. Mov. Disord. 28, 1064–1071 (2013).
10. Grandas, F., Galiano, M. L. & Tabernero, C. Risk factors for levodopa-induced dyskinesias in Parkinson’s disease. J. Neurol. 246, 1127–1133 (1999).
11. Labrador-Espinosa, M. A. et al. Levodopa-Induced Dyskinesia in Parkinson Disease Specifically Associates With Dopaminergic Depletion in Sensorimotor-Related Functional Subregions of the Striatum. Clin. Nucl. Med 46, e296–e306 (2021).
12. Yoo, H. S. et al. Presynaptic dopamine depletion determines the timing of levodopa-induced dyskinesia onset in Parkinson’s disease. Eur. J. Nucl. Med Mol. Imaging 45, 423–431 (2018).
13. Wang, J. et al. Diagnostic performance of artificial intelligence-assisted PET imaging for Parkinson’s disease: a systematic review and meta-analysis. NPJ Digit Med 7, 17 (2024).
14. Zhao, Y. et al. Decoding the dopamine transporter imaging for the differential diagnosis of parkinsonism using deep learning. Eur. J. Nucl. Med Mol. Imaging 49, 2798–2811 (2022).
15. Salmanpour, M. R. et al. Machine learning methods for optimal prediction of motor outcome in Parkinson’s disease. Phys. Med 69, 233–240 (2020).
16. Tang, J. et al. Artificial Neural Network-Based Prediction of Outcome in Parkinson’s Disease Patients Using DaTscan SPECT Imaging Features. Mol. Imaging Biol. 21, 1165–1173 (2019).
17. Leung, K. H., Rowe, S. P., Pomper, M. G. & Du, Y. A three-stage, deep learning, ensemble approach for prognosis in patients with Parkinson’s disease. EJNMMI Res 11, 52 (2021).
18. Salmanpour, M. R., Shamsaei, M. & Rahmim, A. Feature selection and machine learning methods for optimal identification and prediction of subtypes in Parkinson’s disease. Comput Methods Prog. Biomed. 206, 106131 (2021).
19. Zhou, Y. et al. Multi-task learning for segmentation and classification of tumors in 3D automated breast ultrasound images. Med Image Anal. 70, 101918 (2021).
20. Malhotra, A. et al. Multi-task driven explainable diagnosis of COVID-19 using chest X-ray images. Pattern Recognit. 122, 108243 (2022).
21. Amyar, A., Modzelewski, R., Li, H. & Ruan, S. Multi-task deep learning based CT imaging analysis for COVID-19 pneumonia: Classification and segmentation. Comput Biol. Med 126, 104037 (2020).
22. Zhao, Y., Wang, X., Che, T., Bao, G. & Li, S. Multi-task deep learning for medical image computing and analysis: A review. Comput Biol. Med 153, 106496 (2023).
23. Chung, S. J. et al. The Pattern of Striatal Dopamine Depletion as a Prognostic Marker in De Novo Parkinson Disease. Clin. Nucl. Med 43, 787–792 (2018).
24. Jeong, E. H., Sunwoo, M. K. & Song, Y. S. Serial I-123-FP-CIT SPECT Image Findings of Parkinson’s Disease Patients With Levodopa-Induced Dyskinesia. Front Neurol. 9, 1133 (2018).
25. Chung, S. J. et al. Patterns of striatal dopamine depletion in early Parkinson disease: Prognostic relevance. Neurology 95, e280–e290 (2020).
26. Hong, J. Y. et al. Presynaptic dopamine depletion predicts levodopa-induced dyskinesia in de novo Parkinson disease. Neurology 82, 1597–1604 (2014).
27. Loo, R. T. J. et al. Levodopa-induced dyskinesia in Parkinson’s disease: Insights from cross-cohort prognostic analysis using machine learning. Parkinsonism Relat. Disord. 126, 107054 (2024).
28. Leal, D. A. B., Dias, C. M. V., Ramos, R. P. & Brys, I. Prediction of dyskinesia in Parkinson’s disease patients using machine learning algorithms. Sci. Rep. 13, 22426 (2023).
29. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Info. Proc. Syst. 30, https://doi.org/10.48550/arXiv.1705.07874 (2017).
30. Troiano, A. et al. PET demonstrates reduced dopamine transporter expression in PD with dyskinesias. Neurology 72, 1211–1216 (2009).
31. Rim, T. H. et al. Deep-learning-based cardiovascular risk stratification using coronary artery calcium scores predicted from retinal photographs. Lancet Digit Health 3, e306–e316 (2021).
32. Madge, V. et al. A dataset of multi-contrast unbiased average MRI templates of a Parkinson’s disease population. Data Brief. 48, 109141 (2023).
33. Gibb, W. & Lees, A. The relevance of the Lewy body to the pathogenesis of idiopathic Parkinson’s disease. J. Neurol., Neurosurg. Psychiatry 51, 745–752 (1988).
34. Oh, M. et al. Subregional patterns of preferential striatal dopamine transporter loss differ in Parkinson disease, progressive supranuclear palsy, and multiple-system atrophy. J. Nucl. Med 53, 399–406 (2012).
35. Schade, S., Mollenhauer, B. & Trenkwalder, C. Levodopa Equivalent Dose Conversion Factors: An Updated Proposal Including Opicapone and Safinamide. Mov. Disord. Clin. Pr. 7, 343–345 (2020).
36. Lee, S. J. et al. One-step high-radiochemical-yield synthesis of [18F]FP-CIT using a protic solvent system. Nucl. Med Biol. 34, 345–351 (2007).
37. Hoopes, A., Mora, J. S., Dalca, A. V., Fischl, B. & Hoffmann, M. SynthStrip: skull-stripping for any brain image. NeuroImage 260, 119474 (2022).
38. Isensee, F. et al. Automated brain extraction of multisequence MRI using artificial neural networks. Hum. Brain Mapp. 40, 4952–4964 (2019).
39. Tziortzi, A. C. et al. Imaging dopamine receptors in humans with [11C]-(+)-PHNO: dissection of D3 signal and anatomy. Neuroimage 54, 264–277 (2011).
40. Tziortzi, A. C. et al. Connectivity-based functional analysis of dopamine release in the striatum using diffusion-weighted MRI and positron emission tomography. Cereb. Cortex 24, 1165–1177 (2014).
41. Rolls, E. T., Huang, C. C., Lin, C. P., Feng, J. & Joliot, M. Automated anatomical labelling atlas 3. Neuroimage 206, 116189 (2020).
42. Bae, S. et al. Radiomic MRI Phenotyping of Glioblastoma: Improving Survival Prediction. Radiology 289, 797–806 (2018).
43. Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).
Acknowledgements
This research was supported by the National Research Foundation of Korea (NRF) (RS-2023-00262527); by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant numbers: RS-2023-00265820, HR20C0026); and by a grant of the MD-PhD/Medical Scientist Training Program through the KHIDI, funded by the Ministry of Health & Welfare, Republic of Korea.
Author information
Contributions
G.Y.L. was responsible for conceptualization, methodology, investigation, visualization, funding acquisition, and drafting the initial manuscript. J.W. contributed to the methodology, investigation, and the review and editing of the manuscript. S.K. and J.K. supported the methodology. S.J., J.L., S.L., J.S.K., C.S., and J.S.O. contributed to the conceptualization, methodology, and investigation. N.K. and S.J.C. led the conceptualization, methodology, funding acquisition, project administration, supervision, and the review and editing of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lee, G.Y., Won, J., Kim, S. et al. Baseline [18F]FP-CIT PET-based deep learning prediction of levodopa-induced dyskinesia in Parkinson’s disease. npj Parkinsons Dis. 11, 125 (2025). https://doi.org/10.1038/s41531-025-00982-5
DOI: https://doi.org/10.1038/s41531-025-00982-5