Introduction

Primary vitreoretinal lymphoma (PVRL) is a rare yet potentially aggressive intraocular malignancy that is characterized predominantly by B-cell lymphomas1,2. The estimated annual incidence of PVRL is approximately 50 cases, and there is evidence suggesting an increasing trend in its occurrence3,4. Owing to its infrequency, PVRL is often misdiagnosed or inadequately managed and is frequently mistaken for other intraocular conditions5. This misidentification can result in significant diagnostic delays, sometimes extending up to 21 months from the initial presentation5. The diagnosis of PVRL is further complicated by multiple factors, such as the small volume of vitreous humor, the low cellularity of lymphoma cells—which frequently coexist with inflammatory cells—and the challenges associated with maintaining cellular integrity during sample collection6,7. More importantly, given that PVRL can lead to permanent vision loss and has a high risk of central nervous system (CNS) relapse8, timely and accurate screening is crucial for improving patient outcomes and effectively managing this aggressive malignancy.

Although standardized diagnostic protocols for PVRL exist, the rarity and fragility of lymphoma cells in the vitreous often hinder prompt detection, making timely diagnosis challenging9. For example, (1) cytological examination offers morphological evidence that supports a diagnosis of PVRL1; (2) polymerase chain reaction is utilized to detect monoclonality through rearrangements in immunoglobulin heavy chain or light chain genes, which serve as critical diagnostic indicators10; (3) mutations in genes such as MYD88 and CD79B have been linked to vitreoretinal lymphomas and may improve diagnostic accuracy11,12; and (4) elevated levels of interleukin (IL)-10 relative to IL-6 are also informative, as B-cell lymphomas typically produce high IL-10 levels, making the IL-10/IL-6 ratio a valuable diagnostic marker12,13. However, all of these methods require invasive intraocular sampling, and their effectiveness is constrained by the need to obtain a substantial number of viable cells, which is often limited by the low yield of intact neoplastic cells from small volumes of vitreous fluid6. Moreover, these methods are geared primarily towards diagnosing established disease rather than enabling early screening. Therefore, there is an urgent need for a rapid, accurate, and practical screening method for this malignancy.

The complete blood count (CBC) is one of the most frequently ordered clinical tests across nearly all medical contexts, offering valuable and timely insights into a wide range of disease processes14. Because blood cells continuously interact with various tissues and organs, CBC is a powerful diagnostic tool. Its key advantages include its low cost, accessibility, high consistency, and widespread use in primary healthcare, making it an essential component of routine medical evaluation15. Its application in clinical practice is extensive, and some tests (e.g., lymphocytes, basophils, and hemoglobin) have demonstrated significant diagnostic and prognostic relevance for lymphoma16,17,18,19,20. However, no research has been conducted to screen for PVRL using a CBC.

In this work, we combine CBC parameters with machine learning (ML) algorithms to screen for PVRL. We conduct a multicentre case–control study to develop a machine learning–based screening model for PVRL diagnosis from CBC data, and validate its performance in large-scale clinical cohorts.

Results

The study design is illustrated in Fig. 1A. No significant differences in age or sex were observed between the PVRL and normal control groups across the discovery cohort and validation cohorts 1–3 (Tables S1 and S2, P > 0.05). Approximately 50% of the CBC parameters differed significantly between PVRL patients and controls in both the discovery cohort and validation cohort 1 (P < 0.05, Table S1). Except for the basophil count (P = 0.049), no significant differences were noted between the discovery cohort and validation cohort 1 for key parameters (Table S3).

Fig. 1: Study design and performance of 12 machine learning (ML) models in the discovery cohort.

A Schematic representation of the study design, illustrating the development and validation of a ML model for primary vitreoretinal lymphoma (PVRL) diagnosis using complete blood count parameters. B Receiver operating characteristic (ROC) curves comparing the diagnostic performance of 12 machine learning models on the basis of complete blood count data. C Summary of key performance metrics for the 12 machine learning models, including area under the curve (AUC), sensitivity, and specificity. RF random forest, DT decision tree, GLM generalized linear model, GBM gradient boosting, KNN K nearest neighbor, PDW platelet distribution width, PLCR platelet large cell ratio, HG hemoglobin, CNS central nervous system, PPV positive predictive value, NPV negative predictive value.

Development of screening models based on all features

All 30 CBC features were used to train models across 12 ML algorithms. The random forest (RF), XGBoost, TabNet, decision tree (DT), generalized linear model (GLM), gradient boosting (GBM), LightGBM, naïve Bayes, and AdaBoost models demonstrated superior performance (Table S4; Figs. 1B and S1), with areas under the receiver operating characteristic curve (AUCs) significantly greater than those of the other models (P < 0.05, AUC range for others: 0.42–0.67).

Area under the precision-recall curve (AUPRC) analysis further confirmed the superior performance of these models (Fig. S2). Comprehensive performance metrics, including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and F1 score, are summarized in Fig. 1C and Table S5. On the basis of these results, the top-performing nine algorithms were selected for further model development.

Identification of the final model based on six features

Feature selection was applied in the discovery cohort to enhance clinical applicability. We ranked all 30 features by their mean SHapley Additive exPlanations (SHAP) values and incrementally built models by adding features one by one in descending order. Each model’s performance was compared to the full 30-feature model using the DeLong test. As shown in Fig. 2A, the RF model using the top 6 features achieved an AUC slightly higher than that of the full 30-feature model. In contrast, the model with only the top 5 features showed a significant decrease in AUC (p < 0.05). Furthermore, the RF model with these 6 features outperformed models built using other algorithms regardless of the number of features included.

Fig. 2: Feature selection and final model identification in the discovery cohort.

A Area under the receiver operating characteristic curve (AUC) values for different feature subsets selected using the SHapley Additive exPlanations (SHAP) method, along with feature ranking scores. Source data are provided as a Source data file. B SHAP summary bar plot highlighting the relative importance of features in the random forest (RF) model. C Illustrating the distribution and correlation of the six selected features. D Receiver operating characteristic (ROC) curves demonstrating the diagnostic performance of the RF model when the six selected features were used for primary vitreoretinal lymphoma (PVRL) detection. E Confusion matrix heatmap visualizing the classification performance of the six-feature RF model in diagnosing PVRL. DT decision tree, GLM generalized linear model, GBM gradient boosting, PDW platelet distribution width, PLCR platelet large cell ratio, HG hemoglobin, PLT platelet count, MPV mean platelet volume, HCT hematocrit, RDWSD red blood cell distribution width—standard deviation, RDWCV red blood cell distribution width—coefficient of variation, RBC red blood cell count, WBC white blood cell count, MCHC mean corpuscular hemoglobin concentration, PIV pan-immune inflammation value.

The RF model, incorporating six features, platelet distribution width (PDW), monocyte%, platelet large cell ratio (PLCR), monocyte count, hemoglobin (HG), and basophil count, retained near-optimal discriminatory ability, as reflected by global SHAP analysis (Fig. 2B) and feature distribution/correlation analysis (Fig. 2C).

The 6-feature RF model achieved an AUC of 0.85 (Fig. 2D) and an AUPRC of 0.84, with a PPV of 0.76, NPV of 0.71, accuracy of 0.73, and F1 score of 0.75 (Table S6). The confusion matrix (Fig. 2E) revealed a sensitivity of 0.81 and specificity of 0.75.

External validation and comparative performance of the final RF model versus IL-6, IL-10, and the IL-10/IL-6 ratio

In validation cohort 1, the 6-feature RF model achieved an AUC of 0.80 (Fig. 3A), an AUPRC of 0.84 (Fig. 3B), and an accuracy of 0.68 (Fig. 3C). The sensitivity and specificity were 0.60 and 0.75, respectively. Similar results were obtained in validation cohort 2 (AUC 0.80, AUPRC 0.69; Fig. 3D–F). In validation cohort 3, which included PVRL-CNS patients, the model achieved an AUC of 0.83 and an AUPRC of 0.72, with 0.79 sensitivity and 0.79 specificity (Fig. 3G).

Fig. 3: Validation of the final random forest (RF) model in three independent cohorts.

A Receiver operating characteristic (ROC) curves demonstrating the diagnostic performance of the six-feature RF model for primary vitreoretinal lymphoma (PVRL) detection in validation cohort 1. B Precision‒recall (PR) curves illustrating the precision‒recall performance of the six-feature RF model for PVRL detection in validation cohort 1. C Confusion matrix heatmap visualizing the classification performance of the six-feature RF model for PVRL detection in validation cohort 1. D ROC curves used to evaluate the performance of the RF model in detecting PVRL and PVRL with central nervous system involvement (PVRL-CNS) in validation cohorts 2 and 3. E PR curves displaying the precision‒recall performance of the six-feature RF model for PVRL/PVRL-CNS detection in validation cohorts 2 and 3. F Confusion matrix heatmap depicting the classification performance of the six-feature RF model for PVRL detection in validation cohort 2. G Confusion matrix heatmap illustrating the classification performance of the six-feature RF model for PVRL-CNS detection in validation cohort 3. H ROC curves used to evaluate the performance of the RF model in distinguishing PVRL from uveitis. I PR curves displaying the precision‒recall performance of the six-feature RF model for distinguishing PVRL from uveitis. J Confusion matrix heatmap illustrating the classification performance of the six-feature RF model for distinguishing PVRL from uveitis. K ROC curves illustrating the diagnostic performance of the IL-10/IL-6 ratio in aqueous humor for detecting PVRL and PVRL with central nervous system involvement (PVRL-CNS). L ROC curves illustrating the diagnostic performance of the IL-10/IL-6 ratio in the vitreous humor for detecting PVRL and PVRL-CNS. PPV positive predictive value, NPV negative predictive value, AUC area under the curve.

In the differentiation cohort distinguishing PVRL from uveitis, the RF model maintained high performance (AUC 0.81, AUPRC 0.76; Fig. 3H–J), confirming its robustness across diverse clinical settings.

As summarized in Tables S7 and S8, no significant differences in age or sex distribution were observed between the PVRL patients and controls across all cohorts. In aqueous humor, PVRL patients presented significantly elevated IL-10 levels and IL-10/IL-6 ratios and decreased IL-6 levels (all P < 0.05). The AUCs for IL-6, IL-10, and IL-10/IL-6 were 0.65, 0.66, and 0.77, respectively (Figs. S3A, B and 3K), and those for PVRL-CNS were 0.67, 0.75, and 0.78, respectively (Figs. S3E, F and 3K). In the vitreous humor, similar trends were observed (P < 0.05), with AUCs of 0.69, 0.74, and 0.77 for PVRL (Figs. S3C, D and 3L) and 0.72, 0.74, and 0.74 for PVRL-CNS (Figs. S3G, H and 3L). The RF model consistently outperformed IL-6, IL-10, and the IL-10/IL-6 ratio across all analyses (DeLong test, P < 0.05).

Model clinical utility, relevance, and interpretability

Decision curve analysis (DCA) demonstrated that the 6-feature RF model provided consistent net clinical benefit across a broad range of threshold probabilities (Fig. S4A–E). The calibration curves further confirmed the model’s robust screening performance (Fig. S4F–H).
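The net benefit plotted in a decision curve is conventionally defined as TP/N − (FP/N) × pt/(1 − pt), where pt is the threshold probability. The text does not state how the DCA was computed, so the following is only a minimal sketch of that standard definition; the function name and the toy labels and probabilities are illustrative assumptions, not study data.

```python
def net_benefit(y_true, prob, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold.

    Standard decision-curve definition: TP/N - (FP/N) * pt / (1 - pt).
    `threshold` must lie strictly between 0 and 1.
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Toy example (hypothetical values, not taken from the study).
labels = [1, 1, 0, 0]
model_prob = [0.9, 0.6, 0.7, 0.2]
treat_all_prob = [1.0] * len(labels)  # "treat all" reference strategy

model_nb = net_benefit(labels, model_prob, threshold=0.5)
treat_all_nb = net_benefit(labels, treat_all_prob, threshold=0.5)
print(model_nb, treat_all_nb)
```

A decision curve is obtained by evaluating `net_benefit` over a grid of thresholds and comparing the model with the "treat all" and "treat none" (net benefit 0) strategies.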

To evaluate parameter dynamics during treatment, 29 PVRL patients were followed for six months (study design: Fig. S5A). Fifteen patients were classified as nonresponders, and 14 were classified as responders. In nonresponders, PDW, HG, and PLCR significantly decreased, whereas basophil count, monocyte count, and monocyte percentage increased (P < 0.05; Fig. S5B). Conversely, in responders, PDW, HG, and PLCR increased, whereas monocyte count and monocyte percentage decreased (P < 0.05; Fig. S5C).

SHAP analysis elucidated model decision-making: the global SHAP summary plot ranked feature importance (Fig. 4A), and local SHAP explanations provided patient-specific interpretability (Fig. 4B, C). The probability of PVRL was calculated according to the 6-feature RF model. The optimal cutoff value was determined to be 0.85 based on the Youden index. Probability values of 0.85 or greater denote high risk, and values of less than 0.85 denote low risk (Fig. 4D). To enhance clinical adoption, an interactive web application was developed (https://primary-vitreoretinal-lymphoma-prediction-app.streamlit.app/), enabling real-time risk prediction based on input of the six-feature values (Fig. 4E).
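The Youden index is J = sensitivity + specificity − 1, and the optimal cutoff is the threshold that maximizes J. The sketch below illustrates this selection and the resulting high-risk/low-risk rule (probability at or above the cutoff is high risk). The function names and the toy data are assumptions for illustration and do not reproduce the study's predictions.

```python
import numpy as np


def youden_cutoff(y_true, prob):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A sample is called positive when its probability is >= the cutoff.
    Ties in J are resolved in favour of the lowest threshold.
    """
    y_true = np.asarray(y_true).astype(bool)
    prob = np.asarray(prob, dtype=float)
    best_j, best_t = -1.0, None
    for t in np.unique(prob):  # candidate thresholds, ascending
        pred = prob >= t
        sens = np.mean(pred[y_true])     # true-positive rate
        spec = np.mean(~pred[~y_true])   # true-negative rate
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best_t = j, float(t)
    return best_t, best_j


def risk_group(probability, cutoff):
    """Apply the rule from the text: >= cutoff is high risk, otherwise low risk."""
    return "high risk" if probability >= cutoff else "low risk"


# Toy example (hypothetical labels and probabilities).
y = [0, 0, 0, 1, 0, 1, 1, 1]
p = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
cutoff, j = youden_cutoff(y, p)
print(cutoff, j)
```

With the cutoff of 0.85 reported for the six-feature RF model, `risk_group(0.85, 0.85)` returns "high risk" and `risk_group(0.84, 0.85)` returns "low risk".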

Fig. 4: Interpretability of the six-feature random forest (RF) model.

A SHapley Additive exPlanations (SHAP) summary bar plot ranking the importance of the six selected features in the RF model. B Local explanation analysis illustrating the prediction for non-primary vitreoretinal lymphoma (PVRL) participants. C Local explanation analysis illustrating the prediction for PVRL participants. D RF model-predicted probabilities for positive cases (n = 252) and negative cases (n = 292). Each dot represents one individual, and no technical replicates were used. The box-and-whisker plots display the distribution of predicted probabilities. The central box represents the interquartile range (IQR), spanning from the first quartile (Q1, 25th percentile) to the third quartile (Q3, 75th percentile). The horizontal line within the box indicates the median (Q2, 50th percentile). The whiskers extend to the minimum and maximum values within 1.5 × IQR from Q1 and Q3, respectively, while data points beyond this range are considered outliers. E Clinical application interface: upon entering actual values of the six features, the tool automatically predicts PVRL risk. PDW platelet distribution width, PLCR platelet large cell ratio, HG hemoglobin.

Hospital-based prospective cohort study validation of the final RF model

A total of 100,526 participants aged 18–92 years were enrolled for PVRL screening (Fig. 5A). Of these, 94,935 met the eligibility criteria. The detailed distributions of the categorized diseases are provided in Table S9. The participants’ data were entered in real time into an online web application hosted at https://primary-vitreoretinal-lymphoma-prediction-app.streamlit.app/ for PVRL risk assessment. A scatter plot (Fig. 5B) of the predicted probabilities based on the RF model is shown for all included individuals. On the basis of the screening results, 77 individuals were identified as high risk for PVRL (predicted probability ≥ 0.85). After excluding 11 individuals for various reasons, 66 high-risk individuals were referred to the Department of Ophthalmology for further evaluation. Among them, 38 cases were confirmed as PVRL by vitreous biopsy. Among the remaining 28 patients, 8 were diagnosed with diffuse large B-cell lymphoma, 14 had other ocular or systemic conditions, and 6 were diagnosed with mucosa-associated lymphoid tissue lymphoma. Among the 94,858 individuals classified as low risk (predicted probability less than 0.85), 12,248 were excluded for various reasons, and 83,610 low-risk individuals were referred to the Department of Ophthalmology & Otorhinolaryngology for further evaluation. Only 2 cases of PVRL were identified at the EENT Hospital. The final RF model demonstrated a sensitivity of 95.0%, a specificity of 99.97%, a PPV of 57.6%, and an NPV of 99.99%.

Fig. 5: The screening process.

A The screening process of the hospital-based prospective cohort study. B A scatter plot of the predicted probabilities based on the random forest (RF) model for all included individuals in the hospital-based prospective cohort study. C The screening process of the community-based cross-sectional study. D A scatter plot of the predicted probabilities based on the RF model for all included individuals in the community-based cross-sectional study. PVRL primary vitreoretinal lymphoma.

Community-based cross-sectional study validation of the final RF model

A total of 515,326 participants aged 40–88 years were enrolled in the PVRL screening program (Fig. 5C). Among them, 511,786 met the eligibility criteria, and their data were entered in real time into an online web application hosted at https://primary-vitreoretinal-lymphoma-prediction-app.streamlit.app/ for PVRL risk assessment. A scatter plot (Fig. 5D) of the predicted probabilities based on the RF model is shown for all included individuals. Based on the model’s output, 22 individuals were identified as being at high risk for PVRL.

These high-risk participants were referred to the Department of Ophthalmology for further evaluation, resulting in the confirmation of 13 PVRL cases. Owing to the cross-sectional nature of the community-based study, follow-up confirmation was not conducted for participants in the low-risk group. The final RF model demonstrated a PPV of 59.1%.
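The sensitivity and PPV reported for the two screening studies follow directly from the confirmed case counts given above. The short sketch below recomputes them; it is limited to the quantities that depend only on those counts, and the function and variable names are illustrative.

```python
def sensitivity(tp, fn):
    """Fraction of true cases that the model flagged as high risk."""
    return tp / (tp + fn)


def ppv(tp, fp):
    """Fraction of evaluated high-risk individuals confirmed as cases."""
    return tp / (tp + fp)


# Hospital-based cohort: 66 high-risk individuals evaluated, 38 confirmed
# as PVRL, and 2 PVRL cases found among low-risk individuals.
hospital_sensitivity = sensitivity(tp=38, fn=2)
hospital_ppv = ppv(tp=38, fp=66 - 38)

# Community-based study: 22 high-risk individuals, 13 confirmed as PVRL.
community_ppv = ppv(tp=13, fp=22 - 13)

print(f"hospital sensitivity: {100 * hospital_sensitivity:.1f}%")
print(f"hospital PPV:         {100 * hospital_ppv:.1f}%")
print(f"community PPV:        {100 * community_ppv:.1f}%")
```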

Discussion

In this study, a six-feature-based RF ML model using complete blood count parameters was successfully developed and validated, and a noninvasive screening tool for PVRL was established. The model demonstrated robust screening accuracy in the discovery cohort, three independent validation cohorts, and the diagnostic differentiation cohort, significantly outperforming conventional biomarkers in the vitreous humor/aqueous humor, such as the IL-10/IL-6 ratio. Longitudinal monitoring further validated the biological relevance of the selected features, with dynamic blood parameter changes aligning with treatment responses. More importantly, the performance of the PVRL screening model was validated through two large sample cohort studies. These findings address a critical unmet need in PVRL screening.

PVRL diagnosis is often delayed because of nonspecific symptoms and the reliance on invasive procedures, such as vitreous biopsy21,22. Our model overcomes these limitations by leveraging the CBC, which makes it cost-effective, widely accessible, and minimally invasive. The six selected CBC parameters, namely, PDW, monocyte percentage, PLCR, monocyte count, HG, and basophil count, are routinely measured in clinical practice, making this tool particularly valuable for primary care settings14. The deployment of a freely accessible web application enables real-time risk assessment and rapid decision-making in resource-limited environments or when ocular sampling is unavailable. Longitudinal analysis revealed that nonresponders had posttreatment decreases in PDW, HG, and PLCR alongside increases in monocyte and basophil counts, mirroring the model’s feature trends and reinforcing their association with disease activity. Further research is needed to elucidate the underlying mechanisms driving these associations.

While the IL-10/IL-6 ratio in the vitreous humor/aqueous humor remains a diagnostic cornerstone for PVRL, its performance (AUC: 0.65–0.78) was significantly inferior to that of our blood-based model (P < 0.05). This disparity may arise from the inherent limitations of ocular sampling: low cellular yield, rapid degradation of biomarkers, and technical variability. In contrast, blood parameters capture systemic immune responses and tumor–host interactions, offering a more comprehensive profile. Furthermore, standardized blood testing minimizes operational variability, increasing reproducibility across clinical settings.

Gozzi et al.23 developed a classification model using Python and XGBoost; in their model, 87% of eyes were correctly diagnosed as PVRL or uveitis (including Fuchs uveitis, sarcoidosis uveitis, Behçet uveitis, and uveitis of unknown origin). Consistent with their demonstration that radiomic analysis of anterior segment optical coherence tomography images can noninvasively distinguish PVRL from uveitis of various etiologies, our findings further support the potential of machine learning–based approaches in assisting the differential diagnosis.

The multicentre case‒control design, coupled with validation in two large-sample cohorts, reinforces the model’s generalizability and clinical reliability. However, several limitations should be acknowledged. First, the relatively small sample size of the PVRL-CNS cohort constrains the model’s predictive power for CNS involvement. Second, owing to the cross-sectional design of the community-based study, follow-up confirmation was not performed for individuals in the low-risk group. Consequently, the model’s sensitivity may be overestimated, whereas its specificity and positive predictive value may be underestimated. Third, some patients classified as low risk by our model did not undergo gold standard biopsy-based confirmation of PVRL status, particularly in the hospital-based and community-based validation cohorts. Consequently, some patients categorized as low risk may in fact have had undiagnosed PVRL, representing potential false negatives. This inherent limitation in diagnostic verification could have led to an underestimation of the true prevalence of PVRL and may have affected the reported performance metrics of our predictive model. Last, a limitation of this study is that, in both the hospital-based prospective cohort and the community-based cross-sectional screening program, PVRL screening was restricted to individuals aged over 40 years. Although the model was developed and validated in cohorts with a broader age range (≥18 years), its performance in younger (<40-year-old) populations remains to be extensively evaluated in large-scale, population-based settings.

By integrating machine learning models with routine blood test data, this study proposes a noninvasive and high-accuracy screening tool for PVRL. Nevertheless, it should be emphasized that any CBC-based machine learning approach for PVRL screening must be confirmed by positive pathological findings from a vitreous biopsy. In clinical practice, such a tool has the potential to markedly reduce diagnostic delays, facilitate timely intervention, and ultimately improve outcomes in patients with this aggressive malignancy. Future research should aim to elucidate the underlying biological mechanisms and evaluate the model’s applicability in therapeutic monitoring, thereby maximizing its transformative potential in ophthalmic oncology.

Methods

Ethical considerations

This study received ethical approval from the institutional review boards of the participating institutions: Huashan Hospital of Fudan University (2023-515), Xuhui Central Hospital (2018025), the Eye and ENT Hospital of Fudan University (2020[2020013]), and Wanbei Coal-Electricity Group General Hospital (WBZY-LLWYH-2024-21). The study was conducted in accordance with the Declaration of Helsinki and the Ethical Guidelines for Medical and Health Research Involving Human Participants. Given that the multicentre case–control study was noninterventional and retrospective in nature, the institutional review boards waived the requirement for informed consent. In the prospective hospital-based screening phase, all participants provided written informed consent, signed before sample collection and study enrollment. Sex of participants was recorded as male or female according to hospital registration data, which were based on self-reported information at the time of admission. Due to the limited sample size, no sex- and/or gender-based analyses were performed.

Study design and participants

The PVRL screening model was developed through a multicentre, case‒control study in which patients were systematically identified and categorized based on predefined diagnostic criteria. The study was conducted in 4 hospitals across China between January 1, 2016, and June 30, 2024. The discovery cohort comprised PVRL patients from the Eye and ENT Hospital of Fudan University. The first validation cohort included PVRL patients from Wanbei Coal-Electricity Group General Hospital and Xuhui Central Hospital of Fudan University, and the second validation cohort included PVRL patients from Huashan Hospital of Fudan University. The third validation cohort included PVRL patients with CNS involvement from Huashan Hospital of Fudan University. Healthy controls were recruited from the health examination centres of the same hospitals where the PVRL patients were enrolled. A one-to-one exact matching strategy was applied, with cases and controls matched on age and sex, to ensure comparable demographic characteristics between groups and to minimize potential confounding effects. Finally, 100 PVRL participants and 117 normal controls were included in the discovery cohort (Fig. S6A); 42 PVRL participants and 60 normal controls were included in the first validation cohort (Fig. S6B); 36 PVRL participants and 42 normal controls were included in the second validation cohort (Fig. S6C); and 77 PVRL-CNS participants and 73 normal controls were included in the third validation cohort (Fig. S6D).

Owing to its heterogeneous and often nonspecific clinical presentation, PVRL is frequently misdiagnosed as uveitis. To address this, a diagnostic differentiation cohort was established, consisting of 155 patients with PVRL and 158 with uveitis. Notably, the 155 PVRL patients overlapped with those in the previously described validation cohorts 1, 2, and 3. The uveitis patients were recruited from the Eye and ENT Hospital of Fudan University. Among the 158 patients with uveitis, 132 had endogenous, non-infectious causes, such as Fuchs uveitis, sarcoidosis uveitis, and Behçet uveitis; 26 had exogenous, infectious causes, including bacterial endophthalmitis–associated uveitis (n = 7) and herpetic uveitis (n = 19).

A follow-up cohort was established to assess dynamic changes in CBC parameters during PVRL treatment. In total, 29 PVRL patients were included, with overlap with the previously described PVRL patients from the Eye and ENT Hospital of Fudan University.

To further evaluate the diagnostic performance of CBC tests and interleukins in aqueous/vitreous humor for PVRL, multiple cohorts from the Eye and ENT Hospital and Huashan Hospital were included. Notably, these PVRL participants are also part of the previously described cohorts from the Eye and ENT Hospital and Huashan Hospital of Fudan University.

The performance of the PVRL screening model was validated through two large-sample cohort studies: one hospital-based prospective cohort study and one community-based cross-sectional study.

A prospective screening program was initiated at the Eye and ENT Hospital of Fudan University. In brief, all patients presenting to the hospital were sequentially enrolled from October 2024 to May 2025. CBC tests were performed for all eligible participants. Individuals who screened positive for PVRL using the established screening model were further evaluated and confirmed by histopathological examination of biopsy samples. As a tertiary referral center for PVRL, the Eye and ENT Hospital of Fudan University receives a substantial number of patients with suspected disease. As a result, the prevalence of PVRL in this prospective screening program is significantly higher than that in the general population.

In parallel, a community-based cross-sectional screening program was launched in Xuhui District, Shanghai, in July 2024. Residents of Xuhui District were sequentially invited to participate through local community health service centres. CBC tests were conducted for all eligible participants, and those who screened positive for PVRL were referred for confirmatory diagnosis via histopathological analysis of biopsy samples. Due to the low prevalence of PVRL and its higher incidence in the elderly population, we limited the prospective screening program to individuals aged over 40 years to enrich the prevalence of PVRL within this community-based cross-sectional screening cohort. As a result, the prevalence of PVRL in this community-based cross-sectional screening program is significantly higher than that in the general population.

The detailed study design and participant information are provided in the Supplementary Materials. The study design is illustrated in Fig. 1A.

Diagnostic criteria for PVRL

The diagnosis of PVRL was established based on the 2016 World Health Organization classification of lymphoid neoplasms24. All patients with PVRL or CNS lymphoma (CNSL) involvement underwent biopsy procedures. PVRL was diagnosed based on positive pathological findings from vitreous biopsy, following established standards for vitreous sampling and biopsy-based diagnostic criteria9,25. Undiluted vitreous samples were obtained via dry vitrectomy and processed for cytological analysis. A positive pathological finding was defined as follows: conventional smear cytology demonstrating large lymphoid cells with irregularly shaped nuclei, multiple prominent nucleoli, and scant basophilic cytoplasm, typically accompanied by small reactive T lymphocyte infiltration, consistent with large lymphomatous cells.

CNSL was diagnosed when a positive pathological finding was obtained from CNS tissue biopsy. Patients in whom CNSL was ruled out at the time of PVRL diagnosis were classified as having isolated PVRL. Patients diagnosed simultaneously with CNSL were classified as having PVRL with concurrent CNS involvement. In accordance with the guidelines of the International Primary CNS Lymphoma Collaborative Group26 and the European Association of Neuro-Oncology27, all patients underwent 2-deoxy-2-[F-18]fluoro-D-glucose positron emission tomography/computed tomography (FDG PET/CT) and/or bone marrow aspiration to exclude systemic lymphoma involvement.

Model development

In total, 30 features were employed to develop the diagnostic models. These features included: neutrophil count, neutrophil percentage, red blood cell count, thrombocytocrit, platelet count (PLT), PDW, HG, eosinophil count, eosinophil percentage, basophil count, basophil percentage, mean platelet volume, lymphocyte count, lymphocyte percentage, hematocrit, monocyte count, monocyte percentage, PLCR, white blood cell count, red blood cell distribution width—standard deviation, red blood cell distribution width—coefficient of variation, mean corpuscular volume, mean corpuscular hemoglobin concentration, mean corpuscular hemoglobin, platelet-to-lymphocyte ratio (PLT/lymphocyte count), neutrophil-to-lymphocyte ratio, lymphocyte-to-monocyte ratio, PLT × neutrophil-to-lymphocyte ratio, PLT × neutrophil × monocyte-to-lymphocyte ratio, and neutrophil × monocyte × PLT-to-lymphocyte ratio.
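The composite indices at the end of this list are simple arithmetic combinations of CBC values. The sketch below shows how such indices could be derived, assuming that each is computed from absolute counts (the text states this explicitly only for the platelet-to-lymphocyte ratio). The function name, dictionary keys, and example values are hypothetical.

```python
def derived_indices(plt, neutrophils, lymphocytes, monocytes):
    """Compute composite CBC indices from absolute counts.

    All four inputs are assumed to be absolute counts in the same units;
    lymphocytes and monocytes must be non-zero.
    """
    nlr = neutrophils / lymphocytes
    return {
        "platelet_to_lymphocyte_ratio": plt / lymphocytes,
        "neutrophil_to_lymphocyte_ratio": nlr,
        "lymphocyte_to_monocyte_ratio": lymphocytes / monocytes,
        "plt_x_neutrophil_to_lymphocyte_ratio": plt * nlr,
        "neutrophil_x_monocyte_x_plt_to_lymphocyte_ratio":
            neutrophils * monocytes * plt / lymphocytes,
    }


# Hypothetical example values, not patient data.
indices = derived_indices(plt=250.0, neutrophils=4.0, lymphocytes=2.0, monocytes=0.5)
for name, value in indices.items():
    print(f"{name}: {value}")
```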

Initially, features exhibiting a correlation coefficient exceeding 0.8 were removed to reduce multicollinearity. In this study, all included variables showed correlations below 0.8. To handle missing values, the median imputation method was employed28. Twelve ML models—AdaBoost, DT, GLM, GBM, K-nearest neighbors, LightGBM, multilayer perceptron classifier, naïve Bayes, RF, support vector machine, TabNet, and XGBoost—were utilized, each leveraging its unique strengths to increase model diversity. To optimize the predictive models, we employed a two-stage hyperparameter tuning strategy. First, a grid search was used to define multiple candidate hyperparameter configurations. A nested 5-fold cross-validation was then performed for each candidate configuration in the discovery cohort. Specifically, the discovery cohort was evenly split into 5 folds, with 1 fold used as the validation set and the remaining four as the training set. Each candidate configuration was applied consistently across all 5 folds, meaning that the cross-validation was used to evaluate the performance of pre-specified hyperparameters rather than to generate different hyperparameters for each fold. The same hyperparameter settings were used to train the model on each of the five training folds and to evaluate it on the corresponding validation fold, resulting in five AUC values for each candidate configuration. Second, the mean-optimal configuration was identified by calculating the mean AUC across all folds for each candidate configuration. The configuration achieving the highest mean AUC was selected as the final hyperparameter set. Minor manual fine-tuning (e.g., adjustment of learning rate or max depth) was performed only around this final configuration to enhance stability. The final selection criterion remained the highest mean performance from the 5-fold cross-validation, ensuring the adoption of a robust and generalizable hyperparameter configuration. 
The final model, trained with the optimized hyperparameters, was then applied to the validation cohorts to generate the reported results. The final hyperparameters of these twelve ML models are shown in Table S10. Each method was chosen for its balance of simplicity, which allows for interpretable results, and power, which enables the modeling of complex linear and nonlinear interactions among input features. Modeling was conducted in Python (v 3.11). The code is publicly available via GitHub–Zenodo and can be accessed at https://zenodo.org/records/17189239.
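
The selection rule described above (evaluate every pre-specified configuration on the same 5 folds, then keep the one with the highest mean AUC) can be sketched as follows. This is an illustration under stated assumptions, not the published code: a one-dimensional k-nearest-neighbour scorer with a single hyperparameter `k` stands in for the twelve models, the folds are stratified by label so that each fold contains both classes, and the data are synthetic.

```python
import random
from statistics import mean


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def knn_scores(train_x, train_y, test_x, k):
    """Score each test point by the fraction of positives among its k nearest neighbours."""
    scores = []
    for x in test_x:
        nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
        scores.append(sum(train_y[i] for i in nearest) / k)
    return scores


def make_folds(y, n_folds=5, seed=0):
    """Deal shuffled indices into folds, class by class (stratification is an assumption)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in (0, 1):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


def select_config(x, y, grid, n_folds=5):
    """Return (best k, mean AUC per k); every candidate is evaluated on the same folds."""
    folds = make_folds(y, n_folds)
    mean_auc = {}
    for k in grid:
        fold_aucs = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(x)) if i not in held]
            scores = knn_scores([x[i] for i in train], [y[i] for i in train],
                                [x[i] for i in held_out], k)
            fold_aucs.append(auc([y[i] for i in held_out], scores))
        mean_auc[k] = mean(fold_aucs)  # one mean over five fold-level AUC values
    best = max(mean_auc, key=mean_auc.get)
    return best, mean_auc


# Synthetic one-feature data: 20 controls around 0, 20 cases around 1.
rng = random.Random(1)
y = [0] * 20 + [1] * 20
x = [rng.gauss(label, 0.7) for label in y]
best_k, results = select_config(x, y, grid=[1, 3, 5, 7])
```

Because the fold assignment is computed once and reused for every candidate, differences in mean AUC reflect the hyperparameters and not the data split.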

Model performance evaluation

The models were assessed using various metrics, including sensitivity, specificity, PPV, NPV, F1 score, and accuracy. Their classification capabilities were further evaluated using the area under the receiver operating characteristic curve and the area under the precision‒recall curve. To compare the AUC values across different ML models, the DeLong test29, a nonparametric method, was utilized.
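
For reference, the threshold-based metrics listed above follow directly from the four cells of the confusion matrix. The sketch below is illustrative only; the function name and the toy labels are assumptions, and it does not guard against empty denominators.

```python
def classification_metrics(y_true, y_pred):
    """Compute threshold-based metrics from binary labels (1 = PVRL, 0 = control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    sensitivity = tp / (tp + fn)  # recall among true cases
    specificity = tn / (tn + fp)  # recall among true controls
    ppv = tp / (tp + fp)          # positive predictive value (precision)
    npv = tn / (tn + fn)          # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    accuracy = (tp + tn) / len(y_true)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "PPV": ppv,
        "NPV": npv,
        "F1": f1,
        "accuracy": accuracy,
    }


# Toy example: 4 cases and 6 controls, with 1 missed case and 2 false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```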

Feature selection and model explanation

Features were selected in the discovery cohort using the SHAP method30, with the aim of reducing the number of predictors while maintaining optimal model performance. We applied a sequential forward-selection procedure guided by SHAP importance and statistical evaluation as follows: (1) ranking by SHAP importance: a model was trained on the full discovery cohort to compute mean SHAP values for all 30 features, generating a global importance ranking; (2) iterative model building: starting from the top-ranked feature, models were built incrementally by adding one feature at a time in descending SHAP order (top 1, top 2, …, up to all 30 features); (3) performance evaluation: the AUC for each model was calculated using 5-fold cross-validation to obtain stable estimates; (4) statistical comparison: each model’s performance was compared with that of the full 30-feature model using the DeLong test29, a nonparametric method for evaluating differences in AUC values; (5) final selection: to balance parsimony and predictive performance, we selected the model with the fewest features whose performance was statistically non-inferior or superior to that of the full 30-feature model. The SHAP method supplies both global and local explanations for the model. The global explanation provided consistent and precise attribution values for each feature, highlighting the relationships between the input features and PVRL, whereas the local explanation accounted for the specific prediction of each individual case on the basis of its input data.
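
The control flow of steps (2) to (5) can be sketched independently of any particular model. In the sketch below, the SHAP ranking is taken as given, and both callables are placeholders: `cv_auc` stands in for the 5-fold cross-validated AUC, and `is_noninferior` stands in for the DeLong comparison. The toy AUC table, the feature names, and the fixed 0.01 margin are invented for illustration and are not the statistical test used in the study.

```python
def forward_select(ranked, cv_auc, is_noninferior):
    """Return (smallest non-inferior top-k subset, AUC curve over k).

    ranked         -- feature names in descending importance order
    cv_auc         -- callable: feature list -> cross-validated AUC
    is_noninferior -- callable: (subset, full set) -> bool, the statistical comparison
    """
    full = list(ranked)
    # Steps (2) and (3): build top-1, top-2, ... models and record their AUC.
    curve = [(k, cv_auc(full[:k])) for k in range(1, len(full) + 1)]
    # Steps (4) and (5): pick the shortest prefix that passes the comparison.
    for k in range(1, len(full) + 1):
        if is_noninferior(full[:k], full):
            return full[:k], curve
    return full, curve


# Hypothetical ranking and AUC values, used only to exercise the function.
ranking = ["f1", "f2", "f3", "f4", "f5"]
toy_auc = {1: 0.70, 2: 0.80, 3: 0.885, 4: 0.888, 5: 0.89}


def cv_auc(features):
    """Placeholder for a 5-fold cross-validated AUC."""
    return toy_auc[len(features)]


def within_margin(subset, full, margin=0.01):
    """Placeholder for the DeLong test: a fixed AUC margin, NOT a significance test."""
    return cv_auc(full) - cv_auc(subset) <= margin


chosen, curve = forward_select(ranking, cv_auc, within_margin)
```

With these toy values the top three features are returned, because adding the fourth and fifth changes the AUC by less than the placeholder margin.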

Web-based model deployment

To facilitate clinical implementation, the final model was deployed as a web application using the Streamlit framework in Python. After the user inputs the relevant feature values, the application provides a probability estimate for PVRL and generates a personalized force plot, which aids the interpretation of individual cases.

Statistical analysis

All statistical analyses were conducted using GraphPad Prism (version 10) and Python (version 3.11). The normality of the data distribution was assessed via the Kolmogorov–Smirnov test. For normally distributed continuous variables, paired or independent Student’s t tests were applied, whereas the Kruskal–Wallis test was used for nonnormally distributed data. Categorical variables were analyzed using the chi-square test where appropriate. Continuous variables are reported as means ± standard deviations (SDs), and categorical variables are expressed as counts and percentages. ROC curve analysis was performed to evaluate the diagnostic value of interleukins, and the AUC was calculated to assess the overall diagnostic performance. The optimal cutoff value was determined using the Youden index (sensitivity + specificity − 1), and the sensitivity and specificity corresponding to this cutoff were also reported. A two-tailed P value < 0.05 was considered statistically significant.
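
The Youden-index cutoff can be illustrated with a small sketch. It assumes that higher marker values indicate disease and that a sample is called positive when its value is at or above the cutoff; the function name and the toy values are hypothetical.

```python
def youden_cutoff(labels, values):
    """Return (cutoff, sensitivity, specificity) maximising J = sens + spec - 1.

    A sample is called positive when its value is >= cutoff.
    On ties, the lowest cutoff reaching the maximum J is kept.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for cutoff in sorted(set(values)):  # every observed value is a candidate
        tp = sum(1 for y, v in zip(labels, values) if y == 1 and v >= cutoff)
        tn = sum(1 for y, v in zip(labels, values) if y == 0 and v < cutoff)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1:]


# Toy marker values for 5 controls (0) and 4 cases (1).
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
cutoff, sensitivity, specificity = youden_cutoff(labels, values)
```

In this example the cutoff of 5.0 captures all four cases while misclassifying one of the five controls.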

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.