Introduction

Chronic kidney disease (CKD) is a substantial worldwide public health problem that affects ~850 million people and becomes life-threatening at the end stage1,2,3. As the gold standard for the pathological diagnosis and prognosis of CKD, renal biopsy benefits both patients and physicians by improving understanding of the disease state and indicating the extent of active and chronic lesions4, enabling physicians to make targeted treatment plans to preserve kidney function and reduce the incidence of kidney failure and death5. However, the use of renal biopsy is often limited by high-risk anatomical characteristics, systemic diseases with high mortality, and the lack of biopsy techniques and nephrology services in less developed regions6,7. The most common complication after percutaneous renal biopsy is bleeding, with hematoma occurring in 11% of cases and transfusion required in 1.6% of cases8,9. Additionally, there is a severe shortage of nephrologists in low-income regions, and significant gaps and variability exist in the availability and accessibility of kidney care compared with other regions10. When patients are unable to undergo renal biopsy or lack access to pathological diagnoses, treatment plans rely on the subjective judgment of nephrologists, and the prognoses are difficult to determine.

Several efforts, such as those utilizing the proteome11, microRNA12, and molecular imaging13, have been investigated for noninvasive diagnosis and prognostic prediction for CKD. A recent breakthrough study revealed the significant associations between plasma proteins and kidney pathologic lesions, which is expected to provide novel biomarkers for the noninvasive diagnosis of CKD14. However, the marker specificity and causal relationships require further clarification, and the associations have not been independently validated. Additionally, medical imaging has become an important tool for CKD diagnosis through the noninvasive observation of renal structural abnormalities. Artificial intelligence (AI)-assisted medical image analysis has been studied extensively, which can help physicians identify imaging features that cannot be easily discerned by the naked eye, providing valuable information for the early detection, pathological, and prognostic assessment of CKD15,16. Despite the positive advances in AI-assisted medical image analysis for CKD-related detection, a noninvasive model for renal biopsy prediction has not yet been achieved.

Alterations in microvascular structure and function often precede end-organ damage, and microvascular dysfunction in peripheral beds can mirror dysfunction in visceral beds17. As a convenient window to observe microvascular and neural tissues, the retina can be used for noninvasive detection, diagnosis, staging, and management of systemic diseases18,19,20. Based on deep learning (DL) algorithms, qualitative associations have been established between ocular features and systemic diseases, such as hepatobiliary diseases21, diabetes22, and cardiovascular diseases23, providing rapid and complementary methods for the screening and identification of systemic diseases24. Specifically, there is a graded relationship between retinopathy severity and declining renal function25,26 that can guide CKD detection and onset prediction via retinal image-based DL models27,28. However, primary detection cannot meet further clinical management demands for CKD patients, and accurate treatment decision-making and prognostic evaluation still rely on renal biopsies29. Therefore, it remains urgent to develop an AI model that is as effective as invasive renal biopsy for noninvasive pathological and prognostic prediction of CKD.

In this study, we developed the kidney intelligent diagnosis system (KIDS), a noninvasive model for early detection, pathological diagnosis, and long-term prognosis prediction of CKD using retinal images alone or in combination with clinical data. To ensure accurate and effective prediction of the pathological types of CKD, we trained the KIDS on data from CKD patients who underwent renal biopsy. The generalizability of the models was validated on external multicenter and multi-ethnic datasets from South China, Kashi in the Xinjiang Uygur Autonomous Region, and Somalia. Moreover, the performance of the KIDS was compared with that of 12 nephrologists for pathological diagnosis in a prospective dataset. The KIDS was capable of achieving the diagnostic ability of invasive renal biopsy. We also demonstrated that the KIDS can stratify CKD patients according to prognostic risk over 5 years. In general, by providing noninvasive, objective pathological and prognostic predictions for CKD patients, this model has the potential to improve kidney care quality and reduce the incidence of end-stage renal disease (Fig. 1), particularly in less developed regions.

Fig. 1: A kidney intelligent diagnosis system (KIDS) based on retinal images and clinical data for noninvasive pathological diagnosis and prognosis prediction of chronic kidney disease.

A Illustration of the geographical distribution map of data sources used in the Kidney Intelligent Diagnosis System (KIDS). The development set, external and multicenter, and multi-ethnic real-world test datasets were collected from 9 hospitals. Data collection included retinal images, renal biopsy, and laboratory and ultrasound examinations. B We used retinal images to develop deep learning models for CKD screening. Combined with nephropathy examinations, a multimodal model was developed to predict 5 renal pathology classifications. Then, based on the predicted renal pathology, prognosis predictions were made for CKD patients. AI artificial intelligence, ZOC Zhongshan Ophthalmic Center of Sun Yat-sen University, FAH First Affiliated Hospital of Sun Yat-sen University, ZPH Zhongshan City People’s Hospital, FPH First People’s Hospital of Foshan, AHY Affiliated Hospital of Youjiang Medical University for Nationalities, FPHK First People’s Hospital of Kashi, SPTCMI Shanxi Provincial Traditional Chinese Medicine Institute, SAH The Second Affiliated Hospital of Xi’an Jiaotong University, BH Banadir Hospital in Somalia. IgAN IgA nephropathy, MN idiopathic membranous nephropathy, DN diabetic nephropathy, ANS arterionephrosclerosis, MCD/FSGS idiopathic minimal change disease and focal segmental glomerulosclerosis.

Results

Characteristics of the datasets

The participant characteristics are summarized in Table 1. For the development and internal testing of the CKD screening DL model, a total of 7609 retinal images from 3965 participants (1731 patients, 2234 controls) were included. Almost half of the patients (46.2%) were in the early stage of CKD, with 17.6% in the moderate stage and 36.3% in the advanced stage. For external validation, we used an external test set consisting of 1461 retinal images from 743 participants with annual health check results.

Table 1 Baseline characteristics of the participants

To predict the pathological diagnosis and the percentage of sclerotic glomeruli (PSG) stage of CKD, 4252 retinal images from 2139 CKD patients who underwent renal biopsy were included from the First Affiliated Hospital of Sun Yat-sen University [FAH, Guangzhou, China] retrospective cohort for the training, validation, and internal testing of the pathological prediction model. The median eGFR was 52.0 mL/min/1.73 m2 (interquartile range [IQR]: 21.4–95.2). IgA nephropathy (IgAN, n = 1059) was the most common pathological type in the datasets, followed by idiopathic membranous nephropathy (MN, n = 547), arterionephrosclerosis (ANS, n = 382), diabetic nephropathy (DN, n = 342), idiopathic minimal change disease (MCD), and focal segmental glomerulosclerosis (FSGS) (MCD/FSGS, n = 261). A total of 179 cases diagnosed with two or more pathological types from the above five types were individually counted for each type. Patients with MCD/FSGS (34 [IQR 19–53]) were younger than those with MN (54 [IQR 45–63]; p < 0.001). A total of 7.5% of patients who underwent biopsy presented high PSG (>75%), which was higher in DN patients (13.4%). The demographics and clinical information of the participants with different pathological types are listed in Supplementary Table 1. To assess the generalizability of the pathological diagnosis and PSG staging AI models, an external test dataset containing 358 individuals with 708 images was collected from three other independent centers: Zhongshan City People’s Hospital [ZPH, Zhongshan, China], First People’s Hospital of Foshan [FPH, Foshan, China] and the Affiliated Hospital of Youjiang Medical University for Nationalities [AHY, Youjiang, Guangxi Zhuang Autonomous Region, China]. Furthermore, real-world test datasets were used to further assess the generalizability of the models. 
These datasets included 272 CKD patients with 34 retinal images from the First People’s Hospital of Kashi [FPHK, Kashi, Xinjiang Uygur Autonomous Region, China], 215 individuals with 73 retinal images from Shanxi Provincial Traditional Chinese Medicine Institute [SPTCMI, Taiyuan, China] and The Second Affiliated Hospital of Xi’an Jiaotong University [SAH, Xi’an, China], and 99 CKD patients from Banadir Hospital [BH, Mogadishu, Somalia] in East Africa. In addition, a total of 574 retinal images prospectively collected from 301 participants with renal biopsy results from FAH were ultimately used as a prospective dataset to test the model’s performance.

To compare the ability of the KIDS with that of nephrologists in diagnosing pathological conditions, we established a prospective, multicenter test dataset comprising 256 patients with renal biopsy results from four independent centers (FAH, ZPH, FPH, and AHY), collected between May 2023 and November 2023. Patients with MCD/FSGS were generally younger (32.0 years [IQR 9.50–46.50]) and predominantly male (70.4%). In contrast, DN patients were significantly older (52.0 years [IQR 44.5–59.0]). Consistent with the derivation cohort, a greater percentage of DN and ANS patients had elevated PSG levels (>75%), at 13.3% and 19.4%, respectively (Supplementary Table 5).

CKD screening DL model using retinal images

We developed KIDS to distinguish CKD patients from non-CKD controls (Fig. 2a) and achieved an AUC of 0.959 (95% CI: 0.943, 0.975) for CKD screening in the internal test dataset. Furthermore, since distinguishing early CKD patients from non-CKD controls is essential for prompt disease intervention, we performed a subgroup analysis to assess the ability of the KIDS to distinguish early CKD patients, moderate CKD patients, and advanced CKD patients from controls, which yielded AUCs of 0.936 (95% CI: 0.909, 0.962), 0.947 (95% CI: 0.913, 0.982) and 0.993 (95% CI: 0.987, 0.999), respectively (Fig. 2b). In the external test dataset, the KIDS achieved AUCs of 0.858 (95% CI: 0.831, 0.884) for CKD screening and similar performances of 0.839 (95% CI: 0.806, 0.871), 0.897 (95% CI: 0.852, 0.943) and 0.889 (95% CI: 0.837, 0.942) for early, moderate and advanced CKD detection, respectively, in the subgroup analysis (Fig. 2c, d). The model successfully identified CKD patients at different simulated prevalences, with AUCs stabilized at ~0.960 on the internal test and 0.860 on the external test (Supplementary Table 4).
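As an illustration of how an AUC and its 95% CI of the kind reported above can be obtained, the following is a minimal sketch in pure Python. It assumes a percentile bootstrap for the interval, which is our choice for illustration; the text does not state which CI method the authors used. The function names and the toy data are hypothetical and are not taken from the study.

```python
import random

def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic; tied scores count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (an assumed method, not the paper's stated one)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack one of the two classes
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Toy data for illustration only (1 = CKD, 0 = control).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9]
point = auc(labels, scores)
low, high = bootstrap_auc_ci(labels, scores)
```

Because the AUC depends only on the ranking of cases against controls, it does not change with disease prevalence in expectation, which is consistent with the stable AUCs reported across simulated prevalences.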

Fig. 2: Performance of the CKD screening AI model using retinal images.

ROC curves showing the model’s performance in detecting CKD (a, c) and in classifying CKD stages (early, moderate, and advanced) (b, d) using retinal images alone. Results are shown for the internal test set (a, b) and the external test set (c, d). CKD, chronic kidney disease; eGFR, estimated glomerular filtration rate. Early CKD, eGFR ≥ 60 mL/min/1.73 m2; moderate CKD, eGFR 30–59 mL/min/1.73 m2; advanced CKD, eGFR <30 mL/min/1.73 m2. AUC area under the curve, CI confidence interval. Source data are provided as a Source Data file.
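The eGFR thresholds given in the caption above can be written as a small helper. This is a sketch for illustration only: the function name is hypothetical, it applies to participants already classified as CKD patients, and the handling of values between 59 and 60 (assigned here to the moderate group) is an assumption, since the caption lists integer bounds.

```python
def ckd_stage_group(egfr):
    """Map an eGFR value (mL/min/1.73 m2) to the subgroup used in Fig. 2.

    Thresholds follow the figure caption: early >= 60, moderate 30-59,
    advanced < 30. Values in [59, 60) are treated as moderate (assumption).
    """
    if egfr < 0:
        raise ValueError("eGFR cannot be negative")
    if egfr >= 60:
        return "early"
    if egfr >= 30:
        return "moderate"
    return "advanced"

# Example inputs; the numbers are chosen for illustration only.
groups = [ckd_stage_group(v) for v in (95.2, 52.0, 21.4)]
```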

Pathological diagnosis and staging of CKD via a noninvasive AI model

Pathological diagnosis multimodal model

We developed a noninvasive model consisting of three submodels for the KIDS (using image-only, clinical data-only, and hybrid data) to identify different pathological diagnoses in CKD patients. In the image-only model, the AUCs ranged from 0.703 to 0.902 for IgAN, MN, DN, ANS, and MCD/FSGS in the internal test set, whereas the clinical data-only model achieved AUCs between 0.858 and 0.970. Then, we trained a multimodal AI model using retinal images and clinical data combined, which achieved better performance. Compared with those of the clinical data-only model, the AUCs for the hybrid models were greater, but this difference was not statistically significant after applying the Bonferroni correction. The calibration plots and Hosmer-Lemeshow goodness-of-fit tests indicated a good agreement between the observed probabilities and those predicted by both the clinical data-only and hybrid models, with the exception of the hybrid model for ANS (Supplementary Fig. 3). The continuous net reclassification indexes (NRIs) were 0.491–1.134 and the integrated discrimination improvements (IDIs) were 0.029 to 0.063, all of which were statistically significant. The AUCs, continuous NRIs, and IDIs of the KIDS for making pathological diagnoses in the image-only, clinical data-only, and hybrid models are shown in Fig. 3 and Table 2.
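To make the reclassification metrics concrete, the sketch below computes the continuous (category-free) NRI and the IDI from their standard definitions, comparing predicted probabilities from a baseline model with those from an extended model. Here the baseline would correspond to the clinical data-only model and the extended model to the hybrid model. The function names and toy values are hypothetical, and the code does not reproduce the authors' analysis pipeline.

```python
def continuous_nri(y, p_old, p_new):
    """Continuous NRI = [P(up|event) - P(down|event)]
                      + [P(down|non-event) - P(up|non-event)]."""
    ev_up = ev_down = ne_up = ne_down = 0
    n_ev = n_ne = 0
    for yi, po, pn in zip(y, p_old, p_new):
        up, down = pn > po, pn < po
        if yi == 1:
            n_ev += 1
            ev_up += up
            ev_down += down
        else:
            n_ne += 1
            ne_up += up
            ne_down += down
    if n_ev == 0 or n_ne == 0:
        raise ValueError("need both events and non-events")
    return (ev_up - ev_down) / n_ev + (ne_down - ne_up) / n_ne

def idi(y, p_old, p_new):
    """IDI = gain in mean predicted probability among events
           minus the gain among non-events."""
    def mean(values):
        return sum(values) / len(values)
    ev_gain = (mean([pn for yi, pn in zip(y, p_new) if yi == 1])
               - mean([po for yi, po in zip(y, p_old) if yi == 1]))
    ne_gain = (mean([pn for yi, pn in zip(y, p_new) if yi == 0])
               - mean([po for yi, po in zip(y, p_old) if yi == 0]))
    return ev_gain - ne_gain

# Toy data for illustration only.
y = [1, 1, 1, 0, 0, 0]
p_old = [0.6, 0.5, 0.7, 0.4, 0.5, 0.3]  # e.g. baseline model probabilities
p_new = [0.8, 0.4, 0.9, 0.2, 0.6, 0.1]  # e.g. extended model probabilities
nri_value = continuous_nri(y, p_old, p_new)
idi_value = idi(y, p_old, p_new)
```

The continuous NRI is the sum of two components that each range from −1 to 1, so it ranges from −2 to 2. A value above 1, such as the upper bound of 1.134 reported here, is therefore possible.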

Fig. 3: Performance of the pathological diagnosis AI models in image-only, clinical data-only, and hybrid models.

ROC curves for the performance of pathological diagnosis of IgAN, MN, DN, ANS, and MCD/FSGS (a–e) in the internal test set, prospective test set, and external test set. DN diabetic nephropathy, IgAN IgA nephropathy, MN idiopathic membranous nephropathy, ANS arterionephrosclerosis, MCD/FSGS idiopathic minimal change disease and focal segmental glomerulosclerosis, AUC area under the curve, CI confidence interval. Source data are provided as a Source Data file.

Table 2 Net reclassification improvement and integrated discrimination improvement by incorporating retinal images into hybrid models of pathological diagnosis AI models via internal, prospective, and external tests

To validate the findings in the retrospective cohort, the KIDS was tested on a prospective cohort dataset, and it achieved AUCs of 0.891–0.967 for IgAN, MN, DN, ANS, and MCD/FSGS in the hybrid model. It also achieved similar performances of 0.953 (95% CI: 0.882, 1.000) for DN and 0.846 (95% CI: 0.754, 0.939) for ANS in the image-only model. In the external test dataset, the KIDS achieved AUCs of 0.867–0.936 for all 5 pathological types in the hybrid model, and it also achieved similar performances of 0.879 (95% CI: 0.824, 0.933) for DN and 0.711 (95% CI: 0.593, 0.829) for MCD/FSGS in the image-only model. The models’ performance in the external test for each pathological type of CKD was similar to that in the internal test dataset. In addition, adding retinal images significantly improved the performance of the hybrid model. The continuous NRIs and IDIs on the prospective test and external test were similar to those on the internal test (Table 2).

To evaluate the model’s diagnostic capability across diverse populations, we conducted further validation using multi-ethnic CKD patient datasets from Kashi, Taiyuan, and Xi’an. The model achieved AUCs ranging from 0.790 to 0.932 for IgAN, MN, DN, ANS, and MCD/FSGS (Supplementary Fig. 4). Given the underdeveloped economic conditions and healthcare infrastructure in Somalia (one of the least developed regions globally), renal biopsies are not feasible. However, accurate pathological diagnosis is essential for treatment decisions in CKD patients. We conducted an external validation of the simplified model’s diagnostic performance using patient datasets from Somalia. The diagnoses were determined by a panel of senior nephrologists and nephropathologists from China, who served as the reference standard. The simplified model achieved AUCs between 0.630 and 0.870 in the Somalia dataset (Supplementary Fig. 5). The models exhibited satisfactory accuracy across diverse populations in China, as well as Somali patients.

Pathological staging AI model

We developed an AI model for the pathological staging of CKD patients for the KIDS, and its AUC reached 0.901 (95% CI: 0.847, 0.955) in the internal test dataset for identifying patients with high PSG without renal biopsy. When evaluated via prospective and external tests, the KIDS achieved similar AUCs of 0.883 (95% CI: 0.813, 0.954) and 0.919 (95% CI: 0.881, 0.958), respectively (Supplementary Fig. 6). In the real-world test datasets, the KIDS achieved AUCs ranging from 0.901 to 0.910 (Supplementary Fig. 4c, d), reflecting its robustness across diverse populations. The levels of creatinine, blood urea nitrogen, and hemoglobin were found to be crucial for the construction of this model (Supplementary Fig. 7).

Progression prediction AI model

In our retrospective cohort, 999 patients were followed for kidney outcomes after biopsy (median follow-up, 44 months; IQR, 22–74 months; range, 0.6–156 months), and 272 participants developed renal endpoints (Fig. 4a and Supplementary Figs. 8 and 9). A Cox proportional hazards (CPH) regression model was built for the KIDS to predict the progression of CKD with confirmed pathological types and PSG classification. The KIDS achieved an average C-index of 0.811 (95% CI: 0.800, 0.822), and the AUCs of the time-dependent receiver operating characteristic (ROC) curves were 0.854 (95% CI: 0.810, 0.897) at 1 year, 0.845 (95% CI: 0.811, 0.879) at 3 years and 0.821 (95% CI: 0.786, 0.857) at 5 years (Fig. 4c). The Kaplan‒Meier curves for the risk stratification from the model are illustrated in Supplementary Fig. 10. As shown, the KIDS had good discrimination ability in stratifying patients into low-, medium- and high-risk (risk score <Q1, Q1–Q3, >Q3) subgroups (p < 0.0001).
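The C-index and the quartile-based risk grouping described above can be sketched as follows. The example assumes that a risk score per patient is already available, for instance the linear predictor of a fitted Cox model; fitting the model itself is not shown. The C-index is computed with Harrell's definition, which is an assumption on our part because the text does not specify the estimator. All names and toy values are hypothetical.

```python
import statistics

def harrell_c_index(times, events, risks):
    """Harrell's C: among comparable pairs (the patient with the shorter time
    had an event), the fraction where that patient has the higher risk score.
    Tied risk scores count as 0.5."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

def stratify_by_quartiles(risks):
    """Low: score < Q1; medium: Q1 to Q3; high: score > Q3."""
    q1, _, q3 = statistics.quantiles(risks, n=4)
    return ["low" if r < q1 else "high" if r > q3 else "medium" for r in risks]

# Toy data for illustration only: follow-up in months, 1 = renal endpoint.
times = [5, 12, 20, 30, 44, 60, 74, 90]
events = [1, 1, 0, 1, 0, 1, 0, 0]
risks = [0.9, 0.8, 0.4, 0.7, 0.3, 0.15, 0.2, 0.1]
c_index = harrell_c_index(times, events, risks)
risk_groups = stratify_by_quartiles(risks)
```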

Fig. 4: Progression of CKD in different pathological types and performance of Cox proportional hazards regression models in progression prediction.

a Kaplan−Meier curves for the progression of CKD according to different pathological types. IgAN IgA nephropathy, MN idiopathic membranous nephropathy, DN diabetic nephropathy, ANS arterionephrosclerosis, MCD/FSGS idiopathic minimal change disease and focal segmental glomerulosclerosis. b Kaplan−Meier curves for risk stratification from the Cox proportional hazards regression model with predicted pathological types from the image-only AI model. Survival curves represent the low-risk, medium-risk, and high-risk subgroups (risk score <Q1, Q1–Q3, >Q3, respectively), and 95% CI regions are represented as shaded areas around the curve. The P value was calculated via the log-rank test among all groups. ROC curves showing the performance of Cox proportional hazards regression models at 1, 3, and 5 years with 10-fold cross-validation using c pathological types from renal biopsy and d predicted pathological types from the image-only AI model. AUC area under the curve, CI confidence interval. Source data are provided as a Source Data file.

Moreover, to assess the clinical importance of retinal image data in predicting CKD prognosis, we employed the probability of pathological classification and the probability of a high rate (≥75%) of glomerulosclerosis based on retinal images as a substitute for the renal biopsy results to construct an alternative risk prediction model for the KIDS. Risk stratification extracted from the KIDS resulted in significant separations of the low-, medium- and high-risk groups (p < 0.0001), and the Kaplan‒Meier curves are illustrated in Fig. 4b. The CPH models from the KIDS achieved an average C-index of 0.804 (95% CI: 0.783, 0.824) and AUCs of 0.852 (95% CI: 0.811, 0.892) at 1 year, 0.833 (95% CI: 0.799, 0.866) at 3 years and 0.809 (95% CI: 0.772, 0.846) at 5 years for the time-dependent ROC curves (Fig. 4d).
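For readers unfamiliar with how the Kaplan‒Meier curves for each risk group are constructed, the following is a minimal sketch of the product-limit estimator applied to one group. The function name and the toy follow-up data are hypothetical; the log-rank comparison between groups is not shown.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of event-free survival.

    Returns a list of (time, survival) pairs at each distinct time where
    at least one event occurred. events: 1 = endpoint, 0 = censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_removed  # events and censored cases leave the risk set
    return curve

# Toy data for illustration only: follow-up in months for one risk group.
km_times = [6, 12, 12, 24, 36, 48]
km_events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(km_times, km_events)
```

Running the same function separately on the low-, medium- and high-risk groups gives the three curves that a stratified plot would display.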

The performance of the KIDS and nephrologists in the diagnosis of different types of CKD

We used a multicenter prospective dataset labeled with renal biopsy results for a human‒machine comparison to preliminarily assess the differences in pathological diagnostic ability between the KIDS and nephrologists. The demographics and clinical information of the patients with different pathological types are listed in Supplementary Table 5. The test dataset was labeled by nine nephrologists with varying levels of clinical experience in China and three nephrologists in Somalia. The performance of the KIDS was subsequently compared with that of the nephrologists via ROC curve plots. As shown in Fig. 5C, the KIDS achieved greater sensitivity than all the nephrologists, except for the diagnosis of ANS, for which the sensitivities of two experts were slightly greater than that of the KIDS. Beyond accuracy, the KIDS also showed an advantage in terms of stability.

Fig. 5: Description of the clinical application scenarios for the KIDS and human and artificial intelligence comparison studies.

A The KIDS can be applied in the following two clinical scenarios. Scenario 1: CKD screening based on retinal images. When a participant arrives at a primary care setting, the KIDS can identify potential CKD cases from retinal images and recommend referrals for suspicious cases to specialized clinics, enabling early detection and timely intervention for CKD patients. Scenario 2: Noninvasive pathological diagnosis. In specialist nephrology clinics, the KIDS is expected to provide objective pathological diagnosis and predict adverse outcomes without the need for invasive renal biopsy. This information can assist nephrologists in making precise treatment decisions and managing patients with personalized care. B A human and artificial intelligence comparison study was performed to evaluate the performance of the KIDS compared with that of nephrologists in diagnosing pathology via ROC curves. C ROC curve plots revealed that the KIDS achieved greater sensitivity than all the nephrologists, except for the diagnosis of ANS, for which the sensitivities of two experts were slightly greater than that of the KIDS. In terms of accuracy, the KIDS also exhibited a greater and more stable advantage. KIDS kidney intelligent diagnosis system, ROC receiver operating characteristic, IgAN IgA nephropathy, MN idiopathic membranous nephropathy, DN diabetic nephropathy, ANS arterionephrosclerosis, MCD/FSGS idiopathic minimal change disease and focal segmental glomerulosclerosis. CN-R Chinese resident nephrologist, CN-S Chinese senior nephrologist, CN-E Chinese expert nephrologist, SO-R Somali resident nephrologist, SO-S Somali senior nephrologist, SO-E Somali expert nephrologist, SO-N Somali nephrologists. Source data are provided as a Source Data file.

Model visualization and explanation

Saliency maps were used to enhance the interpretability of the AI models by highlighting key areas in retinal images. The highlighted areas were related to abnormal exudation, vascular rigidity, and a greater cup-to-disc ratio in CKD patients (Supplementary Fig. 11a–c). Distinctive features of the various pathological types were identified, including microaneurysms and exudation in DN, vascular narrowing and papilledema in ANS, and increased retinal drusen in IgAN. Conversely, the retinopathy observed in MN and MCD/FSGS was relatively mild (Supplementary Fig. 11d–f), which suggested that the signals utilized for prediction may be correlated with the severity of vascular damage associated with CKD of different etiologies. These visualizations are expected to help clinicians and researchers gain a better understanding of the underlying mechanisms and potential treatments for CKD-related retinal pathology.

A detailed explanation of feature selection and interpretation of the CPH model used in predicting CKD progression is displayed in Supplementary Fig. 12. On the basis of the CPH model, a nomogram was developed as an intuitive and visual method for the individualized prediction of CKD progression in patients without renal biopsy data for clinical use. An example of the nomogram using the predicted probability of pathological type and high PSG is shown in Supplementary Figs. 13 and 14.

Discussion

In this study, the KIDS was developed to implement comprehensive process management in CKD diagnosis and treatment decision-making, including (1) detecting CKD in populations, (2) conducting pathological diagnosis and staging in identified CKD patients, and (3) predicting disease prognosis on the basis of the pathological diagnosis. When a participant arrives at a primary care setting, the KIDS can identify potential CKD cases from retinal images and recommend referrals for suspicious cases to specialized clinics, enabling early detection and timely intervention for CKD patients. In specialist nephrology clinics, the KIDS is capable of providing objective pathological diagnosis and predicting adverse outcomes without the demand of renal biopsy, assisting nephrologists in precise treatment decision-making and personalized management (Fig. 5A). The KIDS is expected to play a significant role in improving kidney care and preventing kidney failure, especially in less developed regions with limited medical resources.

Current methods used to evaluate renal function and detect CKD, such as clinical information, imaging techniques (ultrasound, computed tomography, and magnetic resonance imaging)30,31, and routine laboratory tests (blood and urine tests)11,32,33,34, may not be specific enough to differentiate pathological types of CKD. Renal biopsy remains the gold standard for pathological diagnosis, and studies have suggested that prebiopsy and postbiopsy diagnoses differ in up to one-third of patients, highlighting the utility and significance of renal biopsy for patients with unclear clinical conclusions35. Although early biopsy-guided initiation of therapy can prevent complications of decline in kidney function and mitigate disease progression8, the limited availability of specialists, strict indications, and accompanying complications hinder accessibility to renal biopsy8,9,10,11,36,37,38. Percutaneous kidney biopsy procedures are not routinely performed, particularly in low- and middle-income countries (LMICs) and regions, where they may be associated with high complication rates (reported to be as high as 52.6% in these regions) and where nephrologists are lacking39. In particular, renal biopsy is rarely an option in the least developed regions, such as Somalia. Therefore, a noninvasive approach with good compliance that can provide a pathological diagnosis of CKD and guide clinical treatment would advance the state of the art in CKD management and could improve the basic quality of kidney care in LMICs. Retinal photography is the most common examination in ophthalmology, and has the advantages of convenience, affordability, and noninvasiveness. In the U.S., the cost of conventional CKD screening is approximately $80, about five times greater than that of retinal photography. This indicates that CKD screening using retinal images is more cost-effective and pragmatic40,41, making it suitable for CKD screening and diagnosis in LMICs.

Advances in CKD screening using DL algorithms based on fundus images have been achieved previously27,28. Similarly, we have demonstrated that AI models employing retinal images serve as attractive tools for CKD screening and provide high accuracy. More importantly, the major strength of this study is the ability of the AI model to provide both pathological diagnosis and prognosis prediction of CKD based on retinal images. The KIDS can identify the five most common pathological types of CKD from retinal images. Among these, the identification of DN and ANS achieves higher AUCs, which is likely attributable to the obvious microangiopathy in the retina caused by hyperglycemia or hypertension42,43,44. When clinical data such as routine blood, urine, and kidney ultrasound data were combined, the AUCs of the hybrid models reached 0.864–0.976. To assess the effect of adding retinal images to hybrid models, we quantified the improvement of model performance via the NRI45,46. We found that the use of retinal images in addition to clinical data markedly enhanced the pathological prediction (NRI: 0.491–1.134). Although the NRI has faced concerns when applied to poorly fitted models, it can still be used when the models are well calibrated. In this study, our analysis, including calibration plots and Hosmer-Lemeshow goodness-of-fit tests, demonstrates that the models exhibit strong calibration. Therefore, the use of the NRI is appropriate in this context, as it provides valuable insight into model reclassification after calibration.

Furthermore, the model’s robustness and generalization were validated across multicenter and multi-ethnic external datasets, showing consistent and satisfactory performance across diverse populations. Notably, although the Shanxi dataset lacked retinal images, the clinical data-only model was used for pathological diagnosis and demonstrated strong performance with excellent AUC values (0.790–0.932). This indicates the advantages of the KIDS in accommodating different data types, ensuring that even in the absence of retinal images, accurate diagnostic predictions can still be achieved using clinical data alone. In Somalia, one of the least developed countries, kidney care is extremely limited, making it challenging to perform a renal biopsy and to locate standardized and comprehensive medical records. The results of human-machine comparisons also highlighted the poor diagnostic capabilities of nephrologists in Somalia, underscoring the urgent need to provide assistance in pathological diagnosis and improve the level of kidney care. We therefore simplified the diagnostic model to fit the clinical data available in Somalia. The simplified model achieved AUCs ranging from 0.630 to 0.870, demonstrating the model’s effectiveness in providing accurate diagnosis despite limited existing data. These results highlight the model’s generalizability and accessibility, showcasing its great potential advantages for basic-level hospitals that lack access to renal biopsies, especially in less developed regions. In addition, PSG was used as an estimate of chronic kidney injury because it reveals features that are more objective and stable than the eGFR, which is based on the creatinine value and varies with metabolism, medications, and calculation methods47,48. Therefore, we also developed a clinical data-based pathological staging model in the KIDS to identify patients with high PSG prior to renal biopsy. This model can help nephrologists decide whether a renal biopsy is needed for a CKD patient, thus preventing patients with severe sclerosis from sustaining undesirable damage and financial burden.

Accurate progression prediction plays an important role in facilitating specific management and improving the outcomes of CKD patients49,50. Most previous AI models have provided prognostic predictions for only one pathological type of CKD, including the time-to-ESRD or high creatinine value within 5 years51,52,53. Additionally, these models typically target patients with biopsy-proven disease. In this study, we confirmed that the prognoses of CKD patients are related to their pathological types and high PSG values (≥75%). Moreover, we developed the KIDS to predict the time point of deterioration of renal function and the outcome events within 1, 3, and 5 years, including a 50% decrease in the eGFR, renal transplantation, and the initiation of dialysis.

The KIDS exhibited robust performance and generalizability in the prospective dataset, accurately identifying pathological types and even outperforming experienced nephrologists. Considering the lack of renal biopsy in the least developed regions, the KIDS can serve as an alternative diagnostic method and is expected to significantly improve the level of kidney care. Our system has three unique advantages. First, on the basis of existing renal biopsy data, Kaplan‒Meier curves and a CPH regression model were established in the KIDS, which can be used for risk stratification and prognosis prediction of the five pathological types of CKD, and show good performance. Second, in the absence of a renal biopsy, the KIDS can take advantage of the probability of pathological diagnosis for prognosis prediction. Third, the KIDS is simple and easy to use, exhibiting good accessibility. All the input variables used in the prediction models are from routine tests in clinical practice. Additionally, a nomogram was drawn for the individualized prediction of CKD progression in patients, which is simple and effective for physicians.

Our study has several limitations that we aim to address in future research. First, because the pathological diagnosis models were trained using data from CKD patients who underwent renal biopsy to ensure accurate and effective prediction of the pathological types of CKD, the sample size was limited by the invasiveness and restricted accessibility of renal biopsy. Nevertheless, our study included a larger cohort than analogous studies involving renal biopsy, in which the number of participants usually remained below 2000 (ref. 54). Second, regarding the quality of data collection, the retinal images collected in external tests had insufficient brightness, clarity, and integrity, and the differences between retinal images acquired with various instruments can markedly impede the model’s performance. To address these issues, we enhanced the ability of the KIDS to process retinal images from different instruments. In validation with retinal images from different instruments at external test centers, the models showed good generalizability. Finally, because our study cohort included Chinese and Somali patients, the generalizability to a wider range of countries and ethnicities will need to be validated. To enhance the clinical utility of the KIDS across more regions and populations, we have deployed it on a cloud platform (https://zockids.gzzoc.com/modelstore/#/login), enabling broader accessibility and application. Model training on more diverse clinical and demographic cohorts may further improve the diagnostic accuracy and generalizability for clinical use. We also aim to expand the model’s capability to predict a wider range of pathological types. Future studies will further focus on the histological characteristics of CKD to examine their prognostic relevance.

In conclusion, the KIDS is not only suitable for CKD screening of the population, but it also provides accurate diagnostic and prognostic information for individual patients without the requirement for invasive operations. The accuracy and stability of our model in pathological diagnoses are better than those of human doctors, which is extremely beneficial for the CKD patients who cannot receive a renal biopsy. In addition, due to the ease of access to retinal images or clinical data in clinical practice, this noninvasive model shows great potential for adoption for the comprehensive management of CKD and improving kidney care.

Methods

This was a multicenter, retrospective cohort study, and the KIDS was developed and validated utilizing retinal images and clinical data from participants across 8 hospitals in China and 1 hospital in Somalia. The dataset collected from FAH and ZOC was used for model development and internal, prospective testing. External testing was performed using datasets from ZPH, FPH, and AHY. Additionally, small-scale real-world validation was conducted using multicenter and multi-ethnic datasets from FPHK, SPTCMI, SAH, and BH. The study was registered at ClinicalTrials.gov (identifier: NCT05223712) and approved by the Institutional Review Board/Ethics Committee of ZOC (2021KYPJ167), the ICE for Clinical Research and Animal Trials of FAH (2022097), the Ethics Committee for Clinical Research and Experimental Animals of ZPH (K2022-162-2), the Medical Ethics Committee of FPH (2022-90), the Medical Ethics Committee of AHY (YYFY-LL-2022-106), the Medical Ethics Committee of SPTCMI (2025Y-08001), the Medical Ethics Committee of FPHK (2024-69), the Medical Ethics Committee of SAH (2024-256), and the Research Committee of BH (BH/IRVD2024007/001). The study was conducted in accordance with the Declaration of Helsinki. Informed consent was obtained from participants prospectively enrolled at the FAH for model prospective testing. For all other cohorts, informed consent was waived by the respective institutional review boards due to the retrospective design and use of de-identified data. All ethics committees listed above reviewed and approved the study protocol and consent procedures. (Supplementary Fig. 1).

Diagnostic criteria

According to the international guidelines of the Kidney Disease: Improving Global Outcomes (KDIGO), CKD was defined as either an eGFR < 60 mL/min/1.73 m² or kidney damage (albuminuria) for 3 months or longer5. Non-CKD controls were defined as individuals with an eGFR ≥ 60 mL/min/1.73 m², absence of albuminuria, and no documented history of CKD. CKD was categorized into early CKD (eGFR ≥ 60 mL/min/1.73 m² with albuminuria), moderate CKD (eGFR 30–59 mL/min/1.73 m²), and advanced CKD (eGFR < 30 mL/min/1.73 m²)55.
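The categories above can be illustrated with a small rule-based function. This is a minimal sketch, not the study's code: the name `classify_ckd`, its arguments, and the "unclassified" fallback label are hypothetical, and the sketch assumes that the 3-month persistence of reduced eGFR or albuminuria has already been established before the function is called.

```python
def classify_ckd(egfr, albuminuria, ckd_history=False):
    """Map eGFR (mL/min/1.73 m^2) and albuminuria status to the categories above.

    albuminuria: True if albuminuria persisted for 3 months or longer
    (assumed to be determined upstream of this hypothetical helper).
    """
    if egfr < 30:
        return "advanced CKD"
    if egfr < 60:
        return "moderate CKD"
    if albuminuria:
        return "early CKD"
    # eGFR >= 60 without albuminuria: a control only if there is no CKD history.
    # "unclassified" is an illustrative label, not a category from the study.
    return "unclassified" if ckd_history else "non-CKD"


print(classify_ckd(45, albuminuria=True))
print(classify_ckd(90, albuminuria=False))
```

Ordering the checks from the lowest eGFR band upward keeps each branch to a single comparison.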

The pathological diagnosis was established through collaborative assessments by nephrology and pathology specialists, based on renal biopsy results (Supplementary Methods). We developed prediction models for identifying the five most common types of CKD, whose distribution is consistent with epidemiological surveys of the biopsy-proven spectrum of kidney diseases in China56,57: (1) IgAN: IgA nephropathy; (2) MN: idiopathic membranous nephropathy; (3) ANS: arterionephrosclerosis; (4) DN: diabetic nephropathy; and (5) MCD/FSGS: idiopathic minimal change disease (MCD) and focal segmental glomerulosclerosis (FSGS). MCD and FSGS are grouped as podocytopathies owing to their shared pathological characteristics and mechanisms involving podocytes58. The definitions and management principles of these 5 pathological classifications are described in the Supplementary Material. The PSG is a crucial prognostic marker for long-term implications that is used for assessing the severity of kidney diseases. In the pathological staging model, glomerular lesions with ≥75% sclerotic glomeruli are classified as having high PSG, indicating substantial glomerular damage and fibrosis59.

The composite renal endpoint was defined as the occurrence of any of the following: I. deterioration in kidney function, defined as a ≥50% decline in eGFR from baseline (as measured by serum creatinine at admission and follow-up); II. development of end-stage renal disease (eGFR < 15 mL/min/1.73 m²); or III. initiation of renal replacement therapy (dialysis or kidney transplantation)60. Kidney survival time was calculated from the diagnostic renal biopsy to the renal endpoint.
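The three criteria can be expressed as a single boolean check. The sketch below is illustrative only; the function name `reached_renal_endpoint` and its arguments are hypothetical and do not come from the study.

```python
def reached_renal_endpoint(baseline_egfr, followup_egfr, on_rrt):
    """Return True if any criterion of the composite renal endpoint is met.

    baseline_egfr, followup_egfr: eGFR in mL/min/1.73 m^2.
    on_rrt: True if dialysis or kidney transplantation has been initiated.
    """
    decline = (baseline_egfr - followup_egfr) / baseline_egfr
    criterion_1 = decline >= 0.5        # >= 50% decline in eGFR from baseline
    criterion_2 = followup_egfr < 15    # end-stage renal disease
    criterion_3 = bool(on_rrt)          # renal replacement therapy started
    return criterion_1 or criterion_2 or criterion_3


print(reached_renal_endpoint(80, 40, on_rrt=False))
print(reached_renal_endpoint(80, 60, on_rrt=False))
```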

Clinical and image datasets

Dataset for the CKD screening AI model

For CKD detection, we used retinal images of CKD patients and non-CKD participants from retrospective datasets. The development and internal test dataset consisted of non-CKD participants who had annual health examinations, including routine systemic (albuminuria and creatinine tests) and ophthalmic tests at the Health Examination Center, and CKD patients from the Department of Nephrology of FAH between January 3, 2009, and February 28, 2023. The external test set included participants who underwent an annual health check at ZOC and outpatients at ZPH, FPH, and AHY from December 23, 2016, to March 13, 2024.

Dataset for the pathological diagnosis and staging AI model

Patients who met the diagnostic criteria for CKD and underwent both ophthalmological examinations with retinal imaging and kidney specialist examinations with renal biopsy during the same hospitalization were included in the study. The development and internal test datasets were collected from patients of the Department of Nephrology of FAH between April 28, 2009, and August 8, 2022. Validation datasets were provided by a prospective cohort collected in the Department of Nephrology of FAH from August 8, 2022, to April 27, 2023.

For external testing, we obtained 3 independent retrospectively collected datasets: (1) Zhongshan City People’s Hospital [ZPH, Zhongshan City, China]: patients from December 23, 2016, to November 7, 2022; (2) First People’s Hospital of Foshan [FPH, Foshan City, China]: patients from January 13, 2021, to February 28, 2023; and (3) Affiliated Hospital of Youjiang Medical University for Nationalities [AHY, Youjiang City, China]: patients from May 7, 2022, to February 3, 2023. These hospitals are located in three cities across two provinces in southern China, each representing different economic and medical profiles. All these datasets consisted of retinal images and clinical data, including blood and urine test results, kidney ultrasound scans (as shown in Supplementary Table 2), pathological diagnosis, and PSGs derived from renal biopsies.

For real-world validation across multinational and multi-ethnic settings, we collected four datasets: (1) First People’s Hospital of Kashi [FPHK, Kashi, China], spanning December 14, 2020, to November 10, 2024; (2) Shanxi Provincial Traditional Chinese Medicine Institute [SPTCMI, Taiyuan, China], from December 26, 2018, to September 29, 2024; (3) The Second Affiliated Hospital of Xi’an Jiaotong University [SAH, Xi’an, China], covering November 11, 2020, to December 21, 2023; and (4) Banadir Hospital [BH, Mogadishu, Somalia], from July 20, 2023, to November 18, 2024. Due to constraints in local medical resources, the datasets from SPTCMI and BH comprised only clinical data.

Dataset for progression prediction models

For the CKD progression prediction AI model, CKD patients from the retrospective cohort who had not reached the kidney failure stage (eGFR ≥ 15 mL/min/1.73 m² and without dialysis) were included. Administrative and follow-up data from the retrospective cohort were collected for development and validation. Participants were monitored for disease outcomes, including renal dialysis, kidney transplantation, hospitalization, mortality, and cause of death, from May 12, 2009, to June 22, 2022. Follow-up information regarding the participants’ examinations after the initial assessment was obtained through electronic health records, phone interviews, or clinical visits. The drop-out time and the reason for missed visits were recorded. Participants with preexisting renal dialysis or kidney transplantation at the time of admission were excluded from the cohort analysis (Supplementary Methods).

Image acquisition and data quality control

The retinal images of all the datasets consisted of one macula-centered fundus image per eye with a 45° primary field of view. For the datasets collected from FAH, the retinal images were captured with CR-2 AF (Canon) and RetiCam 3100 (SYSEYE). For the external testing dataset, retinal images from ZPH, FPH, and AHY were obtained via KOWA Nonmyd 7 (Kowa), AFC-330 (NIDEK), and CR-2 AF (Canon), respectively. The retinal images included in the analysis met the following criteria: integrity of the macula and optic disc, absence of artifacts, and sufficient resolution. For each subject, one retinal image per eye that met the specified criteria was selected. For subjects with only one eligible eye, the best image from that eye was retained. Out of a total of 15,127 images collected, we excluded 1983 (13.1%) ineligible fundus images.

Model development

CKD screening AI model

The model was developed based on retinal images. The images were assigned randomly to training, validation, and test sets at a ratio of 8:1:1 at the participant level, so that no participant appeared in more than one set. The model’s input comprised retinal images from both the right and left eyes, which were processed separately. Predictions were made at the image level, and the image-level outputs were then averaged to give the final prediction for each participant. For participants with only one eligible eye image, the output for that image was taken as the participant-level prediction. To further evaluate the performance of the CKD screening DL model in populations with different prevalences of CKD, we conducted tests on the internal and external test datasets with simulated prevalences of 5%, 10%, and 20%. For each prevalence, we created 1000 simulated datasets by randomly sampling, with replacement, 1000 × (1 − prevalence) non-CKD samples and 1000 × prevalence CKD samples from each test dataset.
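The participant-level averaging and the prevalence simulation described above could be sketched as follows. This is an assumption-laden illustration using only the Python standard library: the function names, the toy scores, and the fixed random seed are hypothetical and are not taken from the study's code.

```python
import random


def participant_score(image_scores):
    """Average image-level outputs (one or two eyes) into one participant score."""
    return sum(image_scores) / len(image_scores)


def simulate_dataset(ckd, non_ckd, prevalence, size=1000, rng=random):
    """Draw one simulated test set of `size` samples at the given CKD prevalence."""
    n_ckd = round(size * prevalence)
    # random.choices samples with replacement
    sample = [(score, 1) for score in rng.choices(ckd, k=n_ckd)]
    sample += [(score, 0) for score in rng.choices(non_ckd, k=size - n_ckd)]
    return sample


rng = random.Random(0)  # fixed seed for reproducibility (an assumption)
# Toy participant-level scores; the second participant in each group has one eligible eye.
ckd_scores = [participant_score(p) for p in [[0.9, 0.7], [0.6], [0.8, 0.8]]]
non_ckd_scores = [participant_score(p) for p in [[0.1, 0.3], [0.2], [0.4, 0.2]]]

# 1000 simulated datasets at a prevalence of 5%
simulated = [simulate_dataset(ckd_scores, non_ckd_scores, 0.05, rng=rng)
             for _ in range(1000)]
print(len(simulated), len(simulated[0]))
```

Each simulated dataset can then be scored with the usual metrics, and the spread across the 1000 replicates gives an empirical distribution for each prevalence.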

A noninvasive model for pathological diagnosis

To identify the presence of the five categories of pathological diagnosis, we trained a noninvasive model consisting of three submodels to perform five separate binary classification tasks: (1) a retinal image model; (2) a clinical data model; and (3) a hybrid model. In the retinal image model, the input comprised retinal images, specifically one macula-centered image per eye, and the output was the probabilities of the presence of the five pathological categories. The clinical data model was first built via a multilayer perceptron (MLP; 56, 256, 256, 5) with full features. Then, the importance of the features was measured via the permutation feature importance method on the validation set, and features with 95% lower confidence limits >0 were selected. Finally, the MLP (35, 256, 256, 5) model with the selected features and the highest accuracy on the validation set was constructed as the final meta-model. In addition, we developed a simplified version of the meta-model for settings in which some tests and ultrasound examinations may be unavailable under local medical conditions. For the hybrid model, we developed an EfficientNet-B5-based multimodal model by integrating retinal images and clinical data as inputs. The probabilities of each category for both eyes of a patient were averaged to give the probability of the presence of that category at the participant level. (Supplementary Fig. 2).
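The feature-selection step can be illustrated with a standard-library sketch of permutation feature importance. Several details here are assumptions rather than the study's procedure: the importance is measured as the drop in accuracy, the lower 95% confidence limit uses a normal approximation over repeated shuffles, and a trivial `toy_model` stands in for the trained MLP.

```python
import random
import statistics


def accuracy(predict, X, y):
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=30, seed=0):
    """Return (mean accuracy drop, lower 95% limit) for each feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    results = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        mean = statistics.mean(drops)
        sem = statistics.stdev(drops) / n_repeats ** 0.5
        results.append((mean, mean - 1.96 * sem))  # normal approximation (assumed)
    return results


# Toy validation set: the label depends on feature 0 only; feature 1 is noise.
X_val = [[i % 2, (i * 7) % 5] for i in range(40)]
y_val = [row[0] for row in X_val]


def toy_model(row):
    """Hypothetical stand-in for the trained classifier."""
    return row[0]


importances = permutation_importance(toy_model, X_val, y_val)
selected = [j for j, (_, lower) in enumerate(importances) if lower > 0]
print(selected)
```

Features whose lower limit does not exceed 0 are dropped, which mirrors the reduction from 56 to 35 inputs described above.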

Pathological staging model

The pathological staging model was built via XGBoost, with features selected via the least absolute shrinkage and selection operator (LASSO) with 10-fold cross-validation. The datasets were the same as those used for the aforementioned clinical data model. The staging model requires 6 clinical features as inputs: hemoglobin, creatinine, uric acid, and renal ultrasound measurements (left renal length; whether the demarcation of the cortex is clear; and whether the medulla and hyperechoic medulla are greater). The output was the probability of a PSG ≥ 75%.

Progression prediction models

The pathological types confirmed by renal biopsy, combined with clinical features, were used to develop models for predicting the renal endpoints. Using the same methods, we also developed models in which the renal biopsy results were replaced by the predicted probabilities of pathological classification from retinal images and the predicted probability of high PSG (≥75%). CPH regression and 10-fold cross-validation were adopted to train and validate the models. For risk stratification, individual risk scores were calculated with the models. Following a previous study28, patients were then categorized into three groups according to the lower (Q1) and upper (Q3) quartiles of the risk score: low-risk (risk score < Q1), medium-risk (Q1–Q3), and high-risk (risk score > Q3).
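The quartile-based grouping can be sketched as follows. The function name `stratify` and the toy scores are hypothetical, and the quartile method (the default "exclusive" method of Python's `statistics.quantiles`) is an assumption, since the text does not state how Q1 and Q3 were computed.

```python
import statistics


def stratify(risk_scores):
    """Assign low/medium/high risk groups from the quartiles of the risk scores."""
    q1, _, q3 = statistics.quantiles(risk_scores, n=4)
    groups = []
    for score in risk_scores:
        if score < q1:
            groups.append("low")
        elif score > q3:
            groups.append("high")
        else:
            groups.append("medium")  # Q1 <= score <= Q3
    return groups


toy_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # illustrative values only
print(stratify(toy_scores))
```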

Model visualization and explainability

As a graphical representation of the prognostic prediction model based on Cox regression, the nomogram consists of a set of axes representing the distribution of selected variables, with corresponding points assigned to different levels or values of each variable. The points for each variable are added to obtain a total score, which is then translated into a predicted probability of CKD progression at 1, 3, and 5 years on separate axes. Furthermore, to improve the interpretability of the models, a forest plot was adopted for the Cox regression models, and the importance of the features in the final meta-model was measured via the permutation feature importance method on the validation set. A SmoothGrad saliency map61 was used to draw heatmaps by highlighting pixels that strongly influenced the prediction of the models for the images in the test datasets.

The details of model development are provided in the Supplementary Methods.

Comparison between the KIDS and clinical nephrologists in a prospective multicenter validation

We designed an AI and clinical nephrologists’ comparison study to evaluate the performance of the KIDS compared with that of nephrologists in diagnosing pathological types of CKD and to assess the real-world applicability of the KIDS in a clinical setting. The test set prospectively enrolled 256 CKD patients, none of whom had been used in training and validation before, from multiple centers (FAH, FPH, ZPH, and AHY) with single or multiple pathological diagnoses confirmed by renal biopsy between May 1, 2023, and November 22, 2023. All the cases were subsequently assessed via the KIDS and individually by nine nephrologists at different levels of expertise in China: three at the resident level (X.L. [FPH], Q.H. [ZPH], Y.X. [AHY]), three at the senior level (L.J. [FAH], P.Y. [FPH], S.Y. [Dongguan TCM hospital]), and three at the expert level (N.H. [FAH], H.Y. [FAH], J.Y. [FAH]). Three nephrologists, I.M.A. (resident level), N.I.A. (senior level), and M.H.R.H. (expert level) from Banadir Hospital in Somalia, also conducted pathological diagnoses of CKD patients. Nephrologists were asked to determine the pathological type on the basis of the clinical data and retinal images provided for each case. We used ROC curves and accuracy (ACC) values to compare the pathological diagnostic performance of the KIDS and nephrologists.

$$\mathrm{ACC}=\frac{1}{n}\sum_{i=1}^{n}\frac{\left|Y_{i}\cap Z_{i}\right|}{\left|Y_{i}\cup Z_{i}\right|}$$

The ACC for each patient is defined as the ratio of the number of correctly predicted pathological diagnosis labels to the total number of distinct labels (predicted or actual) for that patient. The ACC for the test dataset is calculated as the average ACC across all 256 patients. Here, n is the total number of patients; \(Y_i=(y_1,y_2,\ldots,y_k)\) is the actual label set and \(Z_i=(z_1,z_2,\ldots,z_k)\) is the predicted label set of patient i, with \(y_j,z_j\in\{0,1\}\) for \(1\le j\le k\), where k is the number of labels62,63.
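As a worked illustration of this metric, the sketch below computes the per-patient intersection-over-union of label sets and averages it. The function name `multilabel_acc` and the toy labels are hypothetical; returning 1.0 when both sets are empty is an assumption made only to avoid division by zero.

```python
def multilabel_acc(actual, predicted):
    """Mean over patients of |Y_i ∩ Z_i| / |Y_i ∪ Z_i| for label sets Y_i, Z_i."""
    total = 0.0
    for y, z in zip(actual, predicted):
        union = y | z
        # Both sets empty: treated as a perfect match (assumption for this sketch)
        total += len(y & z) / len(union) if union else 1.0
    return total / len(actual)


# Two toy patients: one exact match, one with a missed second diagnosis
actual = [{"IgAN"}, {"MN", "DN"}]
predicted = [{"IgAN"}, {"MN"}]
print(multilabel_acc(actual, predicted))  # (1/1 + 1/2) / 2 = 0.75
```

Representing each patient's labels as a set is equivalent to the binary vectors in the definition: a label is in the set exactly when its indicator equals 1.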

Statistical analysis

The performance of the models in CKD screening and in pathological diagnosis and staging (binary classification tasks) was assessed via ROC curves, in which sensitivity was plotted against 1 − specificity at different probability thresholds. The AUCs are reported along with 95% DeLong confidence intervals (CIs). The paired DeLong test, continuous NRI, and IDI were employed to assess the improvement in pathological classification performance gained by adding retinal imaging information. Bonferroni correction was adopted for multiple AUC comparisons. Calibration plots and the Hosmer-Lemeshow goodness-of-fit test were used to assess the consistency between the observed probabilities and the probabilities predicted by the metadata model and hybrid model. Using the optimal operating thresholds determined with the Youden index, the sensitivities and specificities for each type of pathological diagnosis made by the KIDS were measured and plotted on the ROC curves to compare the KIDS with clinical nephrologists. Kaplan‒Meier curves and log-rank tests were used to compare the overall survival probability of CKD patients across pathological types and groups. The concordance index (C-index) was used to assess the overall predictive accuracy of the CPH model. Time-dependent ROC curves with AUCs and 95% CIs at 1 year, 3 years, and 5 years were also used to measure model performance. All the statistical analyses were performed via Python (version 3.8.0) and R (version 4.1.1).
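The Youden index mentioned above selects the threshold that maximizes J = sensitivity + specificity − 1. A minimal standard-library sketch follows; the function name `youden_threshold`, the convention that a score at or above the threshold counts as positive, and the toy data are assumptions for illustration.

```python
def youden_threshold(labels, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        # A score >= t is called positive (convention assumed in this sketch)
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # ties keep the lowest threshold
            best_t, best_j = t, j
    return best_t, best_j


toy_labels = [0, 0, 0, 0, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.9]
threshold, j_stat = youden_threshold(toy_labels, toy_scores)
print(threshold, j_stat)
```

For these toy data the selected threshold is 0.5, where sensitivity is 3/3 and specificity is 3/4.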

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.