Fig. 1: Overview of the analytic workflow of the study.

a We first extracted a dataset of 32,501 individuals with newly diagnosed T2D from the CRDS, a large-scale electronic health record database comprising over 8.6 million individuals with detailed demographic, laboratory, diagnostic, and surgical data. To evaluate disease progression and heterogeneity, we first performed model adaptation by mapping each individual to a previously defined “Scottish tree” of diabetes subgroups based on 9 clinical variables: HbA1c, BMI, total cholesterol (TC), HDL-C, alanine aminotransferase (ALT), creatinine (Cr), systolic blood pressure (SBP), triglycerides, and diastolic blood pressure (DBP). b To identify the most informative clinical features for defining T2D subtypes in the Chinese population, we applied a variational autoencoder (VAE)-based feature selection framework. This process yielded 10 key variables: HDL-C, triglycerides, SBP, ALT, HbA1c, LDL-C, Cr, heart rate, DBP, and BMI. These 10 original variables were then used to construct a Chinese-specific diabetes tree. c The Chinese tree was developed using 80% of the CRDS T2D cohort as a training set and validated internally in the remaining 20%. External validation was performed in two independent cohorts: the Joint Asia Diabetes Evaluation (JADE) registry and the DR cohort. The consistency of phenotypic and complication profiles across cohorts confirmed the robustness of the Chinese tree. An interactive web-based tool was also developed for clinical application and research use (https://wenglab-t2d-phenotype.shinyapps.io/wenglab-t2d-phenotype/). (Created in https://BioRender.com).