Abstract
Prediabetes is characterized by blood glucose levels that are higher than normal but below the threshold for diabetes mellitus. While AI models have been used for prediabetes prediction, most rely solely on standard clinical and biochemical markers. This study introduces a novel Pattern Neural Network (PNN) model that integrates total antioxidant scavenging potential with traditional risk factors, providing new insights into the role of oxidative stress in prediabetes risk stratification among Indian adults. A total of 199 individuals aged 18 to 60 years were recruited and classified based on HbA1c levels into Control (n = 99) and Prediabetes (n = 100) groups. Fourteen input features (age, gender, total antioxidant status, HbA1c, FBG, OGTT, TGL, HDL, LDL, VLDL, TC, WC, Hb, and BMI) were used to train a PNN with 14 input nodes, 10 hidden nodes, and one output node. The dataset was randomly divided into training, validation, and testing subsets. Model performance was compared against SVM, KNN, and LR classifiers. Feature importance analysis was conducted to interpret the model’s clinical relevance. The PNN model achieved superior validation performance with an accuracy of 98.3%, outperforming SVM (96%), KNN (83%), and LR (71%). Notably, antioxidant scavenging potential and waist circumference emerged as the most influential predictors. The model’s output value of 0.8770 (threshold > 0.5) effectively identified individuals at increased risk of developing diabetes. By incorporating oxidative stress markers, this study provides the first AI model in an Indian cohort to link antioxidant status to prediabetes risk. The PNN model demonstrates excellent accuracy and interpretability, offering a clinically actionable tool for early disease detection and personalized intervention.
Introduction
Prediabetes is a significant risk factor for developing diabetes, particularly in individuals with Impaired Fasting Glucose (IFG) and Impaired Glucose Tolerance (IGT)1. Studies indicate that approximately 5–10% of people with prediabetes progress to diabetes each year, while a similar percentage revert to a healthy state2. In today’s healthcare landscape, early detection and management of chronic conditions like prediabetes are essential to prevent the development of more severe complications, such as type 2 diabetes.
Traditional methods for diagnosing prediabetes typically rely on a combination of clinical tests and subjective assessments, which can be time-consuming and expensive and may not accurately predict individual risk. Therefore, artificial intelligence (AI) models have emerged as promising tools for early prediction in various diseases4. AI techniques, particularly those based on machine learning and pattern recognition, can analyze large datasets to identify subtle patterns and risk factors that might be overlooked by conventional approaches. By integrating data from electronic health records, wearable devices, and patient surveys, AI models can offer a comprehensive and personalized risk assessment for developing prediabetes14.
The application of AI in prediabetes prediction offers several advantages, including improved accuracy, earlier intervention, and the ability to customize preventive strategies according to individual needs. This not only enhances patient outcomes but also has the potential to reduce healthcare costs by preventing the progression to type 2 diabetes and its associated complications5. As AI technology continues to advance, its role in predicting prediabetes represents a promising frontier in preventive medicine, providing new hope for at-risk populations and transforming the approach to managing chronic diseases.
This study focuses on the early prediction of prediabetes using an artificial intelligence model. Here, we describe how clinical data were used to develop an AI model for predicting prediabetes, together with its mathematical derivation. Although previous literature has demonstrated various models for predicting prediabetes and diabetes, our approach focuses on identifying the best model that not only accurately predicts the condition but also aligns closely with relevant clinical parameters. Additionally, our model was optimized using gradient descent with a stopping criterion to ensure robust performance.
The key contributions of this study are as follows:
(a) Clinical data were collected by health professionals from patients attending the Department of Community Medicine at GVMCH.
(b) Biochemical analysis was carried out for all samples, and participants were then categorised as prediabetes or control on the basis of HbA1c.
(c) To initiate the AI process, the clinical data served as input features and were labelled as 0 for control subjects and 1 for individuals with prediabetes. Using these labels, the data were processed in MATLAB, yielding an output value of 0.8770. An output value greater than 0.5 was classified as prediabetes, while values below 0.5 were considered as controls.
(d) Pearson correlation analysis identified the strongest predictors of prediabetes (BMI, WC, HDL, LDL, TGL, TC, Hb, age, FBG, OGTT, VLDL, and total antioxidant status). These predictors were then used to develop the model, which was mathematically validated as the optimal model for prediabetes prediction.
(e) Interestingly, total antioxidant status (TAS) was statistically significant (p < 0.0001) in this study and is a novel addition to prediabetes prediction. A decrease in total antioxidant status indicates increased oxidative stress, which is associated with disease progression.
This approach provides a clear understanding of the application of artificial intelligence in prediabetes prediction and offers insight into the underlying clinical parameters. Early prediction of prediabetes enables timely intervention to prevent progression to diabetes. Moreover, AI models demonstrate strong capability in identifying critical features, making this network a promising tool for accurate prediabetes prediction in clinics or hospitals.
By incorporating oxidative stress markers, this study provides the first AI model in an Indian cohort to link antioxidant status to prediabetes risk. The PNN model demonstrates excellent accuracy and interpretability, offering a clinically actionable tool for early disease detection and personalized intervention. Our approach not only achieves state-of-the-art predictive accuracy (shown in Table 1) but also provides mechanistic insights into the role of oxidative stress in prediabetes, paving the way for targeted interventions.
Methodology
As a pilot study, a total of 199 individuals aged 18 to 60 years were recruited. Based on HbA1c levels, participants were classified into Control (n = 99) and Prediabetes (n = 100) groups. The study received ethical approval from the Institutional Ethics Committee for Human subjects from Vellore Institute of Technology, Vellore (VIT/IECH/XIV/2023/01), Government Vellore Medical College & Hospital, Vellore (GVMCH) (VMC/III/00001/2023), and the Directorate of Public Health and Preventive Medicine, Chennai (DPH) (DPHPM/DPHSAC/2023/250). All participants were informed about the study’s objectives and procedures, and written informed consent was obtained prior to enrolment.
Peripheral blood samples (6 mL) were collected from each participant after an overnight fast of 8 hours. Samples were drawn into clot activator tubes for biochemical analysis and EDTA vacutainers for HbA1c measurement (Guo et al., 2014)27. HbA1c levels were quantified using the High-Performance Liquid Chromatography (HPLC) method (Bio-Rad D-10). Fasting Blood Glucose (FBG) and Oral Glucose Tolerance Test (OGTT) levels were determined using the glucose oxidase-peroxidase (GOD-POD) enzymatic method. Lipid profiles were assessed as follows: total cholesterol by the enzymatic cholesterol oxidase-esterase peroxidase (CHOD-POD) method, triglycerides by the glycerol-3-phosphate oxidase–phenol–aminoantipyrine (GPO-TOPS) method, and HDL, LDL, and VLDL cholesterol fractions by the selective inhibition method on an automated analyzer.
In addition to standard clinical and biochemical parameters, total antioxidant scavenging potential was measured for each participant using the 2,2-diphenyl-1-picrylhydrazyl (DPPH) free radical scavenging assay28. Ascorbic acid, a natural antioxidant, was used at standard concentrations (1–10 µM) as the positive control, and 0.1 M DPPH was used as the working solution. The serum total antioxidant status (TAS) was measured as the reduction in absorbance at 517 nm using a UV spectrophotometer. TAS was then calculated as the percentage scavenging potential, \(\%\ \text{of scavenging potential} = \left(1-\frac{\text{absorbance of sample}}{\text{absorbance of control}}\right) \times 100\), and the reference range was 20–60% for healthy individuals. This comprehensive feature set enables the model to capture a broader spectrum of risk determinants, including oxidative stress.
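As a minimal sketch of the percentage scavenging calculation described above, the following Python function computes the value from two absorbance readings at 517 nm. The function name, argument names, and the example readings are illustrative assumptions and are not taken from the study data.

```python
def percent_scavenging(abs_sample: float, abs_control: float) -> float:
    """Percentage DPPH scavenging potential from absorbances at 517 nm.

    Implements (1 - A_sample / A_control) * 100.
    """
    if abs_control <= 0:
        raise ValueError("control absorbance must be positive")
    return (1.0 - abs_sample / abs_control) * 100.0


# Hypothetical absorbance readings, for illustration only.
example = percent_scavenging(abs_sample=0.5, abs_control=1.0)
print(f"Scavenging potential: {example:.1f}%")
```

A lower sample absorbance relative to the control gives a higher scavenging percentage, which corresponds to a higher total antioxidant status.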
Artificial intelligence models
Artificial Intelligence (AI) has revolutionized healthcare, particularly in predicting conditions such as prediabetes. Among AI techniques, pattern recognition neural networks (PNNs) have proven especially effective in this domain. These networks analyze large volumes of data to identify intricate patterns and risk factors that may not be immediately evident to healthcare providers6. In this study, we employed PNNs optimized via gradient descent as classifiers to develop a predictive model. The PNN model was designed with 14 input nodes, 10 hidden nodes (Tanh activation), and a single output node (sigmoid activation). The dataset was randomly split into training, validation, and testing subsets. The performance of this model was then compared with other established machine learning algorithms, as detailed below:
Support vector machine (SVM)
Support Vector Machine (SVM) is a supervised machine learning algorithm widely used for classification and regression tasks. It functions by identifying the optimal hyperplane that separates data points into distinct classes within a high-dimensional feature space. By utilizing kernel functions, SVM effectively manages complex, non-linear relationships, making it particularly suitable for binary classification problems18.
In prediabetes prediction, SVM classifies individuals based on health indicators such as age, BMI, glucose levels, and other relevant features. By analyzing these parameters, SVM aids in identifying individuals at risk of developing diabetes or prediabetes17. This non-invasive and efficient approach facilitates early diagnosis and intervention. Research has shown that SVM models achieve high predictive accuracy when properly tuned with appropriate kernels and hyperparameters.
K-Nearest neighbors (KNN)
K-Nearest Neighbors (KNN) is a simple yet powerful supervised learning algorithm used for both classification and regression. It predicts the class of a data point based on the majority label among its ‘k’ nearest neighbors in the feature space. Being non-parametric, KNN does not assume any specific data distribution, which enhances its flexibility across diverse applications19. In the context of prediabetes, KNN categorizes individuals by comparing their health metrics, such as age, BMI, and glucose levels, with those of other subjects in the dataset. This comparative approach allows KNN to effectively identify individuals at risk of prediabetes20. Due to its simplicity and interpretability, KNN is a valuable tool for early risk assessment and timely intervention.
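The majority-vote rule described above can be sketched in a few lines of Python. This is an illustration only: the Euclidean distance, the value k = 3, and the two-feature toy data (imagined as normalized BMI and glucose values) are assumptions, not the configuration used in this study.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical normalized features; 0 = control, 1 = prediabetes.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8), (0.7, 0.7)]
train_y = [0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, query=(0.75, 0.8)))
```

Because KNN relies on distances, features measured on different scales would normally be normalized before use.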
Logistic regression (LR)
Logistic Regression is a classical statistical method for binary classification that estimates the probability of an outcome based on one or more predictor variables. It models the relationship between the dependent variable (e.g., presence or absence of prediabetes or diabetes) and independent variables (e.g., glucose levels, BMI, age) using a logistic function, producing output probabilities between 0 and 1. Logistic regression is widely applied in diabetes and prediabetes prediction to analyze clinical data and identify high-risk individuals21. By examining key parameters such as glucose levels, BMI, and age, logistic regression provides a straightforward, interpretable, and effective approach for risk stratification. Its extensive use in medical research underscores its utility in clinical decision-making.
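As a small illustration of how the logistic function turns a weighted sum of predictors into a probability between 0 and 1, consider the sketch below. The weights, bias, and feature values are hypothetical placeholders and are not fitted coefficients from this study.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logistic_probability(features, weights, bias):
    """Probability of the positive class for one individual."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical normalized predictors (e.g. glucose, BMI, age) and coefficients.
prob = logistic_probability([0.6, 0.4, 0.5], weights=[1.2, 0.8, 0.3], bias=-1.0)
print(f"Predicted probability: {prob:.3f}")
```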
Proposed model development
A pattern neural network (PNN) is a type of artificial neural network designed to identify patterns and regularities within data5. Pattern Neural Networks outperform typical Artificial Neural Networks partly because they naturally incorporate cross-entropy loss during the learning phase, making them better suited for classification tasks. Cross-entropy loss is particularly useful for classification problems because it measures how well the model’s predicted probabilities match the true class labels, and therefore how accurate the network’s predictions are29. In this study, we employed a PNN architecture specifically tailored for pattern recognition (Fig. 1), comprising three key layers: an input layer with p = 14 neurons representing the selected features, a hidden layer with q = 10 neurons utilizing the hyperbolic tangent (Tanh) activation function to capture complex nonlinear relationships, and an output layer with a single neuron using the sigmoid activation function to produce a probability score between 0 and 1. Our model uses a Pattern Neural Network that stores each training sample as a “pattern” neuron and combines these with a Bayesian decision rule. This exemplar-based approach, rather than tuning abstract weight matrices, gives it a built-in pattern recognition mechanism for classifying prediabetes.
This configuration is well-suited for binary classification tasks, such as distinguishing between control and prediabetes cases. The dataset was randomly partitioned into training (70%, n = 139), validation (15%, n = 30), and testing (15%, n = 30) subsets to optimize model performance, and 10-fold cross-validation was performed on the training set for hyperparameter optimization. The PNN model, with its defined input, hidden, and output layers, was developed and trained to predict prediabetes status. The model was represented mathematically as follows. The input vector is denoted as \(x = [x_1, x_2, \ldots, x_p]^T\), where \(x_i\) is the i-th input feature. The input to the hidden layer is computed as
\(z^{(1)} = W^{(1)}x + b^{(1)}\) (1)
The activation function for the hidden layer is the Tanh function, applied elementwise:
\(h = \tanh\left(z^{(1)}\right)\) (2)
where \(\tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}\). The output from the hidden layer is then passed to the output neuron through the weight vector \(W^{(2)}\):
\(z^{(2)} = W^{(2)}h + b^{(2)}\) (3)
The activation function for the output layer is the sigmoid function, which gives the network output y:
\(y = \sigma\left(z^{(2)}\right)\) (4)
where \(\sigma(z) = \frac{1}{1+e^{-z}}\).
Through backpropagation, the network learned the appropriate weights \(W^{(1)}\), \(W^{(2)}\) and biases \(b^{(1)}\), \(b^{(2)}\). During training, the binary cross-entropy loss was minimized, which is given below:
\(L = -\left[y_{true}\log\left(y_{pred}\right) + \left(1-y_{true}\right)\log\left(1-y_{pred}\right)\right]\) (5)
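The forward computation of a network with p = 14 inputs, q = 10 Tanh hidden nodes, and one sigmoid output, together with the binary cross-entropy loss, can be sketched as follows. This is a plain-Python illustration under stated assumptions: the randomly drawn weights and the input vector are placeholders, not the trained parameters or data of this study, and the study itself was carried out in MATLAB.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def forward(x, W1, b1, W2, b2):
    """x: p features; W1: q rows of p weights; b1: q biases; W2: q weights; b2: scalar."""
    # Hidden layer: Tanh applied elementwise to W1 x + b1.
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Output neuron: sigmoid of W2 h + b2.
    y = sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)
    return h, y


def bce(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy for one sample, clipped to avoid log(0)."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))


rng = random.Random(0)
p, q = 14, 10
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(q)]  # placeholder weights
b1 = [0.0] * q
W2 = [rng.uniform(-0.5, 0.5) for _ in range(q)]
b2 = 0.0
x = [rng.random() for _ in range(p)]  # hypothetical normalized feature vector

h, y = forward(x, W1, b1, W2, b2)
label = 1 if y > 0.5 else 0  # 1 = prediabetes, 0 = control
print(f"output = {y:.4f}, predicted class = {label}, loss if true label is 1 = {bce(1, y):.4f}")
```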
The proposed framework
Data
The samples were processed at the BMGRL, Vellore Institute of Technology (VIT), and the resulting parameters were used to build the model. In this study, 14 parameters were considered for the clinical prediction of prediabetes: age, gender, TAS, HbA1c, FBG, OGTT, TGL, HDL, LDL, VLDL, TC, WC, Hb, and BMI. These parameters served as the input features (p = 14), and the number of hidden nodes was taken by random selection (q = 10).
Data preprocessing
Figure 2 illustrates the overall analytical pipeline and preprocessing steps implemented in this study. The model architecture included a hidden layer with 10 nodes (q = 10) to facilitate the learning of complex patterns. Data were randomly partitioned into three distinct sets, training, validation, and testing, with 139, 30, and 30 samples, respectively. All clinical and demographic data were carefully preprocessed, including the handling of missing values, outlier detection, and feature normalization. The training set was used to develop the Pattern Neural Network (patternnet) model, with cross-entropy as the loss function for the prediction of prediabetes.
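A minimal sketch of feature normalization and a random 70%/15%/15% partition is shown below. The paper does not state which normalization was applied, so min-max scaling is an assumption here, as are the function names and the fixed random seed.

```python
import random


def min_max_normalize(rows):
    """Scale each column of a list of feature rows to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column has zero span; divide by 1.0 instead to avoid division by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def split_indices(n, train_frac=0.70, val_frac=0.15, seed=0):
    """Randomly partition range(n) into training, validation, and testing indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_frac)
    n_val = round(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(199)
print(len(train_idx), len(val_idx), len(test_idx))
```

For 199 participants this partition yields subsets of 139, 30, and 30 samples. In practice, the scaling parameters would normally be estimated from the training subset only and then applied to the validation and testing subsets.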
Model optimization
Gradient descent with a stopping criterion was used to minimize the loss function of the neural network by iteratively adjusting the network’s weights and biases. The inputs to the optimization are the loss function to be minimized f(W), its gradient ∇f(W), the initial weights w0, the learning rate η, the maximum number of iterations, and the validation data (Xval, Yval). For a binary classification task, the network’s objective is to minimize the binary cross-entropy loss function, defined in Eq. (5), where ytrue is the true label (0 or 1) and ypred is the predicted probability output from the network.
The gradient descent algorithm updates the weights \(W^{(1)}\), \(W^{(2)}\) and biases \(b^{(1)}\), \(b^{(2)}\) by computing the gradient of the loss function L with respect to each parameter and stepping against it:
\(W^{(k)} \leftarrow W^{(k)} - \eta\frac{\partial L}{\partial W^{(k)}},\quad b^{(k)} \leftarrow b^{(k)} - \eta\frac{\partial L}{\partial b^{(k)}},\quad k = 1, 2\)
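The optimization loop can be illustrated with a small plain-Python sketch of full-batch gradient descent that stops when the validation loss has not improved for a fixed number of consecutive checks. The learning rate, patience of six checks, weight initialization, and the two-feature toy data are assumptions chosen for illustration; this is not the MATLAB training routine used in the study.

```python
import copy
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def forward(x, P):
    """Return hidden activations h and output probability y for one sample."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(P["W1"], P["b1"])]
    y = sigmoid(sum(w * hj for w, hj in zip(P["W2"], h)) + P["b2"])
    return h, y


def mean_bce(data, P, eps=1e-12):
    """Mean binary cross-entropy over (features, label) pairs."""
    total = 0.0
    for x, t in data:
        y = min(max(forward(x, P)[1], eps), 1.0 - eps)
        total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total / len(data)


def train(train_data, val_data, p, q=10, lr=0.1, max_iters=500, patience=6, seed=0):
    """Full-batch gradient descent; stop after `patience` non-improving checks."""
    rng = random.Random(seed)
    P = {"W1": [[rng.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(q)],
         "b1": [0.0] * q,
         "W2": [rng.uniform(-0.5, 0.5) for _ in range(q)],
         "b2": 0.0}
    best_loss, best_P, fails = float("inf"), None, 0
    n = len(train_data)
    for _ in range(max_iters):
        gW1 = [[0.0] * p for _ in range(q)]
        gb1, gW2, gb2 = [0.0] * q, [0.0] * q, 0.0
        for x, t in train_data:
            h, y = forward(x, P)
            d_out = y - t  # gradient of the loss w.r.t. the output pre-activation
            gb2 += d_out
            for j in range(q):
                gW2[j] += d_out * h[j]
                d_hid = d_out * P["W2"][j] * (1.0 - h[j] ** 2)  # Tanh derivative
                gb1[j] += d_hid
                for i in range(p):
                    gW1[j][i] += d_hid * x[i]
        # Step against the averaged gradient.
        for j in range(q):
            P["W2"][j] -= lr * gW2[j] / n
            P["b1"][j] -= lr * gb1[j] / n
            for i in range(p):
                P["W1"][j][i] -= lr * gW1[j][i] / n
        P["b2"] -= lr * gb2 / n
        val_loss = mean_bce(val_data, P)
        if val_loss < best_loss:
            best_loss, best_P, fails = val_loss, copy.deepcopy(P), 0
        else:
            fails += 1
            if fails >= patience:
                break
    return best_P, best_loss


# Hypothetical toy data: label is 1 when the first feature exceeds 0.5.
demo_rng = random.Random(1)
points = [[demo_rng.random(), demo_rng.random()] for _ in range(40)]
toy = [(pt, 1 if pt[0] > 0.5 else 0) for pt in points]
params, loss = train(toy[:30], toy[30:], p=2, lr=0.5, max_iters=200)
print(f"best validation loss: {loss:.4f}")
```

Returning the parameters from the best validation check, rather than the last iteration, is one common way to apply such a stopping rule.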
Model validation
The model trained on the training dataset (n = 139) was evaluated against the validation set using a confusion matrix. Confusion matrices were used to assess the training, validation, and test datasets (shown in Fig. 5). Performance on the validation data was evaluated using four metrics:
(a) Precision: the proportion of individuals predicted to be prediabetic (the positive class) who truly have prediabetes. It is comparable to positive predictive value in epidemiology and is computed as the ratio of true positives (TP) to the sum of TP and false positives (FP): \(Precision = \frac{TP}{TP+FP}\).
(b) Recall: the proportion of individuals with prediabetes who are correctly identified as prediabetic. It is comparable to sensitivity in epidemiology: \(Recall\ or\ sensitivity = \frac{TP}{FN+TP}\).
(c) Accuracy: the ratio of correct predictions (true positives and true negatives) to the total number of predictions: \(Accuracy = \frac{TP+TN}{TP+TN+FP+FN}\).
(d) F1 score: the harmonic mean of precision and recall, \(F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}\). As a result, this score considers both false positives and false negatives.
Here, TP denotes true positives, FP false positives, TN true negatives, and FN false negatives. The ROC curve is used to evaluate the performance of a binary diagnostic classification method. Figure 6 shows the performance of the model by plotting the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis. The area under the ROC curve (AUC) indicates the degree of separability, that is, how well the model distinguishes between individuals with and without prediabetes. A higher AUC indicates a better prediction model.
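The four metrics can be computed directly from the confusion-matrix counts, as in the sketch below. The counts used in the example (TP = 25, FP = 0, TN = 34, FN = 1) are hypothetical: they are not taken from the paper's confusion matrices and were chosen only because they are one set of counts consistent with the accuracy, precision, recall, and F1-score reported for the PNN in the Results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical counts, for illustration only (not from the study's Fig. 5 or Fig. 6).
metrics = classification_metrics(tp=25, fp=0, tn=34, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
```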
Data analysis
Mean and standard deviation were calculated for all measured clinical and biochemical characteristics, and group differences were assessed using Student’s t-test (GraphPad). To visually summarize the data, box plots were employed, providing an effective graphical representation of the distribution and skewness of numerical variables15. Each box plot displays the five-number summary: minimum, lower quartile (Q1), median, upper quartile (Q3), and maximum. Figure 7 presents the box plot visualization of the key biochemical parameters.
Additionally, the Pearson correlation coefficient was used to assess the linear relationship between pairs of variables16. Pearson correlation analysis was performed on biochemical, clinical, and percentage scavenging potential data among individuals with prediabetes. The results revealed moderate to strong correlations among several parameters, as detailed in Table 5.
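For reference, the Pearson correlation coefficient between two variables can be computed as in the following sketch. The function name and the example values (imagined as BMI and waist circumference readings) are illustrative assumptions and do not come from the study data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical BMI (kg/m^2) and waist circumference (cm) values.
bmi = [21.5, 24.0, 27.3, 30.1, 23.2]
wc = [78.0, 85.0, 92.0, 101.0, 88.0]
print(f"r = {pearson_r(bmi, wc):.4f}")
```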
Results
Study participants
As depicted in Fig. 2, data from 199 eligible participants were analyzed in this study. The cohort included 99 individuals in the control group (44 males, 55 females) and 100 individuals with prediabetes (39 males, 61 females), all aged between 18 and 60 years. Fourteen clinical and biochemical characteristics were assessed as potential predictors of prediabetes (Table 2), as these are well-established risk factors in recent literature.
Descriptive statistics, including mean and standard deviation, were calculated for all variables using GraphPad Student’s t-test calculator, as presented in Table 2. Among the 14 evaluated characteristics, age, BMI, waist circumference (WC), percentage scavenging potential, HbA1c, and oral glucose tolerance test (OGTT) values demonstrated statistically significant differences between groups (p < 0.05). These findings are consistent with previous studies that underscore the importance of these parameters in identifying individuals at increased risk for prediabetes and highlight their value in predictive modeling and early intervention strategies.
Statistical analysis for biochemical parameters
Boxplot analysis
Figure 3 presents box plots illustrating the distribution of key biochemical parameters, including HbA1c, triglycerides (TGL), total cholesterol (TC), HDL, LDL, fasting blood glucose (FBG), oral glucose tolerance test (OGTT), and VLDL. Notably, the median HbA1c value for the prediabetes group lies outside the interquartile range of the control group, highlighting a clear distinction between these two populations. Greater interquartile range (IQR) lengths observed for FBG, OGTT, and LDL indicate increased data dispersion and variability in these markers between control and prediabetes groups.
The analysis also revealed positive skewness in several parameters, as detailed in Table 2, suggesting an asymmetric distribution. Among the biochemical parameters, HDL levels in the prediabetes group exhibited a distribution close to normal or slight positive skewness, as shown in Tables 3 and 4. These visual and statistical insights underscore the significant biochemical differences between control and prediabetes groups, supporting their utility as discriminative markers for early risk assessment.
Correlation analysis
Pearson correlation coefficients (r) were calculated. Age showed a positive correlation with waist circumference (r = 0.2028; P = 0.001, 95% CI). BMI correlated with waist circumference (r = 0.4226), the strongest of the associations listed here, and with FBG (r = 0.2063). Waist circumference showed a weak positive correlation with FBG (r = 0.2335; P = 0.001, 95% CI) (Table 5).
Training and predictive performance
To develop and evaluate the predictive model, the dataset was randomly partitioned into training, validation, and testing subsets. Of the total data, 139 samples were allocated for model training (Fig. 2). Individuals were labeled as control (0) or prediabetes (1) based on clinical criteria. The Pattern Neural Network (PNN) model demonstrated efficient and stable training, achieving optimal validation performance with a loss of 0.17602 at epoch 8,079 (Fig. 4). The error histogram (Fig. 5) was centered near zero (− 0.00931), indicating minimal error during training. Errors were quantified as the difference between target and output values.
The model’s predictive performance is summarized in Fig. 6, which displays confusion matrices for the training, validation, testing, and combined datasets. The PNN algorithm effectively distinguished between control and prediabetes classes, as evidenced by the strong diagonal aggregation in the confusion matrices. The classification accuracy exceeded 97.9% in the training set, 95.2% in the validation set, and 95.2% in the testing set, highlighting the model’s robust generalizability and discriminative power.
Input features included BMI, waist circumference, age, gender, percentage scavenging potential, HbA1c, fasting blood glucose (FBG), oral glucose tolerance test (OGTT), triglycerides (TG), HDL, LDL, total cholesterol (TC), hemoglobin (Hb), and VLDL. This comprehensive set of clinical and biochemical predictors enabled the model to capture complex patterns associated with prediabetes risk, supporting its utility as a reliable screening tool.
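The model was represented mathematically, with the input vector denoted as \(x = [x_1, x_2, \ldots, x_p]^T\), where \(x_i\) is the i-th input feature. Our final model equation was given in the Proposed model development section as Eq. 4. Combining the hidden and output layers gives the overall form of the proposed model:
\(y = \sigma\left(W^{(2)}\tanh\left(W^{(1)}x + b^{(1)}\right) + b^{(2)}\right)\)
where \(W^{(1)}\) and \(b^{(1)}\) are the weights and biases of the hidden layer (q = 10 nodes), and \(W^{(2)}\) and \(b^{(2)}\) are those of the output neuron.
For instance, the model yielded an output value of 0.8770. Because this value is greater than the 0.5 threshold, the individual is classified as prediabetic and considered at risk of developing diabetes in the future. The model showed its best performance on the validation set.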
Model validation
The proposed framework was compared with SVM, KNN, and LR to assess the accuracy of the proposed model. Figure 7(A) shows a higher AUC value for the proposed model across the training, validation, and testing datasets. The performance of the proposed model was evaluated from the confusion matrices using multiple metrics, presented in Table 6: overall accuracy (0.98333), precision (1), recall (0.96154), F1-score (0.98039), and area under the Receiver Operating Characteristic (ROC) curve (AUC-ROC). Figure 7(B) shows the SVM classifier ROC curve (AUC = 0.96606), Fig. 7(C) shows the KNN classifier ROC curve (AUC = 0.83937), and Fig. 7(D) shows the logistic regression ROC curve (AUC = 0.73643).
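The AUC values compared above can be understood as the probability that a randomly chosen prediabetic individual receives a higher model score than a randomly chosen control, with ties counted as one half. The sketch below computes the AUC this way by comparing every positive/negative pair; the labels and scores are hypothetical and are not outputs of the classifiers evaluated in this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = prediabetes, 0 = control) and model output scores.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(f"AUC = {roc_auc(labels, scores):.2f}")
```

An AUC of 1.0 means every prediabetic individual is scored above every control, while 0.5 corresponds to chance-level ranking.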
Limitations and future work
The study has several limitations: (1) the single-center approach might limit generalizability; (2) the moderate sample size (n = 199) requires validation in larger cohorts; and (3) the cross-sectional methodology hinders the investigation of temporal relationships. Future studies should include multicentre validation, long-term follow-up, and integration with additional biomarkers.
Discussion
This is the first study to integrate total antioxidant status of the prediabetes patients into an AI-driven prediction model in an Indian cohort. The inclusion of oxidative stress markers provides a mechanistic link between metabolic dysregulation and disease risk, offering a new dimension for risk stratification and personalized intervention.
A study conducted in Romania identified triglycerides, HDL, waist circumference (WC), glucose, and gender as significant risk factors for developing a neural network-based predictive model for prediabetes7. Building on this, our study incorporated a comprehensive set of variables including age, BMI, WC, biochemical parameters (triglycerides, total cholesterol, LDL, HDL, VLDL, HbA1c, fasting blood glucose, oral glucose tolerance test, and hemoglobin), as well as percentage scavenging potential to evaluate oxidative stress levels in individuals.
Our findings revealed significantly lower antioxidant levels in prediabetic individuals (14.257 ± 8.360) compared to controls (36.980 ± 13.362), indicating elevated oxidative stress, which is known to contribute to the progression of metabolic complications. Notably, BMI and waist circumference emerged as the strongest anthropometric predictors of prediabetes, particularly in the 17–19-year age group. ROC analysis further confirmed BMI as a robust predictor of prediabetes risk12. Similarly, a study on the Chinese Han population reported a significant positive correlation between BMI, waist circumference, and prediabetes risk13. In our study, Pearson correlation analysis likewise demonstrated a strong association between BMI and waist circumference among prediabetic subjects, underscoring their value as key predictive markers.
Previous studies have reported varying predictive accuracies using different AI models: a Romanian population study achieved 80–96% accuracy for model validation7, and a US-based study using Random Forest models reported approximately 89% accuracy8. ANN and SVM have also been applied, yielding accuracies of 65.6% and 69.9%, with AUC values of 0.706 and 0.742, respectively9. Other machine learning algorithms, including logistic regression, naïve Bayes, Random Forest, XGBoost, and extremely randomized trees, have demonstrated accuracies ranging from 66% to 82% for prediabetes prediction10. In a Chinese cohort11, the GA_XGBT model showed high precision (0.929), recall (0.951), and an F1-score of 0.94. Another study by Kumar et al. (2025) reported 88% accuracy and 100% precision in detecting arrhythmia, highlighting the potential of such models in clinical applications and pointing to the impact of oxidative stress32.
In this study, a PNN model was used to predict prediabetes in individuals aged 18 to 60 years. This model achieved superior performance, with an accuracy of 98%, precision of 1.0, recall of 0.9615, and an F1-score of 0.9804, indicating its strong predictive capability. To regularize the model, we monitored validation performance and stopped training if the cross-entropy loss failed to improve or worsened over six consecutive iterations, thereby avoiding convergence to suboptimal local minima (Fig. 4). The model can serve as a predictive tool for the early diagnosis and management of prediabetes, and it is also less time-consuming and less expensive. While the model demonstrates excellent performance, further validation in larger, multi-centre cohorts is warranted to confirm generalizability. Future work will focus on external validation, integration with electronic health records, and prospective evaluation in clinical settings.
Conclusion
In this study, BMI, waist circumference (WC), age, percentage scavenging potential, HbA1c, fasting blood glucose (FBG), and oral glucose tolerance test (OGTT) emerged as key risk factors associated with the progression from prediabetes to diabetes. Our primary objective was to identify the most significant predictors among 14 clinical and biochemical characteristics and mathematically represent them using a Pattern Neural Network (PNN) model.
This study shows the importance of incorporating oxidative stress markers into an AI model, which achieved superior accuracy and clinical interpretability. The proposed PNN model offers a robust, actionable tool for early identification and intervention, with the potential to transform preventive strategies in high-risk populations and reduce the risk of progression to diabetes.
The performance of the PNN model was rigorously evaluated using confusion matrices and key metrics including precision, accuracy, recall, and F1-score. The PNN demonstrated superior validation performance across all these indicators, establishing it as the most effective model for predicting prediabetes in our dataset. While the SVM model exhibited high precision and overall accuracy, it showed comparatively lower recall and F1-score values. Similarly, KNN and LR models had reduced recall and AUC values, suggesting they may require further optimization to achieve comparable predictive power.
A strong correlation between BMI and waist circumference further underscores their critical role as major anthropometric risk factors for prediabetes development. Given its outstanding validation accuracy and balanced performance metrics, the PNN model holds significant promise as a reliable tool for early prediabetes prediction and risk stratification.
Data availability
The datasets generated and/or analysed during the current study are not publicly available due to the confidentiality of patient information but are available from the corresponding author on reasonable request.
References
Mansourian, M. et al. Factors associated with progression to pre-diabetes: a recurrent events analysis. Eat. Weight Disorders-Studies Anorexia Bulimia Obes. 25, 135–141 (2020).
Twohig, H., Hodges, V. & Mitchell, C. Pre-diabetes: opportunity or overdiagnosis? Br. J. Gen. Pract. 68 (669), 172–173 (2018).
Liu, C. H. et al. Machine learning prediction of prediabetes in a young male Chinese cohort with 5.8-Year Follow-Up. Diagnostics 14 (10), 979 (2024).
Nguyen, H. V., Choi, Y. & Byeon, H. An explainable hybrid deep learning model for prediabetes prediction in men aged 30 and above. J. Men’s Health. 20 (10), 52–72 (2024).
Giri, S. AI-Driven predictive models for early detection of diabetes: A review study. Int. J. Comput. Sci. Mob. Comput. 13 (9), 24–33. https://doi.org/10.47760/ijcsmc.2024.v13i09.004 (2024).
Nature. AI-based diabetes care: risk prediction models and applications. Nat. Reviews Endocrinol. 20 (3), 123–135 (2024).
Vîrgolici, O. & Virgolici, H. Predicting prediabetes using simple a multi-layer perceptron neural network model. In ICIMTH (pp. 168–171). (2023).
Bashar, A. R., Goudarzi, M. & Tsokos, C. P. A machine learning classification model for detecting prediabetes. J. Data Anal. Inform. Process. 12 (3), 462–478 (2024).
Choi, S. B. et al. Screening for prediabetes using machine learning models. Comput. Math. Methods Med. 2014 (1), 618976 (2014).
Yuk, H., Gim, J., Min, J. K., Yun, J. & Heo, T. Y. Artificial Intelligence-based prediction of diabetes and prediabetes using health checkup data in Korea. Appl. Artif. Intell. 36 (1), 2145644 (2022).
Li, J. et al. A tongue features fusion approach to predicting prediabetes and diabetes with machine learning. J. Biomed. Inform. 115, 103693 (2021).
Pandey, U. et al. Anthropometric indicators as predictor of pre-diabetes in Indian adolescents. Indian Heart J. 69 (4), 474–479 (2017).
Ou, Q. et al. Contribution of body mass index, waist circumference, and 25-OH-D3 on the risk of pre-diabetes mellitus in the Chinese population. Aging Male. 27 (1), 2297569 (2024).
Nomura, A., Noguchi, M., Kometani, M., Furukawa, K. & Yoneda, T. Artificial intelligence in current diabetes management and prediction. Curr. Diab. Rep. 21 (12), 61 (2021).
Pranto, B. et al. Evaluating machine learning methods for predicting diabetes among female patients in Bangladesh. Information 11 (8), 374 (2020).
Chuemere, A. N. et al. Correlation between blood group, hypertension, obesity, diabetes, and combination of prehypertension and pre-diabetes in school aged children and adolescents in Port Harcourt. IOSR J. Dent. Med. Sci. 14 (12), 83–89 (2015).
Sagar, K. et al. Diabetes Prediction using Support Vector Machine. In 2024 5th International Conference for Emerging Technology (INCET) (pp. 1–4). IEEE. (2024).
Jain, V. Diabetes prediction using support vector machine, naive bayes and random forest machine learning models. In 2022 6th International Conference on Electronics, Communication and Aerospace Technology (pp. 837–841). IEEE. (2022).
Sachdeva, R. K., Thapa, B., Vij, S., Bathla, P. & Ahuja, R. Predictive Method for Diabetes using Machine Learning. In 2023 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (pp. 52–57). IEEE. (2023).
Alqahtani, S. A. M. et al. Feature importance and model performance for prediabetes prediction: A comparative study. J. King Saud Univ.-Sci. 36(11), 103583 (2024).
Mundargi, Z., Dabade, M., Chindhe, Y., Bondre, S. & Chaudhary, A. Diabetes Prediction Using Logistic Regression. In International Conference on Renewable Energy, Green Computing, and Sustainable Development (pp. 51–61). Cham: Springer Nature Switzerland. (2023).
Zueger, T. et al. Machine learning for predicting the risk of transition from prediabetes to diabetes. Diabetes. Technol. Ther. 24 (11), 842–847 (2022).
Kushwaha, S. et al. Harnessing machine learning models for non-invasive pre-diabetes screening in children and adolescents. Comput. Methods Programs Biomed. 226, 107180 (2022).
Kulkarni, A. R. et al. Machine-learning algorithm to non-invasively detect diabetes and pre-diabetes from electrocardiogram. BMJ Innov. 9, 1 (2023).
Tobore, I. et al. Towards adequate prediction of prediabetes using spatiotemporal ECG and EEG feature analysis and weight-based multi-model approach. Knowl.-Based Syst. 209, 106464 (2020).
Olisah, C. C., Smith, L. & Smith, M. Diabetes mellitus prediction and diagnosis from a data preprocessing and machine learning perspective. Comput. Methods Programs Biomed. 220, 106773 (2022).
Guo, F., Moellering, D. R. & Garvey, W. T. Use of HbA1c for diagnoses of diabetes and prediabetes: comparison with diagnoses based on fasting and 2-hr glucose values and effects of gender, race, and age. Metab. Syndr. Relat. Disord. 12 (5), 258–268 (2014).
Chrzczanowicz, J. et al. Simple method for determining human serum 2,2-diphenyl-1-picryl-hydrazyl (DPPH) radical scavenging activity–possible application in clinical studies on dietary antioxidants. Clin. Chem. Lab. Med. 46 (3), 342–349 (2008).
Burke, H. B., Rosen, D. B. & Goodman, P. H. Comparing artificial neural networks to other statistical methods for medical outcome prediction. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94) (Vol. 4, pp. 2213–2216). IEEE. (1994).
Kumar, A. et al. Comprehensive framework for thyroid disorder diagnosis: Integrating advanced feature selection, genetic algorithms, and machine learning for enhanced accuracy and other performance matrices. PLoS One 20(6), e0325900 (2025).
Kumar, A., Dhanka, S., Singh, J., Ali Khan, A. & Maini, S. Hybrid machine learning techniques based on genetic algorithm for heart disease detection. Innov. Emerg. Technol. 11, 2450008 (2024).
Kumar, A., Singh, J. & Khan, A. A. Arrhythmia Detection Using Machine Learning: A Study with UCI Arrhythmia Dataset. In International Conference on Frontiers of Intelligent Computing: Theory and Applications (pp. 217–226). Singapore: Springer Nature Singapore. (2024).
Li, X. et al. Interpretable machine learning method to predict the risk of pre-diabetes using a national-wide cross-sectional data: evidence from CHNS. BMC Public Health 25(1), 1145 (2025).
Acknowledgements
• The authors acknowledge Vellore Institute of Technology, Vellore, Tamil Nadu, India, for the facilities provided.
• The authors thank Mothers Care Diabetes Centre and the Government Vellore Medical College and Hospital staff for their support.
Funding
Open access funding provided by Vellore Institute of Technology. Nil.
Author information
Contributions
Aarthi Yesupatham: manuscript writing, review of literature, experimental study, sample collection, statistical analysis, and mathematical model development. Dr. Raja Das: reviewing, conceptualization, supervision, statistical analysis, and mathematical model development. Dr. Go Bharani: sample recruitment and biochemical analysis. Dr. Meera S: assistance with patient recruitment and sample collection. Dr. Radha Saraswathy: conceptualization, reviewing, and supervision.
Ethics declarations
Competing interests
The authors declare no competing interests.
Human ethics and consent to participate declarations
This study was conducted in accordance with the ethical principles of the Declaration of Helsinki. The research protocol was approved by the respective Institutional Ethics Committees for studies on human subjects. The study received ethical approval from the Institutional Ethics Committee of Vellore Institute of Technology, Vellore (VIT/IECH/XIV/2023/01), Government Vellore Medical College & Hospital, Vellore (GVMCH) (VMC/III/00001/2023), and the Directorate of Public Health and Preventive Medicine, Chennai (DPH) (DPHPM/DPHSAC/2023/250). All participants were informed about the study’s objectives and procedures, and written informed consent was obtained prior to enrolment.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yesupatham, A., Das, R., Bharani, G. et al. Artificial intelligence model as a tool to predict prediabetes. Sci Rep 15, 43421 (2025). https://doi.org/10.1038/s41598-025-23227-0
DOI: https://doi.org/10.1038/s41598-025-23227-0






