Introduction

Insulin is a peptide hormone produced by the beta cells of the pancreas. It plays a crucial role in regulating glucose transport and facilitating glucose absorption by cells from the bloodstream for energy production. Any disruption in this process leads to a medical condition called diabetes mellitus (DM), more commonly referred to as diabetes. Globally, this chronic disease affects nearly 537 million people, with an additional 374 million classified as prediabetic and at similar risk of developing complications1. Diabetes can lead to both immediate and long-term health issues, including blindness, kidney failure, amputations, heart disease, seizures and, in severe cases, even death2,3,4. Diabetes is broadly categorized into type 1 and type 2. In type 2 diabetes, cells become resistant to insulin, hindering their ability to absorb glucose. Type 2 diabetes constitutes 95% of the diabetes population and is typically prevalent in older age groups. Type 1 diabetes, on the other hand, arises when the pancreas fails to produce sufficient insulin. It accounts for about 5% of the diabetes population, is mainly diagnosed between the ages of 0 and 20, and requires much more rigorous management to avoid imminent consequences. The overarching goal of diabetes management is to maintain euglycemia (normal glucose levels), or equivalently to increase the time spent within the optimal range (time-in-range, TIR)2. Insulin therapy is the mainstay of diabetes management, requiring a delicate balance between short-term and long-term objectives5,6. While diabetes cannot be cured, proper insulin management allows patients to maintain optimal blood glucose levels and minimize the risk of further complications7.

The most widely used method for glucose monitoring is the fingerstick technique, where a small blood sample is drawn by pricking the finger and analyzed with a glucometer8,9. This was revolutionized by continuous glucose monitoring (CGM) devices, which offer real-time glucose monitoring with automated readings. CGM has demonstrably increased the time patients spend in TIR, a crucial metric in diabetes care10,11. CGM devices have been shown to enhance glycemic control by reducing both hypoglycemia and hyperglycemia excursions through real-time predictions12,13. While previous research for type 1 diabetes has primarily focused on predicting hypoglycemia events10,14,15,16,17,18,19, achieving glycemic goals requires maximizing glucose readings in TIR by avoiding both hypoglycemia and hyperglycemia.

Efforts to develop machine learning (ML) models for glucose prediction using historical CGM readings have shown reasonable efficacy. These models typically rely on extensive training data, often gathered through large-scale clinical research studies or healthcare providers. The data is commonly stored on cloud-based servers, which poses significant privacy risks, including data theft20. Given these escalating concerns around data theft, user security, and privacy, many countries have recently enacted data-protection regulations to protect the interests of their citizens21,22,23. An alternative strategy is to develop individual models with personal data to enhance privacy, but this comes with the drawback of reduced prediction performance due to limited data, since ML and deep learning models require substantial data for optimal performance.

Thus, to overcome the challenges in glycemic excursion prediction and privacy concerns, we propose a collaborative learning framework that addresses these concerns simultaneously. The major contributions of this paper are as follows:

  • A novel HH loss function for improved prediction performance in the glycemic excursion regions.

  • FedGlu, a personalized machine learning model trained in a federated learning framework, simultaneously improving model performance and preserving data privacy.

The remainder of the paper is organized as follows: Section II reviews the current literature and highlights relevant work. Section III describes the dataset in detail and explains the methods used in this paper, including the HH loss function and federated learning, along with the evaluation metrics and experimental setups. Section IV compares the results of the different experiments with the HH loss function, and contrasts central, local, and personalized federated models. Section V discusses the results, limitations, and future work. Lastly, Section VI concludes the paper.

State of the Art

In this section, we provide a brief overview of the existing literature on glucose prediction. While a comprehensive review is beyond the scope of this paper, we emphasize works that predict glucose values with deep learning approaches, apply federated learning to healthcare applications, and extend a global federated model through fine-tuning for personalization.

CGM-based glucose prediction

The first attempt to predict future glucose levels using past values dates back to 1999 by Bremer and Gough24. Since then, researchers have continually advanced the literature to develop powerful and highly accurate models. Machine learning models for glucose prediction fall into two categories: (a) classification tasks and (b) regression tasks. Consistent with the theme of our paper, we focus on regression tasks. Earlier approaches utilized conventional machine learning methods such as linear regression (LR), support vector regression (SVR), random forests (RF), and boosting algorithms25,26,27,28, as well as time-series forecasting methods like autoregressive integrated moving average (ARIMA)29,30, for predicting glucose levels. In recent years, there has been a shift towards deep learning (DL) models, leveraging automated feature learning, robust pattern recognition, and abstraction through multiple layers.

Various DL architectures, including multi-layer perceptrons (MLP)31,32, convolutional neural networks (CNN) and convolutional recurrent neural networks (CRNN)33,34, recurrent neural networks (RNN)35,36, long short-term memory networks (LSTM)37,38,39, dilated RNNs40,41 and bi-directional LSTMs42,43,44, have been proposed for glucose prediction.

Shuvo et al.45 introduced a deep multi-task learning approach using stacked LSTMs to predict personalized glucose concentration. The proposed approach includes a combination of stacked LSTMs to learn generalized features across patients, clustered hidden layers for phenotypical variability in the data and subject-specific hidden layers for optimally fine-tuning models for individual patient improvement. The authors demonstrate superior results compared to state-of-the-art ML and DL approaches on the OhioT1DM dataset.

A transformer based on an attention mechanism was recently proposed to forecast glucose levels and hypoglycemia and hyperglycemia events46. The proposed transformer network includes an encoder network to perform the regression and classification tasks under a unified framework, and a data augmentation step using a generative adversarial network (GAN) to compensate for the rare events of hypoglycemia and hyperglycemia. Results were demonstrated on two datasets, one including type 1 diabetes patients and the other on type 2 diabetes patients.

In pursuit of enhanced prediction performance, researchers have often grappled with the constraint of limited data for individual subjects or patients. A recently proposed approach addresses this challenge through multitask learning for advancing personalized glucose prediction47. This approach was evaluated against sequential transfer learning, revealing two key findings: (a) individual patient data alone may not suffice for training DL models, and (b) a thoughtful strategy is crucial to leverage population data for improved individual models. This study was also based on the OhioT1DM dataset.

While the literature commonly employs standard regression metrics like root-mean-squared error (RMSE) and mean absolute error (MAE), along with Clarke’s Error Grid Analysis (EGA) for clinical context, an often-overlooked aspect is the behavior of these metrics in the glycemic excursion regions. For instance, RMSE can be evaluated separately over the hypoglycemia and hyperglycemia ranges, and Clarke’s EGA can be applied to glucose readings falling in Zones C, D, and E. Mu et al.48 introduced a normalized mean-squared error (NMSE) loss function, demonstrating a substantial reduction in RMSE for the hypoglycemia range. However, comprehensive details regarding its superiority over MSE in detecting hypoglycemia are lacking, and its performance in the hyperglycemia range is not addressed.

Data imbalance

There is a significant imbalance in glucose data distribution across the hypoglycemia, hyperglycemia, and normal ranges. Typically, only 2–10% of glucose readings fall into the hypoglycemia range, while about 30–40% fall in the hyperglycemia range. From a statistical perspective, this presents a classic case of an imbalanced regression problem. While numerous approaches have been developed to tackle imbalanced data in the classification setting, very few works in the literature address imbalanced regression problems such as ours49. Existing approaches for imbalanced regression can be broadly categorized into two types:

Sampling-based approaches

These methods attempt to either undersample the high-frequency values or oversample the low-frequency (rare) values. However, determining the ‘notion of rarity’ in a regression problem is challenging compared to a classification task. Oversampling may lead to overfitting, whereas undersampling may result in sub-optimal performance due to the loss of key information. Chawla et al.50 proposed an approach that generates synthetic samples by combining oversampling and undersampling of the training data for classification tasks.

Cost-sensitive approaches

These approaches introduce a penalizing scheme during training that enables the model to handle outlier values (low-frequency or rare values), enhancing its effectiveness in predicting within those ranges. The recent success of this approach51,52,53 motivates us to explore it further through our customized loss function.

Federated learning for healthcare

Federated learning (FL) has substantial disruptive potential in healthcare, a domain constrained by sensitive data and strict regulations such as the Health Insurance Portability and Accountability Act (HIPAA) in the US. The reluctance of healthcare entities to share sensitive data has fueled the adoption of FL in various healthcare applications, such as medical image processing54,55,56, IoT-based smart healthcare applications57, managing electronic health records (EHR)58, disease prediction59,60,61, predicting hospitalizations and mortality62,63,64, and natural language processing of clinical notes65,66,67. Several researchers have comprehensively reviewed federated learning in healthcare68,69,70,71.

In the diabetes literature, FL has recently gained traction. A recent study proposed a deep learning approach in the Diabetes Management Control System (DMCS)72, using 30 virtual subjects from the FDA-approved UVA/Padova type 1 diabetes simulator. Features such as past glucose values, carbohydrate intake, and insulin-on-board were used as inputs to the model. The findings clearly demonstrate the superior performance of the federated model over the local models.

Another study employed an FL-inspired Evolutionary Algorithm (EA) for classifying glucose values into different risk categories73 using data from 12 patients in the OhioT1DM dataset. The input features include previous CGM readings, carbohydrates, and insulin data. The results indicate improved performance over local models using an FL-based EA. However, the study’s limitations lie in its small sample size of only 12 patients and in not addressing real-time glucose-related risk prediction and its clinical significance.

A decentralized privacy-protected federated learning approach was applied to predict diabetes-related complications using patient-related comorbid features extracted from International Classification of Disease (ICD) codes in real-world clinical datasets74. For this, a logistic regression model and 2-layer and 3-layer multilayer perceptron models were proposed and compared against a centralized (population-level) model. The study addressed class imbalance using techniques like under-sampling, oversampling, and balancing. The results indicate that models developed through the federated learning framework can achieve promising performance comparable to centralized models.

Personalized federated learning

Federated learning has many unique challenges75,76,77, with one prominent issue being the variation in data distribution across clients in the network, characteristic of non-i.i.d. and imbalanced data78. This is particularly significant in healthcare applications, where individual patients possess diverse demographics and health histories, necessitating adaptive solutions tailored to each participant79.

One of the pioneering works in federated learning with wearable healthcare data is FedHealth80, which introduces personalization through transfer learning. This is achieved by first training a conventional global model in a federated learning framework and later fine-tuning two fully connected layers (through transfer learning) to learn activities and tasks for specific users. The study utilizes publicly available human activity recognition data (accelerometry and gyroscope) from 30 users for multi-class activity prediction. The authors employ a CNN-based deep learning model, comparing its performance against traditional machine learning methods like RF, SVM, and KNN. The results demonstrate a 4% average improvement in performance for personalized models compared to the global model.

Another noteworthy application of personalized federated learning was used for in-home health monitoring81. The authors introduce FedHome, a cloud-edge based federated learning framework, where a shared global model is initially learned from multiple network participants. Individual personalization is achieved using a generative convolutional autoencoder (GCAE), aiming to generate a class-balanced dataset tailored to individual client’s data. FedHome exhibits a notable improvement of over 10% in accuracy compared to a conventional global federated learning model.

A recent study on remote patient monitoring (RPM) introduced the FedStack architecture, a personalized federated learning approach79. The study was based on the MHEALTH dataset with 10 patients82,83. A total of 21 features are extracted from three sensor data types (accelerometry, gyroscope, and magnetometer) to classify 12 different natural activities. Three different model architectures (ANN, CNN, Bi-LSTM) are used in the study. FedStack achieves personalization by aggregating heterogeneous architectural models at the client level, and the approach consistently outperforms both local and global models.

A recent work in personalized federated learning focused on in-hospital mortality prediction62. The study utilized a publicly available electronic health records (EHR) database comprising over 200,000 patients across 208 hospitals in the US. Features extracted from patients’ EHRs were employed for a binary classification problem, with a multi-layer perceptron (MLP) adopted for modeling. The proposed POLA method involves the initial training of a global federated learning model, referred to as the teacher model. In the subsequent step, local adaptation is accomplished through a Genetic Algorithm (GA) approach. Comparative results highlight the superior performance of the POLA approach against the traditional FedAvg84 and two other state-of-the-art personalized federated learning architectures85,86.

For CGM-based glucose prediction in T1D populations, a recent paper proposes an asynchronous and decentralized federated learning approach in which future glucose trajectories are predicted from historical CGM values87. The proposed work also provides a personalized model from which individual patients can benefit, and it is tested on multiple publicly available datasets of adult T1D patients. Another recent approach focused on privacy-preserving glycemic management for T1D uses a federated reinforcement learning framework, where the goal was to maximize the time-in-range (TIR) and reduce glycemic risk scores88. That study was conducted on data generated for 30 individuals using the UVA/Padova T1D simulator. Federated learning holds great potential for overcoming privacy challenges and delivering personalized predictions that can benefit individuals with T1D.

Methods and materials

This section describes the datasets used in this study, data processing steps and the experimental setup. Further, we describe the prediction model, and the different frameworks used in this study.

Clinical datasets and preprocessing

Ohio T1DM

The Ohio T1DM89 dataset was publicly released in two batches (2016 and 2018) with a total of 12 participants. During the 8-week study period, all 12 participants wore a Medtronic Enlite CGM for collecting glucose readings, a Medtronic 530G/630G insulin pump, and an Empatica/Basis sensor for physiological data collection. For our analysis, we only focus on the CGM data. The patient data was collected in free-living conditions.

TCH study

This is a proprietary dataset collected at Texas Children’s Hospital, Houston, TX, from a total of 113 T1D patients using Dexcom CGM devices. The data for each patient spans a period of 30–90 days. Additional information about this dataset is available in our previous publications25. Similar to the OhioT1DM dataset, the data for all patients was collected under free-living conditions.

Comprehensive details about these two datasets are provided in Table 1. Fig. 1 illustrates glycemic excursions across all patients, specifically hypoglycemia and hyperglycemia profiles. The dotted grey lines on the x-axis and y-axis represent the median hypoglycemia and hyperglycemia percentages, respectively. It is evident that the prevalence of hyperglycemia (mean $\bar{x}=41.7\%$, median $\tilde{x}=22.7\%$) is considerably higher than that of hypoglycemia ($\bar{x}=2.3\%$, $\tilde{x}=1.7\%$).

Data preprocessing

Based on the available literature90, data preprocessing involves two main steps: first, replacing ‘Low’ and ‘High’ glucose readings with 40 mg/dL and 400 mg/dL, respectively; second, imputing missing glucose readings through linear interpolation when fewer than six consecutive readings are missing. This threshold is chosen based on the literature on autocorrelation between glucose readings27,91. After interpolating missing values, consecutive sequences of glucose readings covering the last two hours (24 readings) are taken as individual samples. This time window is selected based on our previous work on glucose prediction12. Any samples with missing values at this stage are excluded from our analysis. All preprocessing steps follow standard practices in the literature for analyzing CGM data92 (a minimal sketch of these steps is given below). For our analysis, we compare three different model types/frameworks (Fig. 3), which differ in the data used for training and in the training process, as described in the following sections.
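The following sketch illustrates the two preprocessing steps described above, assuming a regular 5-minute CGM sampling grid; the function and variable names are ours, not from the original studies.

```python
import numpy as np
import pandas as pd

def preprocess_cgm(raw: pd.Series, max_gap: int = 6) -> pd.Series:
    """Sketch of the preprocessing: map 'Low'/'High' sensor codes to
    40/400 mg/dL, then linearly interpolate gaps of fewer than
    `max_gap` consecutive missing readings (longer gaps stay NaN)."""
    glucose = pd.to_numeric(raw.replace({"Low": 40, "High": 400}),
                            errors="coerce")
    filled = glucose.interpolate(method="linear", limit_area="inside")
    # Identify runs of missing readings and undo the fill for long gaps.
    run_id = glucose.isna().ne(glucose.isna().shift()).cumsum()
    run_len = glucose.isna().groupby(run_id).transform("sum")
    filled[glucose.isna() & (run_len >= max_gap)] = np.nan
    return filled
```

Samples whose windows still contain NaNs after this step are excluded, as noted above.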

Table 1 Datasets used in the study.
Fig. 1
figure 1

Glucose profile for patients in TCH and OhioT1DM dataset.

HH loss

Our custom loss function is derived from the principle of the Taguchi loss function93,94 and is motivated by two primary needs: (a) enhancing penalties for errors in the glycemic excursion regions, and (b) simultaneously balancing these penalties to account for the uneven distribution of samples across the hypoglycemia, normal, and hyperglycemia glucose ranges. For the first goal, we add a polynomially increasing penalty that grows as glucose readings deviate further from the normal range. To achieve the second objective, i.e., to account for the disproportionate sample distribution across the three ranges, we introduce a tuning parameter $\alpha$. This parameter ensures that, while errors in the glycemic excursion regions are reduced, the predictive performance on overall glucose readings is not compromised.

$$HH\:Loss=\begin{cases}SE, & 70\le Y_i\le 180\\ SE+\alpha\cdot penalty, & Y_i<70\\ SE+(1-\alpha)\cdot penalty, & Y_i>180\end{cases}$$
(1)
$$\begin{aligned}Squared\:Error\:(SE)&={\left(Y_i-\hat{Y_i}\right)}^2,\\ penalty&=\left|Y_i-\hat{Y_i}\right|\cdot{\left(Y_i-c\right)}^2,\\ \alpha&\in(0,1),\\ c&=125\:\text{(midpoint of the glycemic excursion boundaries)}\end{aligned}$$

As outlined in Eq. (1), for values falling within the hypoglycemia and hyperglycemia ranges, a penalty proportional to the squared distance between the glucose value and the midpoint of the glycemic excursion boundaries is applied. We also provide a visual intuition behind the HH loss function in Fig. 2. There is no additional penalty for errors in the normoglycemia range, but for different values of $\alpha$ there is a significant additional penalty in the hypoglycemia and hyperglycemia regions, forcing the model to perform better in these critical excursion regions. The value of $\alpha$ is tunable to adjust for varying data distributions. For a more detailed derivation, please refer to the Appendix.

Fig. 2
figure 2

Intuitive explanation of the HH loss function and additional penalty imposed on errors in the glycemic excursion regions.
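As an illustration, Eq. (1) translates directly into a TensorFlow loss; this is a minimal sketch based solely on the equation above, and the function and argument names are ours.

```python
import tensorflow as tf

def hh_loss(alpha: float = 0.5, low: float = 70.0, high: float = 180.0,
            c: float = 125.0):
    """Sketch of the HH loss in Eq. (1): squared error everywhere, plus an
    alpha-weighted penalty in the hypo-/hyperglycemia ranges."""
    def loss(y_true, y_pred):
        se = tf.square(y_true - y_pred)                    # squared error term
        penalty = tf.abs(y_true - y_pred) * tf.square(y_true - c)
        hypo = tf.cast(y_true < low, y_true.dtype)         # indicator: Y_i < 70
        hyper = tf.cast(y_true > high, y_true.dtype)       # indicator: Y_i > 180
        return tf.reduce_mean(se + alpha * hypo * penalty
                              + (1.0 - alpha) * hyper * penalty)
    return loss
```

The returned closure can be passed to `model.compile(loss=hh_loss(alpha))` like any Keras loss.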

Federated learning

Federated learning is a distributed learning framework that enables the training of machine learning models without transmitting sensitive user data to a central server. It relies on collaborative learning between participating users, where each user trains a shared model locally on its own data through multiple rounds of optimization. Only model characteristics such as parameters, weights, or gradients are shared among users in the network. Depending on the FL network topology, this sharing can occur directly among the participating users or through a central server. After receiving and aggregating the shared model characteristics from all users (e.g., via simple or weighted averaging), the aggregated information is relayed back to the users for further optimization on their individual local data. This iterative communication continues through multiple rounds until a stopping criterion (e.g., convergence) is met. All user data remains stored locally, and only specific model characteristics are shared during this collaborative learning process. This approach reduces network and communication costs, as model parameters are much smaller than the actual training data; provides increased accessibility and comparable accuracy to conventional machine learning models; and ensures a degree of privacy. The decentralized nature of federated learning allows patients and clinicians to benefit from more accurate and reliable models trained across a more extensive and diverse data pool.

A general formulation of federated learning can be expressed as follows:

$$\begin{aligned}&\min_{\phi}\:L(X_k;\phi),\\ &L=\text{global loss, aggregated from the local losses }\{L_k\}_{k=1}^{K},\\ &X_k=\text{private glucose data of patient }k\end{aligned}$$
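To make the aggregation step concrete, here is a minimal sketch of the sample-weighted averaging rule used in FedAvg84; the function name and signature are ours.

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Weighted average of per-client model weights, proportional to each
    client's number of training samples (the FedAvg aggregation rule)."""
    total = float(sum(client_sizes))
    aggregated = [np.zeros_like(w) for w in client_weights[0]]
    for weights, n_samples in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            aggregated[i] += (n_samples / total) * w
    return aggregated
```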

Glucose prediction specific problem formulation

In general, the input to a data-driven algorithm for predicting future glucose levels primarily consists of historical glucose readings ($g$) and, if available, other features (physiological data, insulin, carbohydrate intake, etc.). In our case, we predict future glucose readings based solely on historical glucose readings. More specifically, we learn patterns from glucose readings observed in the past two hours to predict glucose readings 30 minutes into the future, a horizon that gives individuals time to intervene and prevent glycemic excursions. Fig. 3 provides a graphical representation of the different learning paradigms that we use in our analysis.
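Assuming 5-minute CGM sampling, the sample construction described above (24 past readings as input, the reading 30 minutes ahead as target) can be sketched as follows; names are ours.

```python
import numpy as np

def make_samples(glucose: np.ndarray, window: int = 24, horizon: int = 6):
    """Slide over a patient's CGM trace: each sample is the previous
    `window` readings (2 h at 5-min intervals); the target is the reading
    `horizon` steps (30 min) ahead. Windows containing NaNs are dropped."""
    X, y = [], []
    for t in range(window, len(glucose) - horizon + 1):
        past, future = glucose[t - window:t], glucose[t + horizon - 1]
        if not (np.isnan(past).any() or np.isnan(future)):
            X.append(past)
            y.append(future)
    return np.array(X), np.array(y)
```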

Fig. 3
figure 3

Visual representation of the different model types/frameworks in this work.

Local model

Data is stored locally with the patient and not shared with other entities, such as a central server or other patients in the network. While this approach ensures full data privacy, the data available for training the model is limited to the individual patient, risking suboptimal performance. As shown in Fig. 3, for local models, all data storage and model training occur locally, and a central server plays no role.

Central model

Unlike the local model, in this setting patient data from all individuals is shared with a central server, where a single joint model is trained on data from all individuals in the network. While the central model benefits from a large pool of training data, it offers minimal privacy to individual patients, as their entire data corpus is shared with the central server. As shown in Fig. 3, all data storage and model training occur at the central server.

Federated (global) model

The federated model (sometimes also called the global model) combines the advantages of the local and central models, bringing together the robust prediction capabilities of the central model and the data privacy of the local models. In this approach, data storage and model training occur locally on the patient’s device. During training, each patient optimizes the model parameters locally on their own data, and the model weights/parameters are shared periodically with a central server for aggregation. These aggregated weights are then returned to individual patients and serve as initializers for further training. The actual data is never shared with other patients in the network or with a global server, maintaining privacy. This is known as model development in a federated learning framework84. Models in this framework achieve better prediction performance through shared learning with other patients while preserving privacy. However, global models may not always be the most suitable for every entity (here, patient) in the network. To address this, we extend the global model via a fine-tuning step, personalizing it for each individual patient with the custom HH loss function.

FedGlu model

FedGlu is an extension of the federated model that personalizes the model for individual patients. A simple workflow is presented in Fig. 4, and the holistic mechanism of the FedGlu algorithm is presented in Fig. 3. The central server initializes the global model with random weights and broadcasts it to each participating patient in the network. Each patient trains this globally shared model on their own data using the MSE loss function and, once trained for a fixed number of epochs, sends the trained weights to the central server for aggregation. The central server thus receives updated model weights from each patient, trained on that patient’s own local data; this constitutes one communication round of the federated learning step. The server aggregates the model weights received from all patients, checks whether convergence is reached, and broadcasts the updated global model back to the patients for further training. This process continues until the global model converges, after which the central server sends the final global model to all participating patients. Each patient then fine-tunes this final global model one more time, now using the HH loss function. This fine-tuning step is asynchronous and achieves the desired personalization, leveraging data across all patients to train a robust model while adjusting for each patient’s unique local data distribution.

Fig. 4
figure 4

Flowchart steps involved in executing FedGlu.
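Putting these pieces together, the two-phase FedGlu workflow can be sketched end-to-end as follows. This is a minimal illustration, not the authors’ exact implementation: it reuses the `hh_loss` and `fedavg_aggregate` sketches above, the MLP mirrors the architecture described in the next subsection, and the number of local epochs per round and fine-tuning epochs are our assumptions.

```python
import tensorflow as tf

def make_model():
    # MLP mirroring the 'Model architecture' subsection below.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(24,)),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

def fedglu(client_data, rounds=50, alpha=0.5, fine_tune_epochs=5):
    """Phase 1: federated rounds with the MSE loss (FedAvg aggregation).
    Phase 2: asynchronous per-patient fine-tuning with the HH loss."""
    global_model = make_model()
    for _ in range(rounds):
        updates, sizes = [], []
        for X, y in client_data:          # one communication round
            local = make_model()
            local.set_weights(global_model.get_weights())
            local.compile(optimizer="adam", loss="mse")
            local.fit(X, y, epochs=1, batch_size=500, verbose=0)
            updates.append(local.get_weights())
            sizes.append(len(X))
        global_model.set_weights(fedavg_aggregate(updates, sizes))
    personalized = []
    for X, y in client_data:              # fine-tune on local data only
        model = make_model()
        model.set_weights(global_model.get_weights())
        model.compile(optimizer="adam", loss=hh_loss(alpha))
        model.fit(X, y, epochs=fine_tune_epochs, batch_size=500, verbose=0)
        personalized.append(model)
    return personalized
```

Note that with a server-side SGD learning rate of 1, as used in our experiments below, the server update reduces to exactly this direct weighted averaging of client weights.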

Model architecture

A multilayer perceptron model is used for the analysis in this study. The input layer consists of the glucose readings observed in the last 2 h. This input, with dimensions $24\times 1$, is fed into a dense layer of 512 neurons with a rectified linear unit (ReLU) activation function, followed by two hidden layers of 256 and 64 neurons, both with ReLU activations. The final output layer takes its input from the third dense layer and produces a single-number prediction of the future glucose reading.

Validation approach

In line with our prior research12,14, we implement a 5-fold validation strategy with temporally partitioned splits. This strategy guarantees that the training and testing splits are derived from non-overlapping time windows, mitigating potential biases from temporal correlations that could lead to overly optimistic results. This methodology aligns with the BGLP Challenge, where the initial days of data are used for training and the subsequent days constitute the hold-out test set; in our work, we extend the same scheme across multiple splits to make it more robust.
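One way to realize such temporally partitioned splits is scikit-learn’s `TimeSeriesSplit`, which always places the test window after the training window; this is a suggested reproduction of the scheme, not necessarily our exact implementation.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.random.rand(1000, 24)   # placeholder: chronologically ordered samples
y = np.random.rand(1000)       # placeholder targets

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Training indices always precede test indices in time, so no
    # future information leaks into the training window.
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
```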

Evaluation metrics

We illustrate the efficiency of our approach using standard metrics in glucose prediction: root mean squared error (RMSE), along with Clarke’s Error Grid (Fig. 5)95 to evaluate the clinical significance of the predictions. RMSE is defined as:

$$\begin{aligned}RMSE&=\sqrt{\frac{\sum_{i=1}^{N}{\left(y_i-\hat{y_i}\right)}^2}{N}},\\ N&=\text{number of data points},\\ y_i&=\text{true value},\\ \hat{y_i}&=\text{predicted value}\end{aligned}$$
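Since results are later reported separately for the hypoglycemia, hyperglycemia, and overall ranges, a small helper along the following lines computes the range-wise RMSE; the range boundaries (70/180 mg/dL) follow the definitions used throughout the paper, while the function itself is our sketch.

```python
import numpy as np

def rangewise_rmse(y_true, y_pred, low=70.0, high=180.0):
    """RMSE over the hypo-, normo-, hyperglycemia and overall ranges,
    with range membership determined by the true glucose value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    masks = {
        "hypo": y_true < low,
        "normal": (y_true >= low) & (y_true <= high),
        "hyper": y_true > high,
        "overall": np.ones_like(y_true, dtype=bool),
    }
    return {name: float(np.sqrt(np.mean((y_true[m] - y_pred[m]) ** 2)))
            for name, m in masks.items() if m.any()}
```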

Clarke’s Error Grid Analysis (CEGA) is illustrated in Fig. 5, and the definitions of each zone are provided in Table 2.
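For reference, one commonly used parameterization of the CEGA zone boundaries is sketched below, following widely circulated re-implementations of Clarke et al.95; the exact boundary conditions are our assumption from those implementations and should be verified against the original publication.

```python
def cega_zone(ref: float, pred: float) -> str:
    """Assign a (reference, predicted) glucose pair in mg/dL to a Clarke
    Error Grid zone, using one common boundary parameterization."""
    if (ref <= 70 and pred <= 70) or abs(ref - pred) <= 0.2 * ref:
        return "A"  # clinically accurate
    if (ref >= 180 and pred <= 70) or (ref <= 70 and pred >= 180):
        return "E"  # erroneous: opposite treatment
    if (70 <= ref <= 290 and pred >= ref + 110) or \
       (130 <= ref <= 180 and pred <= (7 / 5) * ref - 182):
        return "C"  # overcorrection
    if (ref >= 240 and 70 <= pred <= 180) or \
       (ref <= 175 / 3 and 70 <= pred <= 180) or \
       (175 / 3 <= ref <= 70 and pred >= (6 / 5) * ref):
        return "D"  # failure to detect an excursion
    return "B"      # benign errors
```

The share of points classified as ‘D’ or ‘E’ corresponds to the missed excursions reported in the Results.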

Table 2 Clarke’s Error Grid Analysis (CEGA) zone-wise description.
Fig. 5
figure 5

Reference Clarke’s Error Grid.

Experimental setup

To provide robust estimates of our methodology and support the enlisted contributions with results, we consider the following experiments:

  1. Setup 1 – Advantage of the HH loss function: The performance of the HH loss function is compared against the standard mean squared error (MSE) loss. We demonstrate the advantage in terms of standard regression metrics and the clinical significance of glucose predictions. We compare the HH loss function results across two separate datasets, namely the TCH study data and the OhioT1DM dataset, using a central (population-level) model to obtain the results.

  2. Setup 2 – Usefulness of personalized federated models: We compare performance across the different model types: central model, local model, global federated model, and personalized federated model. The comparison is made independently across both datasets. This setup demonstrates the robustness of ML models trained in a federated learning framework against the central and local models.

Model training/hyperparameter tuning

All models at the central, local, and federated (global and personalized) levels share identical architecture and parameters. The learning rate was fixed at 0.001 with a constant batch size of 500. Training was set to a maximum of 50 epochs, with early stopping at a patience of 10 epochs to prevent redundant training and thereby reduce computational time. The global federated model is trained with TensorFlow Federated. The client optimizer was set to ‘Adam’ with a learning rate of 0.001 and the server optimizer to ‘SGD’ with a learning rate of 1 to mimic the baseline federated model84. The number of communication rounds was set to 50, and a simple weighted average, proportional to the number of samples held by each patient (node) in the network, was used as the aggregation function. The global federated model is saved after the training loss (MSE) converges. Individual patients (nodes) then fine-tune this saved global federated model on their individual local data with an ‘Adam’ optimizer and a learning rate of 0.001; this fine-tuning step, however, uses the custom HH loss function. For comparing results, the local-level and (personalized) federated-level models, each with a specific $\alpha$, are selected based on the training data where the combined RMSE (across hypoglycemia and hyperglycemia) is minimal. At the central level, a single $\alpha$, which showed the maximum combined reduction across all patients, was used.
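As an illustration of this selection procedure, the sketch below sweeps candidate $\alpha$ values with the stated hyperparameters (learning rate 0.001, batch size 500, up to 50 epochs, patience 10) and keeps the $\alpha$ minimizing the combined excursion RMSE; it reuses `make_model`, `hh_loss`, and `rangewise_rmse` from earlier, and the candidate grid is our assumption.

```python
import tensorflow as tf

def select_alpha(X_train, y_train, X_val, y_val,
                 alphas=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the alpha whose model minimizes the combined RMSE over the
    hypoglycemia and hyperglycemia ranges on held-out training-side data."""
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                                  patience=10,
                                                  restore_best_weights=True)
    best_alpha, best_score = None, float("inf")
    for alpha in alphas:
        model = make_model()
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                      loss=hh_loss(alpha))
        model.fit(X_train, y_train, validation_data=(X_val, y_val),
                  batch_size=500, epochs=50, callbacks=[early_stop], verbose=0)
        rmse = rangewise_rmse(y_val, model.predict(X_val, verbose=0).ravel())
        score = rmse.get("hypo", 0.0) + rmse.get("hyper", 0.0)
        if score < best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```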

Results

This section presents the results of the experimental setups described above: the advantage of the HH loss function, and the performance of models trained in a federated learning framework compared to central and local models.

Advantage of HH loss function

In the first analysis, we assess how the HH loss function enhances prediction performance in the glycemic excursion regions while maintaining clinical significance for overall predictions, based on two independent datasets. For this, we consider a centralized (population-level) model development setting. Since a common model is trained across all patients in a dataset, we choose a single ‘α’ value ($\alpha=1$) that showed the maximum combined improvement (reduction in RMSE) across the hypoglycemia and hyperglycemia ranges on the training dataset.

TCH study dataset

Fig. 6 (left) gives a performance comparison between the HH loss and the baseline mean squared error (MSE) loss. The HH loss function exhibits a 52% reduction in root-mean-squared error (RMSE) in the hypoglycemia region compared to MSE, while showing similar performance in the hyperglycemia region. Although there is a slight dip in overall prediction performance (in terms of RMSE values), the clinical impact evaluated through Clarke’s EGA is negligible and, on the contrary, better with the HH loss than with MSE.

In our evaluation using Clarke’s Error Grid (Table 3) to gauge the clinical relevance of the predictions, the HH loss exhibited superior performance, failing to detect an average of only 0.51% of excursions, compared to 2.07% with MSE – a substantial 75% reduction. This refers to Zones D + E of CEGA; predictions falling in this region are analogous to false negatives in a confusion matrix for detecting hypoglycemia and hyperglycemia. Excursion detection is critically important, as it allows patients to take intervention measures based on the prediction. Additionally, overall glucose predictions within the combined Zones A + B improved from 97.94% to 99.39%. These findings underscore that, while the HH loss introduces a bias towards enhancing accuracy in the excursion regions (increasing predictions in Zone C), it does not detrimentally affect the clinical accuracy of overall glucose predictions.

Fig. 6
figure 6

Performance comparison: MSE vs. HH vs. NMSE for (left) TCH study data and (right) OhioT1DM dataset.

Table 3 Summary of Clarke’s Error Grid Analysis (CEGA) performance for different loss functions.

Ohio T1DM dataset

For the OhioT1DM dataset, Fig. 6 (right) shows a similar performance comparison: the HH loss exhibits an RMSE of 15.12 in the hypoglycemia region, 41% lower than with MSE, whereas in the hyperglycemia region the RMSE remains relatively consistent. As with the TCH study data, there is a decline in overall prediction performance in terms of RMSE values; however, this drop is trivial when assessed for clinical significance with CEGA. The number of data points in Zones D + E reduces from 2.05% with MSE to 0.32% with the HH loss (an 84% reduction), indicating an increase in the model’s ability to predict excursions. Concurrently, the percentage of data points in Zones A + B increases from 97.89% to 99.59%, suggesting an improvement in overall predictions without compromise.

Model training in a federated learning framework

The predictive capabilities of models incorporating the HH loss function were extensively evaluated at the local, central, and federated levels. Fig. 7 presents the magnitude of the difference (measured in RMSE) in the various glycemic excursion regions. Table 4 compares the different model types through Clarke’s Error Grid Analysis, while Table 5 outlines improvements in terms of the number of patients for RMSE and CEGA. Compared to local models, federated models simultaneously exhibit improvements of 16.67% (in RMSE) for predicting hypoglycemia and 18.91% (in RMSE) for predicting hyperglycemia – both statistically significant under the paired t-test ($p\ll 0.01$). For the TCH study data, federated models simultaneously improved over local models for 96 (out of 113) patients in the hypoglycemia region and 110 (out of 111) patients in the hyperglycemia region. For the OhioT1DM dataset, federated models improved by 9% (in RMSE) for predicting hypoglycemia and 33.29% (in RMSE) for predicting hyperglycemia; the improvement in hyperglycemia is statistically significant ($p\ll 0.01$), but the improvement in hypoglycemia is not ($p=0.18$). For the OhioT1DM dataset, federated models improve over the local models for 9 (out of 12) patients in the hypoglycemia region and 12 (out of 12) patients in the hyperglycemia region. Across the two datasets, there is an improvement of 12.37% (in RMSE) for predicting hypoglycemia and 29.05% (in RMSE) for predicting hyperglycemia – both statistically significant ($p\ll 0.01$).

When evaluated for clinical significance through Clarke’s Error Grid, it is evident that the HH loss function improves the prediction of glycemic excursions (Zones D + E) at all levels without compromising the clinical significance of overall predictions (Zones A + B) across both datasets. Comparing local and federated models, we see a 37% reduction ($p\ll 0.01$) for the TCH study data and a 31% reduction ($p\ll 0.01$) for the OhioT1DM dataset in points falling in Zones D + E (compared to MSE), signifying an increase in the detection capability for glycemic excursions. For predictions falling in Zone C of CEGA, there is a 50% reduction for the TCH study data and a 20% reduction for the OhioT1DM dataset, signifying fewer false predictions with the federated models. Table 4 also provides evidence that, when compared through CEGA, federated models improve over local models in all regions for almost all patients across the two datasets.

On the other hand, central models distinctly outperform federated models for both hypoglycemia and hyperglycemia owing to their advantage of a significantly larger training pool, as is evident in the evaluation with RMSE and CEGA. However, federated models come much closer to central-model performance than the local models do. Table 5 outlines the improvements of the federated models against the central and local models for the TCH and OhioT1DM datasets.

Fig. 7
figure 7

Comparing improvements of the federated model against local and central models for hypoglycemia and hyperglycemia.

Table 4 Summary of Clarke’s Error Grid Analysis (CEGA) performance for different model types.
Table 5 Federated model improvements over (a) local and (b) central models.

Discussion

Based on the results, the main contributions of our work can be summarized as:

  1. 1.

Introduction of a novel HH loss function aimed at improving glucose predictions in the excursion regions while ensuring clinical significance.

  2. 2.

Model implementation with the HH loss function within a federated learning framework, balancing performance in the excursion regions and privacy.

Insights and observations

Models with the HH loss function predict more accurately, and with greater clinical significance, in the glycemic excursion regions at the local, central, and federated levels. This may come at the cost of sub-optimal RMSE values over the overall glucose range. However, when compared for clinical significance, there is no reduction; on the contrary, overall predictions in Zones A + B improve. A detailed evaluation with CEGA for the different model types, including local, central, and federated models, is presented in Table 6. A further expansion of the CEGA zones for Table 6 is provided in Appendix 1.

In the proposed HH loss function, the parameter ‘α’ plays a vital role in penalizing the errors and achieving balance across the three different glucose regions. For the local and federated models, it can be chosen optimally per patient, as model training is done locally. In the case of the central model, however, a single value must be chosen for the entire cohort, which may benefit the majority of patients but not all of them. We further explore the impact of choosing different ‘α’ values for the central model below.

Also, when comparing model performance across the different model types, we see that the federated model shows varying levels of difference relative to the local and central models. We investigate these improvements further with respect to patients’ glycemic profiles to determine who is likely to benefit the most.

Table 6 Summary of Clarke’s Error Grid Analysis across different (a) model types and (b) loss functions.

Impact of the ‘α’ parameter

The parameter ‘α’ can be tuned to achieve well-balanced, optimal prediction performance for the hypoglycemia and hyperglycemia excursion regions simultaneously. A higher ‘α’ value prioritizes performance in the hypoglycemia region, whereas a lower ‘α’ emphasizes performance in the hyperglycemia region. Fig. 8 illustrates the average improvement (in RMSE) across patients in both glycemic excursion regions simultaneously for various ‘α’ values. The red curve, representing the improvement in hypoglycemia values, shows an exponential increase relative to MSE as ‘α’ increases. In contrast, the blue curve, indicating the improvement in hyperglycemia values, exhibits a downward trend with increasing ‘α’. However, the slope for hypoglycemia values is significantly steeper than for hyperglycemia values because of the high data imbalance, especially for hypoglycemia values (median: 1.64%) as compared to hyperglycemia values (median: 43%). This also highlights the greater difficulty of accurately predicting hypoglycemia values compared to hyperglycemia values. The parameter ‘α’ can be customized for local and federated models based on individual preferences and glycemic profiles to yield optimal results. Table 7 provides performance metrics (for central models) in terms of Clarke’s EGA for different ‘α’ values.

Fig. 8
figure 8

Improvement in glycemic excursion regions with HH loss over MSE for the Central Model.

Table 7 Central model performance with different values of the ‘α’ parameter (TCH study + Ohio T1DM data).
Table 8 Summary of Clarke’s Error Grid Analysis (CEGA) performance for the MSE, NMSE, and HH loss functions.
Table 9 Comparison with state-of-the-art methods on the OhioT1DM dataset.

Federated models: who benefits the most?

We compare the performance improvements achieved with federated models against central and local models (all with the HH loss) across varying glycemic profiles in both the hypoglycemia and hyperglycemia regions. To categorize patients effectively, they are binned into ten groups based on the percentage of hypoglycemia or hyperglycemia values in their profile. For hypoglycemia, these intervals include (0, 0.22]%, (0.22, 0.5]%, and so forth, while for hyperglycemia, the intervals are (0, 7.81]%, (7.81, 19.31]%, and so on.

In the context of hypoglycemia prediction (Fig. 9, left), a notable trend emerges: as the percentage of hypoglycemia values increases, the prediction performance of the federated, central, and local models converges. Patients with higher hypoglycemia percentages exhibit similar prediction performance across the local, central, and federated models. However, for patients with extremely low instances of hypoglycemia, central models far outperform local and federated models, because the training data for central models contains many instances of hypoglycemia to learn from, an advantage that local and federated models lack. The Spearman correlation between the mean improvement (in RMSE) with federated models and increasing hypoglycemia (interval) values is statistically significant against both local ($p=0.01$) and central models ($p=0.02$). The variance shows a similar trend and is statistically significant against both local ($p=0.01$) and central ($p=0.01$) models. It is also observed that prediction performance with federated models is closer to the central models while maintaining a clear advantage over the local models for close to 50% of the patients.

In contrast, for hyperglycemia prediction (Fig. 9, right), no discernible pattern emerges across the different model types (federated, central, and local) relative to the hyperglycemia profile of patients. The Spearman correlations for improvements with federated models over local ($p=0.7$) and central models ($p=0.1$) are not statistically significant. Furthermore, the performance disparity across these three model types is minimal for hyperglycemia prediction relative to hypoglycemia. A major contributing factor is the substantial difference in the number of hypoglycemia glucose values compared to hyperglycemia glucose values: the highest bin for hypoglycemia is (4.63, 12.13]% with a median of 1.64%, whereas the lowest bin for hyperglycemia is (0, 7.81]% with a median of 43%.

Beyond the presence of hypoglycemia and hyperglycemia samples in the data, we evaluated the impact of individual profile factors, including gender, duration of diabetes, and HbA1c values, on model performance. None of these factors showed any visual or statistical difference in performance between the central and federated models. This further corroborates that the glycemic profile of an individual, i.e., the distribution of hypoglycemia and hyperglycemia samples within a patient, has the biggest impact on the performance of the different model types.

Fig. 9
figure 9

Federated Model performance vs. Local and Central Models across different patient profiles in (left) hypoglycemia region and (right) hyperglycemia region.

Comparison with literature

We comprehensively compared the HH loss function with the recent state-of-the-art NMSE (normalized mean squared error) loss, which showed significant performance improvement in the hypoglycemia range. The comparison is made on the standard evaluation metrics of RMSE and Clarke’s EGA. On the TCH study data, the HH loss improves performance over NMSE (Fig. 10, left) by 29% ($p\ll 0.01$) in the hypoglycemia region and by 5% ($p\ll 0.01$) in the hyperglycemia region. Similarly, for the OhioT1DM dataset (Fig. 10, right), the HH loss improves performance (in RMSE) by 21% ($p\ll 0.01$) in the hypoglycemia region and by 6% ($p\ll 0.01$) in the hyperglycemia region. Moreover, when compared using Clarke’s EGA, models with the HH loss function fail to detect only 0.51% of glycemic excursions compared to 1.47% with NMSE (a 65% reduction). We also see an increase in points falling in Zones A + B with the HH loss function compared to either MSE or NMSE (Table 8).

The OhioT1DM dataset, used in the BGLP challenge, is a standard dataset for comparing glucose prediction performance. While the BGLP challenge and other literature aim to show accurate predictions of overall glucose values, our work focuses explicitly on improved performance in the glycemic excursion regions (hypoglycemia and hyperglycemia), making direct comparisons challenging. Nevertheless, we situate our work relative to other notable studies based on CEGA, which provides a standard and fair way to assess clinical significance.

Table 9 compares studies achieving state-of-the-art performance on the OhioT1DM dataset. Remarkably, our simple MLP model, with several hundred times fewer training parameters, outperforms or matches the top-performing models reported. Additionally, federated models, despite lacking access to the entire training data, show promising results compared to other state-of-the-art methods proposed in the literature.

Fig. 10
figure 10

Performance comparison: MSE vs. HH vs. NMSE for (left) TCH study data and (right) Ohio T1DM.

Clinical applications

T1D requires rigorous management, as hypoglycemia can lead to immediate life-threatening consequences while chronic hyperglycemia can result in hybrid diabetes (both T1D and T2D), severely complicating diabetes management. Type 1 diabetes care prioritizes preventing impending hypoglycemia events while limiting chronic hyperglycemia, yet many individuals still miss the recommended target of > 70% time-in-range. CGM provides short-horizon glucose predictions, which are critical for diabetes care. Our framework is designed for future clinical integration in a way that preserves privacy and fits existing data flows. In a multi-site setting, FL models have demonstrated success at hospital scale in the EXAM study across 20 institutions, improving external generalization without pooling protected data114. Multiple studies have suggested the utility of a federated learning framework by showing its technical compatibility with existing IT and clinical operations115. Specifically for diabetes workflows, CGM data are already integrated within EHRs, where providers can access patients’ historical CGM trends116. In such settings, sharing model parameters rather than centralizing protected health information is of paramount importance. Individual patient participation in periodic federated rounds can be managed within current governance structures (model versioning, convergence checks), and outputs can be written to the EHR through standard interfaces. The tunable ‘α’ parameter allows individual-level personalized priorities and glycemic targets to be met in practice. This positioning enables future integration without altering existing devices or centralizing patient data.

Limitations and future work

We introduce a bias towards the glycemic excursion regions to enhance performance in the hypoglycemia and hyperglycemia ranges. As a result, the prediction error for overall glucose does increase; however, we mitigate this limitation by ensuring that clinical significance is not compromised. Additionally, when using the HH loss, the federated models undergo a two-step construction process: (i) initial training with the MSE loss function and (ii) subsequent fine-tuning with the HH loss at the local level. In future work, we aim to implement a single-step personalized federated model combining these two phases. This work uses the baseline federated learning framework to develop the proposed algorithm, FedGlu. There can be many challenges to its adoption in the real world, such as the unavailability of one or more entities in the network, or a corrupt local entity or central server that may compromise the privacy of the framework. However, this work aimed to lay the groundwork for using federated learning in glucose prediction algorithms that can benefit patients. The authors plan to improve the proposed FedGlu by adding privacy guarantees in subsequent work and by tackling the challenge of unavailability among participating entities for improved learning.

Conclusion

In this work, we proposed a novel HH loss function that simultaneously improves predictions in both the hypoglycemia and hyperglycemia regions without compromising overall predictions. This is critical, as the HH loss function improves glycemic excursion detection capabilities by an average of 78% across two datasets (a total of 125 patients) compared to the standard MSE loss. The results were consistent across a proprietary dataset and the publicly available OhioT1DM dataset. We also demonstrated the utility of the HH loss function via FedGlu, a machine learning model trained through a collaborative (federated) learning approach. FedGlu clearly outperforms the local models (35% improved glycemic excursion detection) while approaching central model performance. These results underscore the need to develop machine learning models that combine strong predictive capabilities with privacy preservation for sensitive patient healthcare data.