A Longitudinal Multimodal Dataset of Type 1 Diabetes

Alsuhaymi, Ashwaq; Bilal, Ahmad; García, Daniel Gasca; Kongdee, Rujiravee; Lubasinski, Nicole; Thabit, Hood; Nutter, Paul W.; Harper, Simon

doi:10.1038/s41597-025-05695-1

Data Descriptor
Open access
Published: 07 August 2025

Ashwaq Alsuhaymi¹,
Ahmad Bilal¹,
Daniel Gasca García ORCID: orcid.org/0000-0002-3944-3234¹,
Rujiravee Kongdee¹,
Nicole Lubasinski¹,
Hood Thabit²,
Paul W. Nutter¹ &
Simon Harper¹

Scientific Data volume 12, Article number: 1379 (2025)

This article has been updated

Abstract

People living with Type 1 Diabetes (PwT1D) must continuously monitor blood glucose levels and make critical clinical and safety-related decisions multiple times a day to maintain glycaemic control within recommended ranges. While significant efforts have been made to develop algorithms that assist PwT1D in managing blood glucose more effectively, access to automated insulin delivery (AID) systems remains highly variable across the world. Moreover, there is a lack of publicly available, comprehensive datasets necessary for developing algorithms to support scenarios where AID systems revert to manual mode. This study addresses this gap by providing a detailed, multimodal dataset encompassing five key aspects: blood glucose levels; basal and bolus insulin dosages; nutritional intake (carbohydrates, protein, fat, and fibre content); physical activity (step count, active calories, distance covered, MET, and intensity level); and sleep patterns. The dataset includes longitudinal (3-month) real-world data collected from 17 PwT1D participants. By making this resource available, the study aims to advance algorithm development and improve diabetes management, particularly in settings where AID technology is less accessible.

Background & Summary

Type 1 Diabetes (T1D) is a chronic autoimmune disorder that destroys pancreatic beta cells, resulting in a loss of insulin production and the body’s inability to self-regulate blood glucose levels (BGL)¹. In the United Kingdom (UK), diabetes affects roughly 8% of the population, with approximately 10% of these cases classified as T1D, according to Breakthrough T1D².

Managing diabetes places a substantial financial burden on healthcare resources; nearly 10% of the annual budget of the National Health Service (NHS) in England and Wales is allocated to diabetes care in general³. For T1D specifically, access to advanced technologies such as closed-loop insulin delivery systems is becoming more common in high-income countries, but remains limited in low- and middle-income countries (LMICs) and among individuals without sufficient health insurance coverage⁴.

Chronic complications arising from poor glycaemic control significantly heighten health risks and mortality in PwT1D compared to non-diabetic individuals, making strict glycaemic control essential for mitigating this risk⁵. Consequently, management of T1D requires PwT1D to make multiple daily decisions on monitoring their BGL, administering correct insulin dosages, and managing hypo- or hyperglycaemia when it occurs. Several types of technologies, such as wearable glucose sensors and insulin pump devices, have been developed to help PwT1D improve their glycaemic control while minimally impacting their quality of life⁶. Despite these advances, fewer than 40% of PwT1D achieve the recommended level of glycaemic control required to reduce the risk of complications⁷.

Research indicates that reliable blood glucose predictions could significantly improve the quality of life of PwT1D⁸. Consequently, intelligent diabetes management systems require BGL prediction algorithms that accurately mimic daily glycaemic variability while responding to the spontaneity of everyday life⁶. These predictions must respond to the various factors that affect BGL⁹, including insulin administration, food intake (carbohydrate), physical activity, and sleep patterns¹⁰.

The current state-of-the-art technology for T1D management is automated insulin delivery (AID), also known as closed-loop systems¹¹. These systems automate insulin dosing based on continuous glucose monitoring; however, they can temporarily switch to manual mode (standard insulin pump operation) in cases of connectivity issues or specific manufacturer-defined conditions¹². This study provides a comprehensive dataset capturing real-world data from PwT1D, which could also facilitate the development of algorithms to support clinicians in LMICs, where AID remains less prevalent, expensive, or unavailable.

Publicly available datasets with similar variables, such as HUPA-UCM¹³, Tidepool¹⁴, diaTribe¹⁵, and OhioT1DM¹⁶, provide valuable insights into diabetes management. However, the dataset presented here supports a more comprehensive analysis of long-term glycaemic trends and lifestyle factors. Unlike the 14-day HUPA-UCM dataset, which relies on the Fitbit Ionic and lacks Metabolic Equivalent of Task (MET) and motion intensity tracking, our dataset spans three months and leverages the Garmin Forerunner 45 to capture detailed activity data, including step count, calories burned, distance, METs, motion intensity, and categorised activity types. This richer set of physiological parameters allows for more detailed insights into activity-related blood glucose variability. Additionally, while HUPA-UCM is limited to FreeStyle Libre 2 data, our dataset integrates data from multiple continuous glucose monitor (CGM) platforms (LibreView, Dexcom, and Medtronic). Among the other resources, the Tidepool dataset aggregates real-world diabetes data from CGMs, insulin pumps, and manual log entries, providing patient-centred insights, whereas the diaTribe dataset primarily focuses on educational and research-based data, often derived from surveys and expert analyses rather than structured numerical datasets. The OhioT1DM dataset, specifically designed for T1D research, includes CGM data, insulin administration, carbohydrate intake, and other physiological factors, making it an essential resource for machine learning applications in glucose prediction and personalised treatment strategies. Our dataset expands on the commonly used nutritional information for meals consumed by including carbohydrate, fat, protein, and fibre content, along with a simple descriptor of each meal. This allows for a more nuanced understanding of the impact of macronutrient composition on blood glucose dynamics. By combining the strengths of existing datasets and addressing their limitations, our dataset contributes to advancing diabetes research, improving predictive modelling, and enhancing patient care strategies.

Compared to prominent datasets such as OhioT1DM and HUPA-UCM, our dataset offers distinctive advantages in demographic diversity, temporal resolution, and multimodal richness. The OhioT1DM dataset comprises 12 adult participants (8 male, 4 female) but lacks BMI data and exhibits limited age variability. While it includes pump-based insulin delivery and structured meals, the coverage of continuous glucose monitoring (CGM) and physical activity data is inconsistent between its two cohorts, and detailed nutritional information is sparse. Notably, physical activity data in OhioT1DM is not available for the full day. The HUPA-UCM dataset, on the other hand, involves 10 individuals aged approximately 20 to 50 years, monitored over 14 days using FreeStyle Libre and Fitbit devices. Although it emphasises physical activity, it lacks comprehensive insulin records, precise meal composition, and objective intensity metrics such as metabolic equivalents (METs) or gradient. Activity levels are based on Fitbit classifications, and both meal and insulin data are self-reported, without validation of temporal alignment. In contrast, our dataset comprises 17 participants, balanced by gender (10 female, 7 male), spans a broader age range (23–70 years), and includes documented BMI values (20.3–36.5 kg/m²). It offers 12 weeks of high-resolution, objectively captured data across six modalities, including Garmin-derived step counts, intensity levels, and sleep staging. This unique combination of demographic breadth, longitudinal depth, and sensor-derived multimodal data provides an unprecedented opportunity for personalised modelling of glycaemic dynamics under free-living, real-world conditions.

This dataset includes participants using both multiple daily injections (MDI) and insulin pumps operating in open-loop mode. These insulin delivery methods differ in dosing flexibility and associated glycaemic outcomes, with pump users often exhibiting distinct Time in Range (TIR) profiles compared to MDI users, primarily due to the increased adaptability of pump therapy rather than automation, as shown in previous studies¹⁷,¹⁸. The inclusion of multiple delivery modalities supports the evaluation of algorithm performance across diverse real-world treatment scenarios and enables stratified analyses by reported delivery method, particularly when combined with continuous glucose monitoring (CGM) metrics. This is especially relevant in contexts where access to closed-loop systems remains limited. Variability in insulin delivery methods and TIR should be carefully considered when developing and validating predictive models using these data.
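
As an illustration of the stratified analysis described above, the following minimal sketch computes TIR per participant and averages it within each delivery method. The 3.9 to 10.0 mmol/L target range is the commonly used consensus range and is an assumption here, not a value taken from this dataset; the record layout, field names, and glucose values are hypothetical and do not reflect the dataset's actual file structure.

```python
from collections import defaultdict
from statistics import mean

# Assumed consensus target range in mmol/L (70-180 mg/dL); not taken from the dataset.
TIR_LOW, TIR_HIGH = 3.9, 10.0


def time_in_range(readings_mmol):
    """Return the percentage of CGM readings within [TIR_LOW, TIR_HIGH]."""
    if not readings_mmol:
        raise ValueError("no glucose readings supplied")
    in_range = sum(TIR_LOW <= g <= TIR_HIGH for g in readings_mmol)
    return 100.0 * in_range / len(readings_mmol)


def tir_by_delivery_method(participants):
    """Average per-participant TIR within each insulin delivery group."""
    groups = defaultdict(list)
    for p in participants:
        groups[p["delivery"]].append(time_in_range(p["glucose"]))
    return {method: mean(values) for method, values in groups.items()}


# Hypothetical toy records; real files will differ in layout and units.
participants = [
    {"id": "P01", "delivery": "MDI", "glucose": [4.2, 11.3, 7.8, 3.1]},
    {"id": "P02", "delivery": "pump", "glucose": [5.5, 6.9, 9.8, 10.4]},
]
summary = tir_by_delivery_method(participants)
print(summary)
```

Counting readings equally assumes evenly spaced CGM samples; traces with gaps or mixed sampling intervals would need time-weighting before groups are compared.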

The longitudinal nature of the dataset presented here, spanning a 12-week (approximately 90-day) period, captures sustained patterns in blood glucose levels (BGLs) alongside relevant lifestyle factors. This extended duration aligns with the time frame over which HbA1c, a key biomarker for longer-term glucose control, is typically measured (8–12 weeks)¹⁹. Because HbA1c reflects average glucose over several weeks, rather than short-term fluctuations, the dataset offers a valuable resource for exploring how real-world behaviours and glucose trends may relate to HbA1c outcomes. Studies have shown that sampling bias in shorter periods, such as 10 days, can be as high as 47%, decreasing substantially to 26.4% after 30 days²⁰, further highlighting the importance of longer-term data for accurate assessment and prediction. A 12-week data window meets most regulatory requirements in treatment evaluation and is a standard duration used in phase II clinical trials for diabetes drugs, allowing researchers to draw meaningful comparisons of interventions and their sustained effects on blood glucose management²¹.
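
One way to relate a multi-week CGM trace to HbA1c is the glucose management indicator (GMI), a published linear estimate based on mean sensor glucose: GMI (%) = 3.31 + 0.02392 × mean glucose in mg/dL. The sketch below applies that formula to readings in mmol/L. It is an illustrative approximation, not a method used by the authors, and the readings are hypothetical; GMI is an estimate and can differ from laboratory HbA1c.

```python
from statistics import mean

MGDL_PER_MMOLL = 18.016  # glucose: 1 mmol/L is about 18.016 mg/dL


def gmi_percent(readings_mmol):
    """Estimate GMI (%) from CGM readings given in mmol/L.

    Applies GMI = 3.31 + 0.02392 * mean glucose (mg/dL).
    """
    if not readings_mmol:
        raise ValueError("no glucose readings supplied")
    mean_mgdl = mean(readings_mmol) * MGDL_PER_MMOLL
    return 3.31 + 0.02392 * mean_mgdl


# Hypothetical readings standing in for a participant's 12-week CGM trace.
readings = [6.0, 8.5, 10.0, 7.5]
estimate = gmi_percent(readings)
print(f"GMI estimate: {estimate:.2f}%")
```

Because the estimate depends only on the mean, a short or unrepresentative window shifts it directly, which is consistent with the case made above for longer observation periods.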

Further distinctions of the dataset include more precise insulin tracking, with separate basal and bolus data provided, whereas HUPA-UCM resamples insulin data at five-minute intervals, reducing accuracy. Our dataset also provides standardized nutritional data via Nutritics²², offering detailed macronutrient breakdowns (Fig. 8(a–d)), whereas HUPA-UCM records carbohydrates in “servings.” Additionally, HUPA-UCM sleep data is organized by night, with separate files that detail the start times and durations spent in each sleep stage based on Fitbit’s general sleep scores. In contrast, our sleep data provides a similarly structured stage breakdown but includes additional granular details about the transitions and specific dynamics of each sleep stage, offering a deeper insight into sleep patterns. These advantages make our dataset better suited for real-world diabetes management and artificial intelligence (AI)-driven glucose prediction, integrating a comprehensive range of parameters over a clinically relevant period. In contrast, HUPA-UCM, while useful for short-term glucose variation analysis, lacks the depth and granularity needed for extensive diabetes research.
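
To illustrate in general terms how resampling onto a five-minute grid can cost precision, the sketch below snaps bolus timestamps down to five-minute slots and reports how far events are displaced. It makes no claim about the actual HUPA-UCM processing pipeline; the function names, timestamps, and doses are invented for illustration.

```python
from datetime import datetime, timedelta

GRID = timedelta(minutes=5)


def floor_to_grid(ts, grid=GRID):
    """Snap a timestamp down to the start of its grid interval."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    steps = (ts - midnight) // grid  # whole grid intervals since midnight
    return midnight + steps * grid


def resample_boluses(events, grid=GRID):
    """Sum bolus units per grid slot; return slot totals and the largest time shift."""
    slots = {}
    max_shift = timedelta(0)
    for ts, units in events:
        slot = floor_to_grid(ts, grid)
        slots[slot] = slots.get(slot, 0.0) + units
        max_shift = max(max_shift, ts - slot)
    return slots, max_shift


# Hypothetical bolus events (timestamp, units); not drawn from either dataset.
events = [
    (datetime(2024, 1, 5, 12, 1, 30), 4.0),
    (datetime(2024, 1, 5, 12, 4, 50), 1.5),
    (datetime(2024, 1, 5, 18, 33, 0), 6.0),
]
slots, max_shift = resample_boluses(events)
for slot, units in sorted(slots.items()):
    print(slot.isoformat(), units)
print("largest shift:", max_shift)
```

In this toy example the two lunchtime boluses fall into the same slot and are merged, and one event is moved by almost five minutes, so both the dose sequence and its timing are coarsened relative to event-level records.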

Methods

The study followed a longitudinal observational design, with data collection spanning 1 October 2023 to 3 September 2024. Participants were recruited online through the social media pages of the Interaction Analysis and Modeling Lab (IAM Lab) group, T1D-specific social media groups, email outreach, broad-reaching tweets, social media posts, and physical advertisements placed in several buildings on the University of Manchester campus and in nearby locations such as sports centres. Potential participants were given at least 24 hours to consider their involvement to ensure a non-coercive recruitment process. PwT1D over the age of 18, who had been living with T1D for more than two years and who used CGM, were invited to participate. Applicants were excluded from the data collection if they had additional conditions that impacted their nutritional intake or if they used medications that affected their sleep and/or physical performance. Detailed inclusion and exclusion criteria can be found in Table 1. After an initial screening, eligible participants were contacted for a face-to-face or online interview conducted by researchers from the study team. During this interview, participants received instructions on recording their nutritional intake and on wearing the smartwatch (Garmin Forerunner 45) to ensure comprehensive data collection, including sleep tracking. Informed consent was also obtained at this stage for accessing blood glucose sensor and insulin device platforms. The participants were then issued a smartwatch for the study, and at the end of the data collection period additional consent was obtained to access the watch data.

Table 1 Inclusion and Exclusion Criteria for Study Participants.
