Background & Summary

Type 1 Diabetes (T1D) is a chronic autoimmune disorder that destroys pancreatic beta cells, resulting in a loss of insulin production and the body’s inability to self-regulate blood glucose levels (BGL)1. In the United Kingdom (UK), diabetes affects roughly 8% of the population, with approximately 10% of these cases classified as T1D, according to Breakthrough T1D2.

Managing diabetes places a substantial financial burden on healthcare resources; nearly 10% of the annual budget of the National Health Service (NHS) in England and Wales is allocated to diabetes care in general3. However, when it comes to T1D, access to advanced technologies such as closed-loop insulin delivery systems is becoming more common in high-income countries, while remaining limited in low- and middle-income countries (LMICs) and among individuals without sufficient health insurance coverage4.

Chronic complications arising from poor glycaemic control significantly heighten health risks and mortality in people with T1D (PwT1D) compared to non-diabetic individuals, making strict glycaemic control essential for mitigating this risk5. Consequently, management of T1D requires PwT1D to make multiple daily decisions on monitoring their BGL, administering correct insulin dosages and managing hypo- or hyper-glycaemia when they occur. Several types of technologies, such as wearable glucose sensors and insulin pump devices, have been developed to help PwT1D improve their glycaemic control with minimal impact on their quality of life6. Despite this, fewer than 40% of PwT1D achieve the recommended level of glycaemic control required to reduce the risk of complications7.

Research indicates that reliable blood glucose predictions could significantly improve the quality of life of PwT1D8. Consequently, intelligent diabetes management systems require BGL prediction algorithms that accurately mimic daily glycaemic variability while responding to the spontaneity of everyday life6. These predictions must respond to the various factors that affect BGL9, including insulin administration, food intake (carbohydrate), physical activity, and sleep patterns10.

The current state-of-the-art technology for T1D management is automatic insulin delivery (AID), also known as closed-loop systems11. These systems automate insulin dosing based on continuous glucose monitoring; however, they can temporarily switch to manual mode—standard insulin pump operation—in cases of connectivity issues or specific manufacturer-defined conditions12. This study provides a comprehensive dataset capturing real-world data from PwT1D, which could also facilitate the development of algorithms to support clinicians in LMICs, where AID remains less prevalent, expensive, or unavailable.

Publicly available datasets with similar variables, such as HUPA-UCM13, Tidepool14, diaTribe15, and OhioT1DM16, provide valuable insights into diabetes management. However, the dataset presented here offers unique advantages as it offers a more comprehensive analysis of long-term glycaemic trends and lifestyle factors. Unlike the 14-day HUPA-UCM dataset, which relies on the Fitbit Ionic and lacks Metabolic Equivalent of Tasks (METs) and motion intensity tracking, our dataset spans three months and leverages the Garmin Forerunner 45 to capture detailed activity data, including step count, calories burned, distance, METs, motion intensity, and categorized activity types. This richer set of physiological parameters allows for more detailed insights into activity-related blood glucose variability. Additionally, while HUPA-UCM is limited to FreeStyle Libre 2 data, our dataset integrates data from multiple continuous glucose monitor (CGM) platforms (LibreView, Dexcom, and Medtronic). Similarly, the Tidepool dataset aggregates real-world diabetes data from CGMs, insulin pumps, and manual log entries, providing patient-centered insights. In contrast, the diaTribe dataset primarily focuses on educational and research-based data, often derived from surveys and expert analyses rather than structured numerical datasets. The OhioT1DM dataset, specifically designed for T1D research, includes CGM data, insulin administration, carbohydrate intake, and other physiological factors, making it an essential resource for machine learning applications in glucose prediction and personalised treatment strategies. Our dataset expands on the commonly used nutritional information for meals consumed by including carbohydrate, fat, protein, and fibre content, along with a simple descriptor of each meal. This allows for a more nuanced understanding of the impact of macronutrient composition on blood glucose dynamics. 
By combining the strengths of existing datasets and addressing their limitations, our dataset contributes to advancing diabetes research, improving predictive modelling, and enhancing patient care strategies.

Compared to prominent datasets such as OhioT1DM and HUPA-UCM, our dataset offers distinctive advantages in demographic diversity, temporal resolution, and multimodal richness. The OhioT1DM dataset comprises 12 adult participants (8 male, 4 female) but lacks BMI data and exhibits limited age variability. While it includes pump-based insulin delivery and structured meals, the coverage of continuous glucose monitoring (CGM) and physical activity data is inconsistent between its two cohorts, and detailed nutritional information is sparse. Notably, physical activity data in OhioT1DM is not available for the full day. The HUPA-UCM dataset, on the other hand, involves 10 individuals aged approximately 20 to 50 years, monitored over 14 days using Freestyle Libre and Fitbit devices. Although it emphasises physical activity, it lacks comprehensive insulin records, precise meal composition, and objective intensity metrics such as metabolic equivalents (METs) or gradient. Activity levels are based on Fitbit classifications, and both meal and insulin data are self-reported, without validation of temporal alignment. In contrast, our dataset comprises 17 participants, balanced by gender (10 female, 7 male), spanning a broader age range (23–70 years) and includes documented BMI values (20.3–36.5 kg/m2). It offers 12 weeks of high-resolution, objectively captured data across six modalities, including Garmin-derived step counts, intensity levels, and sleep staging. This unique combination of demographic breadth, longitudinal depth, and sensor-derived multimodal data provides an unprecedented opportunity for personalised modelling of glycaemic dynamics under free-living, real-world conditions.

This dataset includes participants using both multiple daily injections (MDI) and insulin pumps operating in open-loop mode. These insulin delivery methods differ in dosing flexibility and associated glycaemic outcomes, with pump users often exhibiting distinct Time in Range (TIR) profiles compared to MDI users—primarily due to the increased adaptability of pump therapy rather than automation, as shown in previous studies17,18. The inclusion of multiple delivery modalities supports the evaluation of algorithm performance across diverse real-world treatment scenarios and enables stratified analyses by reported delivery method, particularly when combined with continuous glucose monitoring (CGM) metrics. This is especially relevant in contexts where access to closed-loop systems remains limited. Variability in insulin delivery methods and TIR should be carefully considered when developing and validating predictive models using these data.

The longitudinal nature of the dataset presented here—spanning a 12-week (90-day) period—captures sustained patterns in blood glucose levels (BGLs) alongside relevant lifestyle factors. This extended duration aligns with the time frame over which HbA1c, a key biomarker for longer-term glucose control, is typically measured (8–12 weeks)19. Because HbA1c reflects average glucose over several weeks, rather than short-term fluctuations, the dataset offers a valuable resource for exploring how real-world behaviours and glucose trends may relate to HbA1c outcomes. Studies have shown that sampling bias in shorter periods, such as 10 days, can be as high as 47%, decreasing substantially to 26.4% after 30 days20, further highlighting the importance of longer-term data for accurate assessment and prediction. A 12-week data window meets most regulatory requirements in treatment evaluation and is a standard duration used in phase II clinical trials for diabetes drugs, allowing researchers to draw meaningful comparisons of interventions and their sustained effects on blood glucose management21.

Further distinctions of the dataset include more precise insulin tracking, with separate basal and bolus data provided, whereas HUPA-UCM resamples insulin data at five-minute intervals, reducing accuracy. Our dataset also provides standardized nutritional data via Nutritics22, offering detailed macronutrient breakdowns (Fig. 8(a–d)), whereas HUPA-UCM records carbohydrates in “servings.” Additionally, HUPA-UCM sleep data is organized by night, with separate files that detail the start times and durations spent in each sleep stage based on Fitbit’s general sleep scores. In contrast, our sleep data provides a similarly structured stage breakdown but includes additional granular details about the transitions and specific dynamics of each sleep stage, offering a deeper insight into sleep patterns. These advantages make our dataset better suited for real-world diabetes management and artificial intelligence (AI)-driven glucose prediction, integrating a comprehensive range of parameters over a clinically relevant period. In contrast, HUPA-UCM, while useful for short-term glucose variation analysis, lacks the depth and granularity needed for extensive diabetes research.

Methods

The study followed a longitudinal observational design, with data collection spanning from 1 October 2023 to 3 September 2024. Participants were recruited online through the social media pages of the Interaction Analysis and Modeling Lab (IAM Lab) group, T1D-specific social media groups, email outreach, broad-reaching tweets, social media posts, and physical advertisements placed in several buildings on the University of Manchester campus and in nearby locations such as sports centres. Potential participants were given at least 24 hours to consider their involvement to ensure a non-coercive recruitment process. PwT1D over the age of 18, who had been living with T1D for more than two years and who used CGM, were invited to participate. Applicants were excluded from the data collection if they had additional conditions that impacted their nutritional intake or if they used medications that affected their sleep and/or physical performance. Detailed inclusion and exclusion criteria can be found in Table 1. After an initial screening, eligible participants were contacted for a face-to-face or online interview conducted by researchers from the study team. During this interview, participants received instructions on recording their nutritional intake and on how to wear the smartwatch (Garmin Forerunner 45) to ensure comprehensive data collection, including sleep tracking. Informed consent was also obtained at this stage for accessing blood glucose sensor and insulin device platforms. The participants were then issued a smartwatch for the study and, at the end of the data collection period, additional consent was obtained to access the watch data.

Table 1 Inclusion and Exclusion Criteria for Study Participants.

Ethical approval

This study was reviewed and approved by the University of Manchester Research Ethics Committee before data collection began (Ref: 2023-15687-29584). Data collection adhered to all legal requirements and followed the principles of the Declaration of Helsinki, Good Clinical Practice (GCP), and the UK Policy Framework for Health and Social Care Research 2017. All participants provided informed consent for their data to be published.

Data collection

At the beginning of the study, all participants were instructed not to change their lifestyle or make any adjustments to their food intake, physical activity, or sleep patterns. This ensured that the data collected reflected their usual behaviours without external influences. The 12-week period was selected to align with the timeframe reflected by HbA1c measurements, and data collection commenced shortly after participants had a clinically measured HbA1c value. This alignment ensured that consistent lifestyle patterns could be observed throughout a period directly relevant to long-term glucose control.

Blood glucose data collection

Participants were already using CGM sensors as part of their routine clinical care, and their devices were linked to the LibreView23, Dexcom Clarity24 and Medtronic CareLink25 platforms. Participants provided informed consent for their CGM data to be downloaded and analysed.

Insulin data collection

Participants using an insulin pump (Tandem t:slim X226, MiniMed 780G25, or Omnipod 512) as part of their routine clinical care had insulin delivery data downloaded from their respective device platforms and exported in CSV format. These files contained detailed information, including timestamps, insulin types (e.g., bolus or basal), and dosage amounts. Those on multiple daily insulin pen injections (MDI) electronically recorded their insulin data on platforms such as the FreeStyle LibreLink app27 and Dexcom G628, which were then exported in CSV format.

Nutrition data collection

Participants were able to choose from two methods to record food intake. The first option was to use the mobile application, MyFitnessPal29, which allows for commonly consumed foods to be logged and tagged with associated recipes or ‘my foods’ options. Alternatively, participants could opt for a manual food diary, which required recording the time of the meal, meal type, foods consumed, estimated carbohydrate content, and insulin administered, alongside the corresponding food tag for each meal. Given the variability in how food diaries were maintained, all entries were standardised using Nutritics22 to ensure consistency in nutritional analysis across datasets. Participants were instructed to comprehensively document their food intake along with a food tag, a single descriptive word for each meal, for the first two weeks of the data collection period. Thereafter, only the food tag and detailed information on any newly introduced foods were required for each meal and snack. Each tag was unique to a specific meal and could be any identifier, provided it consistently referred to the same food item. For instance, different types of breakfast cereal would require distinct tags, such as ‘cornflakes’ or ‘muesli.’ The tags could be hyphenated (e.g., ‘fruit-yogurt’) and did not need to fully describe the food consumed, being as simple as ‘breakfast1’ or ‘breakfast2’ providing they were specific to a particular meal or snack.
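The tag-based logging scheme can be pictured as a lookup table built from the fully documented entries and reused for later tag-only entries. The following is a minimal sketch under stated assumptions, not the study's actual processing pipeline: the function names `build_tag_lookup` and `resolve_entry` and the record layout are hypothetical, the macronutrient field names are borrowed from the nutrition files described under Data Records, and the numbers are illustrative only.

```python
from typing import Optional

MACROS = ("carbs_g", "prot_g", "fat_g", "fibre_g")


def build_tag_lookup(detailed_entries):
    """Map each food tag to the macronutrients of its first detailed record."""
    lookup = {}
    for entry in detailed_entries:
        tag = entry["meal_tag"].strip().lower()
        # Keep the first full description seen for a given tag.
        lookup.setdefault(tag, {m: float(entry[m]) for m in MACROS})
    return lookup


def resolve_entry(tag_only_entry, lookup) -> Optional[dict]:
    """Fill in macronutrients for a tag-only entry; return None for unknown tags."""
    tag = tag_only_entry["meal_tag"].strip().lower()
    if tag not in lookup:
        return None  # a newly introduced food still needs a detailed record
    return {**tag_only_entry, **lookup[tag]}


# Illustrative values only; not taken from the dataset.
detailed = [
    {"meal_tag": "cornflakes", "carbs_g": 35.0, "prot_g": 6.0, "fat_g": 2.5, "fibre_g": 1.2},
    {"meal_tag": "fruit-yogurt", "carbs_g": 22.0, "prot_g": 5.0, "fat_g": 3.0, "fibre_g": 2.0},
]
lookup = build_tag_lookup(detailed)
later = {"meal_ts": "11/20/2023 08:05", "meal_type": "Breakfast", "meal_tag": "Cornflakes"}
resolved = resolve_entry(later, lookup)
```

Returning `None` for an unknown tag mirrors the protocol's requirement that newly introduced foods be documented in full before their tag can be reused.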

Activity data collection

Participant activity data was collected using the Garmin Forerunner 45 smartwatch. A custom API was developed for this project (Bilal, A., https://iam-research.manchester.ac.uk/flaskapp/) to seamlessly integrate with the Garmin Connect Developer Program, enabling real-time, historical, and batch data retrieval. This API captured detailed raw activity data, including HEALTH - Epochs, which provides a structured time-series dataset. Developed using a Python framework, the API was deployed on The University of Manchester server. Participants received a secure access link and provided consent via the Garmin API Developer Platform to share their data for collection and analysis.

Sleep data collection

Each participant was instructed to wear the Garmin Forerunner 45 continuously throughout the study, including during sleep, removing it only for charging. Sleep data were retrieved through the Garmin Connect app using a dual approach to ensure comprehensive data collection. Participants first downloaded and shared all sleep-related data directly from their Garmin Connect app accounts. Additionally, sleep data were accessed via the Garmin Connect API, with participants consenting to share their smartwatch data through their Garmin Connect credentials. This approach ensured the collection of detailed sleep metrics, including sleep stages, while maintaining data security and participant privacy.

Data Records

The dataset is available in a Zenodo repository30, ‘T1D-UOM – A Longitudinal Multimodal Dataset of Type 1 Diabetes’, at https://doi.org/10.5281/zenodo.15806142.

Participant information

Twenty-one participants were initially recruited; however, four withdrew for personal reasons. Data from the remaining 17 participants were available for the final analysis. Table 2 outlines the participants’ demographics, the start and end dates of data collection, and the devices used.

Table 2 Participant demographics.

Certain data were unavailable from some participants due to technical issues or failure to submit the required information during the study period. Data from one participant (UoM2303) could not be collected because the participant had an unplanned trip abroad during the study period.

Figure 1 illustrates the age distribution of participants. A balanced age range enhances the dataset’s applicability for research on T1D management, including personalized glucose prediction models, HbA1c trends, and the impact of lifestyle factors on long-term glycaemic control. By including both younger and older participants, researchers can develop AI-driven models that generalize across different age groups, accounting for variations in insulin sensitivity, metabolic rates, and physical activity patterns.

Fig. 1

Age distribution of study participants. The figure illustrates the range and frequency of ages in the dataset, providing insight into the demographic composition of the cohort. Most participants were in the 20–30 age bracket, with only one participant aged 70 years.

To explore the metabolic diversity within the dataset, Fig. 2 provides valuable context. This bubble graph illustrates the relationship between age and BMI, highlighting variation in body composition across participants. Such differences may influence insulin resistance, which is a key factor in personalised T1D management. Insights from this data can inform strategies for optimising insulin dosing, dietary recommendations, and physical activity guidelines tailored to different age groups.

Fig. 2

BMI vs. Age distribution of study participants. The figure displays the relationship between body mass index (BMI) and age, illustrating variations in body composition across different age groups. The youngest participant (23 years) has a BMI of 20.3 kg/m2, while the oldest participant (70 years) has a BMI of 25.7 kg/m2. The highest BMI in the dataset is observed in participant UoM2309 (36.5 kg/m2), whereas the lowest BMI is recorded for participant UoM2303 (20.3 kg/m2).

The descriptive statistics of BGLs and activity information for all participants are provided in Tables 3 and 4, respectively.

Table 3 Descriptive statistics of glucose levels for all participants, including mean glucose values, standard deviations, number of recorded glucose readings, and calculated days.
Table 4 Descriptive statistics of activity levels for all participants, including mean values, standard deviations, number of recorded activity readings, and the duration of observation in days.

These tables report mean values, standard deviations, the number of recorded readings, and the duration of observation in days.

Dataset structure

The complete dataset is outlined in Table 5, which provides information on each subfolder in the UoMT1D Dataset folder. All files are in comma-separated values (CSV) format, using a comma as the delimiter and UTF-8 encoding.

Table 5 Dataset structure.

Blood glucose data

Table 6 gives an overview of the glucose data files. Each UoMGlucoseID.csv file includes two fields describing blood glucose data. The bg_ts field records the time of each observation in the format MM/DD/YYYY HH:MM, giving minute-level timestamps for monitoring blood glucose trends over time. The value field specifies the blood glucose reading as a floating-point value in mmol/L. Table 7 provides example BGL data.

Table 6 UoM Blood Glucose Data Description.
Table 7 Blood glucose data example of UoM2301.
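As an illustration of how the two glucose fields might be read, the sketch below parses bg_ts with the stated MM/DD/YYYY HH:MM format and converts mmol/L to mg/dL. It is a minimal example under assumptions: the helper names `read_glucose` and `to_mgdl` are hypothetical, the sample rows are invented, and the conversion factor is the usual approximate value derived from the molar mass of glucose.

```python
import csv
import io
from datetime import datetime

TS_FORMAT = "%m/%d/%Y %H:%M"  # MM/DD/YYYY HH:MM, as described for bg_ts
MGDL_PER_MMOLL = 18.016       # approximate factor for glucose (about 180.16 g/mol)


def read_glucose(handle):
    """Return a list of (timestamp, mmol/L) tuples sorted by time."""
    rows = []
    for record in csv.DictReader(handle):
        ts = datetime.strptime(record["bg_ts"].strip(), TS_FORMAT)
        rows.append((ts, float(record["value"])))
    rows.sort(key=lambda r: r[0])
    return rows


def to_mgdl(mmol_per_l):
    """Convert a glucose concentration from mmol/L to mg/dL."""
    return mmol_per_l * MGDL_PER_MMOLL


# Illustrative rows only; not taken from the dataset.
sample = io.StringIO("bg_ts,value\n10/02/2023 00:05,6.1\n10/02/2023 00:00,5.8\n")
readings = read_glucose(sample)
```

For a real file, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")`, matching the UTF-8 CSV format stated for the dataset.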

Insulin data

The UoMBasalID.csv file includes three key fields describing basal insulin data, as shown in Table 8. The basal_ts field records the timestamp of each observation in the format MM/DD/YYYY HH:MM, capturing both date and time to enable precise temporal analysis. The basal_dose field specifies the basal insulin rate as a floating-point value, with units represented as either U (units) for participants using long-acting insulin or U/h (units per hour) for those using rapid-acting insulin, providing essential information on dosage for therapy monitoring. The insulin_kind field identifies the type of insulin administered, with possible values R (rapid-acting insulin) and L (long-acting insulin), facilitating differentiation between formulations used in treatment. Table 9 provides example basal insulin data. For an overview of which participants use each insulin type, see Table 21.

Table 8 UoM Basal Data Description.
Table 9 Basal Insulin Data Example of UoM2301.

The UoMBolusID.csv file includes two fields describing bolus insulin data, as shown in Table 10. The bolus_ts field captures the timestamp of each observation in the format MM/DD/YYYY HH:MM, enabling precise tracking of the timing of bolus insulin administration. The bolus_dose field specifies the bolus insulin dose as a floating-point value, with units recorded as U (units), providing detailed information on the administered dosage. Table 11 provides example bolus insulin data.

Table 10 UoM Bolus Data Description.
Table 11 Bolus Insulin Data Example of UoM2301.
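Because basal_dose carries different units depending on insulin_kind, while bolus_dose is always in units, downstream code may want to make the unit explicit before aggregating. The sketch below is one possible approach, assuming only the field semantics described above; the helper names `basal_unit` and `daily_bolus_totals` are hypothetical and the sample doses are invented.

```python
from collections import defaultdict
from datetime import datetime

TS_FORMAT = "%m/%d/%Y %H:%M"
# L = long-acting insulin, dose in U; R = rapid-acting insulin, rate in U/h.
BASAL_UNITS = {"L": "U", "R": "U/h"}


def basal_unit(insulin_kind):
    """Return the unit implied by insulin_kind; raise on unexpected codes."""
    try:
        return BASAL_UNITS[insulin_kind.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown insulin_kind: {insulin_kind!r}") from None


def daily_bolus_totals(rows):
    """Sum bolus_dose (U) per calendar day from (bolus_ts, bolus_dose) text pairs."""
    totals = defaultdict(float)
    for ts_text, dose_text in rows:
        day = datetime.strptime(ts_text, TS_FORMAT).date()
        totals[day] += float(dose_text)
    return dict(totals)


# Illustrative values only; not taken from the dataset.
totals = daily_bolus_totals([
    ("10/02/2023 08:10", "4.5"),
    ("10/02/2023 13:02", "6.0"),
    ("10/03/2023 07:55", "5.0"),
])
```

Basal rates in U/h are deliberately not summed here: converting a rate into delivered units requires knowing how long each rate was in effect, which this sketch does not assume.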

Nutrition data

The UoMNutritionID.csv file provides detailed information about nutritional data through seven key fields, as shown in Table 12. The meal_ts field records the datetime of the observation in the format MM/DD/YYYY HH:MM, enabling precise tracking of meal timing. The meal_type field specifies the type of meal, with possible values including Breakfast, Lunch, Dinner, and Snack, allowing for categorisation of dietary intake. The meal_tag field briefly describes the food eaten, offering additional context about the meal’s composition. The carbs_g field quantifies the amount of carbohydrates consumed in grams, while the prot_g field records the amount of protein consumed in grams. Similarly, the fat_g and fibre_g fields measure the fat and fibre content of the meal, respectively, in grams. Table 13 provides example nutrition data.

Table 12 UoM Nutrition Data Description.
Table 13 Nutrition data example of UoM2301.
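To show how the macronutrient fields could feed an analysis of meal composition, the sketch below estimates meal energy and the share of energy from each macronutrient. It uses the general Atwater factors (4 kcal/g for carbohydrate and protein, 9 kcal/g for fat), which are an assumption introduced here and not part of the dataset; fibre is ignored for simplicity, and the function names and sample meal are hypothetical.

```python
# General Atwater factors in kcal per gram (an assumption of this sketch).
ATWATER_KCAL_PER_G = {"carbs_g": 4.0, "prot_g": 4.0, "fat_g": 9.0}


def meal_energy_kcal(meal):
    """Estimate meal energy from carbohydrate, protein and fat in grams."""
    return sum(float(meal[field]) * kcal for field, kcal in ATWATER_KCAL_PER_G.items())


def energy_shares(meal):
    """Return the fraction of estimated energy contributed by each macronutrient."""
    total = meal_energy_kcal(meal)
    if total == 0:
        return {field: 0.0 for field in ATWATER_KCAL_PER_G}
    return {
        field: float(meal[field]) * kcal / total
        for field, kcal in ATWATER_KCAL_PER_G.items()
    }


# Illustrative values only; not taken from the dataset.
meal = {
    "meal_ts": "10/02/2023 12:30", "meal_type": "Lunch", "meal_tag": "sandwich",
    "carbs_g": 40.0, "prot_g": 20.0, "fat_g": 10.0, "fibre_g": 5.0,
}
shares = energy_shares(meal)
```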

Activity data

The UoMActivityID.csv file captures activity-related information using twelve fields, as shown in Table 14. The activity_ts field records the datetime of each observation in the format MM/DD/YYYY HH:MM, enabling accurate tracking of activities. The activity_type field describes the type of activity, with possible values including SEDENTARY, WALKING, RUNNING, and GENERIC, where GENERIC refers to other forms of physical exertion not explicitly categorized, such as cycling, gym workouts, or swimming. The active_Kcal field quantifies the calories burned during active periods, measured in kilocalories (kcal). The step_count field records the number of steps taken, while the distance_m field measures the distance covered during the activity in meters. The duration_s field represents the total duration of the activity in seconds, complemented by the active_time_s field, which specifies the duration of active periods within the activity.

Table 14 UoM Activity Data Description.

The start_time_s field denotes the start time of the activity in seconds since a reference point, and the start_time_offset_s field provides the offset from the reference start time. The met field indicates energy expenditure in METs, representing the intensity of physical activity relative to resting levels. The intensity field categorises the activity’s intensity level as SEDENTARY, ACTIVE, or HIGHLY_ACTIVE. Additionally, the motion_intensity_mean and motion_intensity_max fields measure the average and maximum motion intensity during the activity, respectively. Figure 3 illustrates the distribution of the total number of steps and distance covered by the participants. These variations highlight differences in individual lifestyles, fitness levels, and their potential effects on glucose metabolism. Understanding this distribution is essential for analysing how activity levels influence glycaemic control and insulin sensitivity in the management of T1D. Table 15 provides example activity data.

Fig. 3

Distribution of activity metrics across participants. The figure presents the variability in physical activity levels among participants, measured in terms of step count and distance covered.

Table 15 Activity data example of UoM2301.
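Daily step totals are a natural first aggregate of the activity records. The sketch below sums step_count per calendar day from activity_ts and then averages across observed days. It is a minimal example under assumptions: the function names are hypothetical, the rows are invented, and days without any records are simply absent (not counted as zero), which may or may not suit a given analysis.

```python
from collections import defaultdict
from datetime import datetime

TS_FORMAT = "%m/%d/%Y %H:%M"  # format stated for activity_ts


def daily_steps(rows):
    """Sum step_count per calendar day for rows with activity_ts and step_count."""
    totals = defaultdict(int)
    for row in rows:
        day = datetime.strptime(row["activity_ts"], TS_FORMAT).date()
        totals[day] += int(float(row["step_count"]))
    return dict(totals)


def mean_daily_steps(rows):
    """Average the daily totals over days that have at least one record."""
    totals = daily_steps(rows)
    return sum(totals.values()) / len(totals) if totals else 0.0


# Illustrative values only; not taken from the dataset.
rows = [
    {"activity_ts": "10/02/2023 09:00", "activity_type": "WALKING", "step_count": "120"},
    {"activity_ts": "10/02/2023 09:15", "activity_type": "WALKING", "step_count": "300"},
    {"activity_ts": "10/03/2023 18:30", "activity_type": "RUNNING", "step_count": "500"},
]
per_day = daily_steps(rows)
average = mean_daily_steps(rows)
```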

Sleep data

The UoMSleepID.csv file provides physiological and activity-related data through seven fields, as shown in Table 16. The Timestamp field captures the datetime of observation in the format MM/DD/YYYY HH:MM:SS, enabling detailed temporal analysis. The heart_rate field records the heart rate in beats per minute (bpm), offering insights into cardiovascular activity. The curren_activity_type_intensity field quantifies the intensity of the current activity as a count, while the stress_level_value field indicates the individual’s stress level on a scale. The steps field tracks the number of steps taken, serving as an indicator of physical activity during active periods.

Table 16 UoM Sleep Data Description.

The sleep_level field represents sleep or awake status, with possible values of 0 for sleep and 1 for awake, facilitating the analysis of rest patterns. Lastly, the resting_heart_rate field measures the heart rate during rest in beats per minute (bpm), offering a baseline for understanding variations in heart activity. Table 17 provides example sleep data.

Table 17 Sleep data example of UoM2301.

The UoMsleeptime.csv file provides more detailed and comprehensive sleep-related physiological data through fifteen fields, as shown in Table 18. The calendar_date field records the date of the sleep session in the format MM/DD/YYYY, enabling temporal analysis of sleep patterns. The start_date_ts field captures the precise start timestamp of sleep in MM/DD/YYYY HH:MM format, allowing for detailed time-based evaluations. The duration_in_sec field quantifies the total sleep duration in seconds, offering insight into overall sleep length.

Table 18 Sleep Time Data Columns and Descriptions.

In contrast, the data presented in UoMSleepID.csv focuses on higher-frequency, real-time observations such as heart rate, step count, and binary sleep/awake states. This provides a broader yet less granular view of nightly sleep architecture compared to the staged breakdown offered by UoMsleeptime.csv.

The dataset UoMsleeptime.csv further categorizes sleep into distinct stages. The deep_sleep_s, light_sleep_s, and rem_sleep_s fields respectively capture the duration spent in deep sleep, light sleep, and REM sleep, all measured in seconds. Additionally, the awake_s field records the duration spent awake during the sleep session, facilitating the identification of wake periods. The unmeasurable_sleep_s field accounts for time intervals where sleep data could not be measured. To provide a structured representation of sleep cycles, the sleep_levels_map_deep, sleep_levels_map_light, sleep_levels_map_rem, sleep_levels_map_awake, and sleep_levels_map_unmeasurable fields contain time-segment mappings in object format, representing different sleep states at various timestamps. These mappings help in analysing sleep structure and transitions between sleep stages. Lastly, the validation field indicates the validation status of the sleep data, with possible values such as ENHANCED_FINAL and ENHANCED_TENTATIVE, signifying the reliability and accuracy of the recorded sleep session. Table 19 provides example sleep time data.

Table 19 Sleep time data example of UoM2301 containing general sleep data, sleep durations, and sleep level mappings.
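The stage-duration fields lend themselves to simple per-night summaries. The sketch below derives hours asleep, the REM share of sleep, and a sleep-efficiency ratio from one record. It is only an illustration: the function name `stage_summary` and the efficiency definition (time asleep divided by time asleep plus awake_s) are assumptions of this example, the values are invented, and the sleep_levels_map fields are not parsed because their exact object layout is not specified here.

```python
STAGE_FIELDS = ("deep_sleep_s", "light_sleep_s", "rem_sleep_s", "awake_s")


def stage_summary(record):
    """Summarise one sleep session from its stage durations in seconds."""
    stages = {field: float(record.get(field) or 0) for field in STAGE_FIELDS}
    asleep_s = stages["deep_sleep_s"] + stages["light_sleep_s"] + stages["rem_sleep_s"]
    session_s = asleep_s + stages["awake_s"]
    return {
        "asleep_h": asleep_s / 3600,
        # Hypothetical definition: share of the session spent asleep.
        "efficiency": asleep_s / session_s if session_s else None,
        "rem_share": stages["rem_sleep_s"] / asleep_s if asleep_s else None,
    }


# Illustrative values only; not taken from the dataset.
night = {
    "calendar_date": "10/02/2023",
    "deep_sleep_s": 3600, "light_sleep_s": 14400,
    "rem_sleep_s": 5400, "awake_s": 1800,
}
summary = stage_summary(night)
```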

Technical Validation

All data streams, including CGM (LibreView, Dexcom Clarity, CareLink), insulin delivery records, Garmin activity and sleep data, and nutritional logs, were timestamped by their respective devices. For 15 out of 17 participants residing in the United Kingdom, no time-zone conversion was necessary, as all devices were already synchronised to UK local time (GMT or BST, depending on the date). Timestamps were parsed and handled using Python’s pytz and datetime modules to ensure consistency across data modalities. Two participants (UoM2303 in Spain and UoM2320 in the Netherlands) remained abroad during their entire data collection period. Their devices were verified to be accurately synchronised with their respective local time zones, and as such, no adjustments were applied. Cross-modal temporal alignment was validated by checking for logical consistency across meal intake, insulin administration, glucose fluctuations, and physical activity. Garmin activity timestamps, initially in Unix epoch format, were converted to localised timestamps using participant-specific offsets. No clock drift or desynchronisation was identified.

Blood glucose data

All authors collaboratively processed the raw data to produce a cleaned dataset, ensuring data integrity and consistency across participants. The cleaning process involved multiple steps. First, the raw files were parsed based on consistent participant identifiers and timestamps. Next, the authors removed duplicate entries, corrected formatting inconsistencies (e.g. improperly formatted numbers and timestamps), and handled missing or anomalous values using imputation or removal, depending on context and severity. For instance, physiologically implausible glucose readings (e.g., negative values or values outside biologically reasonable ranges) were cross-checked with adjacent measurements.

Each dataset underwent a thorough completeness check to ensure that all expected fields were present for each observation window. Furthermore, the authors performed an inter-rater reliability assessment on the blood glucose data by having multiple team members visually inspect the time series for outliers or inconsistencies. To further validate the process, glucose values were randomly sampled and compared between the raw and cleaned datasets, verifying that no data points were unintentionally omitted or altered during cleaning.
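The de-duplication and plausibility steps described above can be sketched as follows. This is an illustrative outline, not the authors' cleaning code: the function name `clean_readings` is hypothetical, and the plausibility bounds are placeholder values chosen for the example, since the source does not state the thresholds used. Out-of-range readings are flagged for review against neighbouring measurements, not silently dropped.

```python
from datetime import datetime

# Hypothetical plausibility bounds in mmol/L; the source does not state its thresholds.
PLAUSIBLE_MMOLL = (1.0, 35.0)


def clean_readings(readings):
    """Split (timestamp, mmol/L) readings into kept and flagged lists.

    Exact duplicates are removed; values outside PLAUSIBLE_MMOLL are flagged
    so they can be cross-checked against adjacent measurements.
    """
    low, high = PLAUSIBLE_MMOLL
    seen = set()
    kept, flagged = [], []
    for ts, value in sorted(readings, key=lambda r: r[0]):
        if (ts, value) in seen:
            continue  # exact duplicate of an earlier row
        seen.add((ts, value))
        if low <= value <= high:
            kept.append((ts, value))
        else:
            flagged.append((ts, value))
    return kept, flagged


# Illustrative values only; not taken from the dataset.
raw = [
    (datetime(2023, 10, 2, 0, 5), 6.1),
    (datetime(2023, 10, 2, 0, 0), 5.8),
    (datetime(2023, 10, 2, 0, 5), 6.1),   # duplicate
    (datetime(2023, 10, 2, 0, 10), -1.0), # physiologically implausible
]
kept, flagged = clean_readings(raw)
```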

To further ensure the accuracy of the time-in-range (TIR) calculations, the computed values for each participant were systematically cross-validated against the raw glucose data. For instance, participant UoM2301 had a calculated TIR of 79%, which matched the corresponding value derived directly from the raw data for the same period. This validation confirmed the consistency and reliability of the TIR results across the dataset. Minor discrepancies were noted in a few cases; for example, participant UoM2313 had a reported TIR of 49% based on LibreView raw data from 18/01/2024 to 31/01/2024, while the calculated TIR for that same period was 50.77%. Such differences were minimal and fell within an acceptable margin, further reinforcing the integrity of the cleaned dataset.
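A TIR recalculation of the kind described above can be sketched in a few lines. The target range of 3.9 to 10.0 mmol/L used here is the commonly cited consensus range and is an assumption of this example, as the text does not state the bounds applied; the function name `time_in_range` is hypothetical. The sketch counts readings, so it approximates time only when readings are evenly spaced.

```python
TIR_LOW, TIR_HIGH = 3.9, 10.0  # mmol/L; assumed target range, not stated in the text


def time_in_range(values, low=TIR_LOW, high=TIR_HIGH):
    """Return the percentage of glucose readings within [low, high]."""
    values = list(values)
    if not values:
        raise ValueError("no glucose readings supplied")
    in_range = sum(1 for v in values if low <= v <= high)
    return 100.0 * in_range / len(values)


# Illustrative values only; not taken from the dataset.
sample = [3.5, 4.2, 6.0, 9.9, 10.0, 12.3, 7.1, 15.0]
tir = time_in_range(sample)
```

Comparing such a value with the figure reported by a CGM platform for the same date window is one way to reproduce the cross-validation described above.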

Activity data

To ensure the accuracy and reliability of the collected activity data, several validation steps were implemented. First, raw data from the Garmin Forerunner 45 was cross-checked against the data captured through the custom API to confirm synchronization and integrity. Additionally, a new column called activity_ts was created to convert the original start_time_offset_s (stored in Unix timestamp format) into a human-readable timestamp. To verify the integrity of this transformation, random timestamps were sampled for each participant and cross-checked against the original raw data to ensure the conversion was accurate. Sample data points from different time periods were manually reviewed and compared to the recorded activity logs to ensure that the activity data reflected the correct times and activities. Data completeness was also verified by checking that no critical periods of activity were missing from the collected dataset. These validation steps ensured both the accuracy and inter-rater reliability of the activity data.
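The timestamp conversion can be illustrated with the standard library. The sketch below assumes, following the field descriptions given for the activity file, that start_time_s holds Unix seconds and start_time_offset_s holds the local offset from UTC in seconds; that reading of the two fields is an assumption of this example, and the function name `to_local_timestamp` is hypothetical. A fixed-offset time zone is used here in place of a named zone, so no time-zone database is required.

```python
from datetime import datetime, timedelta, timezone


def to_local_timestamp(start_time_s, start_time_offset_s):
    """Convert Unix seconds plus a UTC offset (seconds) to MM/DD/YYYY HH:MM text."""
    tz = timezone(timedelta(seconds=int(start_time_offset_s)))
    local = datetime.fromtimestamp(int(start_time_s), tz=tz)
    return local.strftime("%m/%d/%Y %H:%M")


# 1696204800 is 2 October 2023 00:00 UTC; 3600 s corresponds to BST (UTC+1).
activity_ts = to_local_timestamp(1696204800, 3600)
```

Spot checks like those described above amount to recomputing this value for a sample of rows and comparing it with the stored activity_ts.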

Fig. 4 illustrates the distribution of glycaemic control (TIR) and physical activity (daily step count) across participants. The inter-individual variability highlights the complex interplay between physical activity and glucose regulation, underscoring the need for personalised management strategies in T1D. Notably, higher physical activity did not uniformly translate to improved TIR, suggesting that additional factors such as insulin timing, nutrition, and individual insulin sensitivity may modulate these effects.

Fig. 4

Participant-level variation in Time in range and daily step count. Higher physical activity did not consistently align with better glycaemic control, suggesting influence from additional factors such as insulin and nutrition.

To quantify this relationship, a Pearson correlation analysis was conducted between mean daily step count and TIR across participants. The analysis revealed a moderate positive association (r = 0.59, p = 0.02), suggesting that higher levels of physical activity tended to align with higher TIR, as shown in Fig. 5.

Fig. 5

Positive correlation between mean daily step count and Time in Range (TIR). A moderate linear relationship was observed (Pearson r = 0.59, p = 0.02), suggesting increased physical activity may be associated with improved glycaemic control.
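For readers wishing to reproduce this type of analysis, the Pearson coefficient can be computed as in the sketch below. The implementation uses only the Python standard library, and the numbers in the usage example are hypothetical rather than values from the dataset. The function names are illustrative. A two-sided p-value would additionally require the t distribution with n - 2 degrees of freedom, for example via scipy.stats.pearsonr, which returns both the coefficient and the p-value.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two sequences of equal length (at least 3 points)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    if abs(r) >= 1:
        raise ValueError("t statistic is undefined for |r| = 1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))


if __name__ == "__main__":
    # Hypothetical per-participant values, for illustration only.
    mean_daily_steps = [4200, 6100, 7500, 5300, 9800]
    tir_percent = [52.0, 48.0, 71.0, 60.0, 79.0]
    r = pearson_r(mean_daily_steps, tir_percent)
    print(round(r, 3), round(t_statistic(r, len(tir_percent)), 3))
```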

Sleep data

Sleep data were collected using the Garmin Forerunner 45, which employs motion detection and heart rate variability to estimate various sleep stages, including light, deep, and REM sleep. It should be noted that these devices, designed for general lifestyle monitoring and not as medical tools31, can have variable accuracy influenced by factors like device fit, the participant’s movements during sleep, and environmental conditions. Comparative studies show that under optimal conditions, Garmin’s sleep tracking is consistent with more specialized devices32,33. To verify the accuracy and reliability of our data, we compared sleep time and duration from participant-shared raw data with that retrieved from the API, finding no discrepancies, thus confirming the robustness of our data.

Table 20 presents key sleep metrics across participants, showing variability in total sleep time, mean sleep stages (REM, Deep, Light), and related physiological parameters such as mean glucose levels, sleep efficiency, and recorded stress level. The distribution of sleep stages as a percentage of total sleep time is presented in Fig. 6, alongside each participant’s TIR percentage, with sleep stages displayed as stacked bars and TIR represented by adjacent individual bars. Across the 12 participants, Light sleep made up the largest share of total sleep time in most individuals, while Deep and REM sleep showed inter-individual differences. A slight negative association was observed between Time in Range (TIR %) and the duration of each sleep stage, as shown in Fig. 7.

Table 20 Sleep metrics summary per participant, including total sleep time (TST), mean sleep stages, mean glucose levels, and reported stress.
Fig. 6

Bar chart shows each participant’s sleep stage distribution (Deep (blue), REM (green), and Light (red)) as stacked bars, alongside yellow bars representing time spent in target glucose range TIR (%).

Fig. 7

The associations between Time in Range (TIR%) and durations of Deep Sleep, REM Sleep, and Light Sleep (in minutes) across participants. Each subplot presents a scatterplot with a fitted regression line and confidence interval, showing that TIR% tends to decrease slightly with increased duration in individual sleep stages.
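The stage percentages plotted in Fig. 6 can be derived from stage durations as in the following sketch. It assumes that total sleep time is the sum of the Deep, REM, and Light durations in minutes, with awake time excluded; the section does not define the denominator, so this is an assumption, and the function name is hypothetical.

```python
def stage_percentages(stage_minutes):
    """Map each sleep stage to its share (%) of the summed stage durations."""
    total = sum(stage_minutes.values())
    if total <= 0:
        raise ValueError("total sleep time must be positive")
    return {stage: 100.0 * minutes / total for stage, minutes in stage_minutes.items()}


if __name__ == "__main__":
    # Hypothetical night, not taken from the dataset.
    night = {"deep": 60, "rem": 90, "light": 300}
    for stage, share in stage_percentages(night).items():
        print(stage, round(share, 1))
```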

Insulin data

Descriptive statistics, such as the mean and standard deviation, were calculated to summarize key variables in the dataset. These summaries were then compared with the corresponding self-reported values provided by participants (Table 21) to assess the consistency and validity of the collected data. In all instances where data were available, the extracted basal and bolus insulin values aligned with the self-reported values. However, comparisons could not be made for participants UoM2303, UoM2308, UoM2309, UoM2320, and UoM2404 because either the collected data were missing or the values were not reported. These cases are indicated by ‘N/A’ in the corresponding rows and columns of the table. The absence of data may have been due to participants encountering issues with device synchronization or failing to provide the required data during the study period. This level of incompleteness (approximately 30%) is consistent with limitations reported in other publicly available datasets such as HUPA-UCM13, Tidepool14, diaTribe15, and OhioT1DM16, where gaps in insulin logging are common due to irregular reporting or device syncing issues. Instead of excluding participants whose insulin data were incomplete or not comparable to self-reported data, these participants were retained to preserve the richness of the dataset, particularly because other variables such as sleep, nutrition, and physical activity remain complete and may be valuable for investigating blood glucose patterns from other standpoints. A particularly notable case is UoM2304, who transitioned to a closed-loop system during data collection, so that the second half of their data was produced by this device. Furthermore, cases where the standard deviation of insulin doses was 0 U correspond to participants who exclusively used long-acting insulin, denoted as “L”, indicating a consistent daily dosage. In contrast, participants using rapid-acting insulin (“R”) exhibited greater variability in dosage and thus did not show this pattern of zero standard deviation.
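The comparison between extracted and self-reported doses can be sketched as below. The sample standard deviation and the 1 U agreement tolerance are assumptions, since the section states neither, and the function names are hypothetical. A missing self-report is propagated as None, mirroring the ‘N/A’ entries in Table 21.

```python
from statistics import mean, stdev


def summarise_doses(doses_u):
    """Mean and sample standard deviation (assumed estimator) of doses in units."""
    if len(doses_u) < 2:
        raise ValueError("at least two doses are needed for a standard deviation")
    return {"mean": mean(doses_u), "std": stdev(doses_u)}


def matches_self_report(summary, reported_u, tolerance_u=1.0):
    """Compare the extracted mean with a self-reported dose; None means N/A."""
    if reported_u is None:
        return None
    return abs(summary["mean"] - reported_u) <= tolerance_u


if __name__ == "__main__":
    # Hypothetical doses: a fixed long-acting dose and variable rapid-acting doses.
    long_acting = summarise_doses([20.0, 20.0, 20.0, 20.0, 20.0])
    rapid_acting = summarise_doses([4.0, 6.0, 8.0])
    print(long_acting, matches_self_report(long_acting, 20.0))
    print(rapid_acting, matches_self_report(rapid_acting, None))
```

A constant daily dose yields a standard deviation of exactly 0 U, which is the pattern noted above for participants using only long-acting insulin.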

Table 21 Comparison of self-reported and extracted insulin data: includes reported basal/bolus insulin, collected means and standard deviations, insulin type, delay between report and collection, and carbs/insulin ratio.

To further assess the reliability of the computed values, an additional column was introduced to capture the number of days between questionnaire completion and the final data collection. This interval varied significantly between participants, ranging from 33 to 283 days. A longer time gap may lead to greater variations in the mean and standard deviation of the values reported by participants. Therefore, this factor should be considered when interpreting the data.

Nutrition data

To ensure consistency in nutritional analysis across datasets, all food diary entries were standardized using Nutritics22. Given the inherent errors in self-reported food intake, including under-reporting, misreporting, and recall bias34,35,36, meal tags were used to cross-check nutritional composition when meals were reported multiple times. Self-reported nutritional data, while inherently limited, is often the only feasible option in real-world data collection. When used alongside objective measures like insulin dosing and individualized insulin-to-carb ratios, it provides a practical and contextually validated approach to estimating dietary intake and assessing data reliability. Nutritics was chosen for its comprehensive meal planning, recipe analysis, and nutrient tracking capabilities. Its advantages include a robust database, customizable reports, multi-language support, and integration with wearable devices, making it a reliable dietary assessment tool37. The Meal Tag system further strengthened this approach by correlating postprandial glucose responses (PPGR) with entire meals rather than isolated nutrient components. This method provides a more holistic understanding of glycaemic impact, considering factors beyond carbohydrate counting, such as gut microbiome composition, stress levels, and hormonal fluctuations. Two participants were excluded from the nutritional analysis: UoM2401, who failed to return their manual food diary, and UoM2312, whose dietary patterns changed significantly for religious reasons, violating the inclusion criteria. The meal-specific average nutritional intake across the dataset is shown in Fig. 9, and a per-participant breakdown is given in Table 22 and Fig. 8.

Table 22 Summary of macronutrient composition across meal types, including mean and standard deviation (Std) values for carbohydrates (g), protein (g), fat (g), and fiber (g), categorized by participant.
Fig. 8

Macronutrient distribution per participant for each meal. (a) Breakfast. (b) Lunch. (c) Dinner. (d) Snack.

Fig. 9

The bar plot visualizes the average intake of four key macronutrients, carbohydrates, protein, fat, and fiber, across different meal types: Breakfast, Lunch, Dinner, and Snack. Each bar represents the mean grams consumed for a specific nutrient within each meal category. Overlaid on the bars are jittered dots representing individual participant data points, allowing for visualization of variability and distribution around the mean values.

In instances where self-reported meal or insulin data are missing, PPGR initiation times can be imputed from matched events with similar contextual features, specifically by aligning with entries from the same day of the week and the same meal type. The imputation can use the closest available data points that reflect the average pattern of comparable matched events, ensuring contextual relevance. For example, a missing Monday breakfast entry from week 2 could be imputed by referencing isolated Monday breakfast data from weeks 1 and 3.
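One possible reading of this imputation rule is sketched below. It is an illustration under stated assumptions rather than the released preprocessing code: events are represented as dictionaries with hypothetical keys (week, weekday, meal, start_min as minutes after midnight), and a missing start time is filled with the mean start time of matching events from the nearest weeks.

```python
from statistics import mean


def impute_ppgr_start(events):
    """Fill missing start_min values from same-weekday, same-meal events."""
    result = []
    for event in events:
        if event["start_min"] is not None:
            result.append(dict(event, imputed=False))
            continue
        # Candidate donors: logged events with the same weekday and meal type.
        donors = [
            other for other in events
            if other["start_min"] is not None
            and other["weekday"] == event["weekday"]
            and other["meal"] == event["meal"]
        ]
        if not donors:
            result.append(dict(event, imputed=False))  # nothing to impute from
            continue
        # Keep only donors from the closest week(s) and average their start times.
        nearest = min(abs(d["week"] - event["week"]) for d in donors)
        closest = [d["start_min"] for d in donors if abs(d["week"] - event["week"]) == nearest]
        result.append(dict(event, start_min=round(mean(closest)), imputed=True))
    return result


if __name__ == "__main__":
    # Hypothetical log: Monday breakfast is missing in week 2.
    log = [
        {"week": 1, "weekday": "Mon", "meal": "breakfast", "start_min": 450},
        {"week": 2, "weekday": "Mon", "meal": "breakfast", "start_min": None},
        {"week": 3, "weekday": "Mon", "meal": "breakfast", "start_min": 480},
    ]
    for row in impute_ppgr_start(log):
        print(row)
```

In the usage example, the week 2 entry is filled from the week 1 and week 3 Monday breakfasts, matching the scenario described above. Flagging imputed rows keeps them distinguishable from logged events in downstream analysis.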

Preprocessing scripts used to isolate and model PPGRs, including steps for addressing missing or inconsistent logging, can be found at the Zenodo repository30 ‘T1D-UOM – A Longitudinal Multimodal Dataset of Type 1 Diabetes’ at https://doi.org/10.5281/zenodo.15806142.

Limitation

One potential limitation of this study is the relatively small number of participants (n = 17), which may limit the generalizability of findings, particularly for applications that rely on large and diverse training populations. However, the dataset provides dense, high-resolution multimodal data per individual—including glucose, insulin (basal and bolus), nutrition, activity, and sleep—collected continuously over a 12-week period. This richness supports the development of machine learning models that leverage temporal and contextual detail, such as recurrent neural networks (RNNs), transformer-based models, or personalized reinforcement learning approaches. These models can benefit significantly from the volume and granularity of data available per person, enabling them to learn complex intra-individual patterns relevant to T1D management. Future work will expand the cohort to include broader demographic diversity.