Introduction

Accurate assessment of eating behavior is crucial for understanding dietary patterns and energy balance; however, traditional self-report methods are limited by recall bias, underreporting, and participant burden1. Continuous glucose monitoring (CGM) provides an objective means of detecting eating events by capturing postprandial glucose dynamics at 1- to 5-min intervals2. While originally developed for diabetes management and artificial pancreas systems3,4,5,6,7,8,9, CGM is increasingly used in research and among healthy individuals to monitor eating behavior and support digital nutrition applications10,11.

Multiple meal detection algorithms (MDAs) have been proposed to detect meal onsets from CGM data3,11,12,13,14,15,16,17. When restricted to CGM-only input, these include rate-of-change (ROC) detectors3,15, fuzzy logic18, observer-based models16, supervised machine learning11,13, and physiology-based glucose–insulin modeling12,14. However, published MDAs have typically been evaluated in isolation, often using different datasets, study populations, and evaluation criteria, which limits their comparability.

Standardized metrics such as sensitivity, false positives per day (FP/day), and detection time (Δt) have been recommended for performance assessment2. Sensitivity values > 90% and false-positive rates < 1.5 FP/day have been reported for several MDAs7,12,13,14,15,16,18, while detection times often exceed 30 min3,7,13,14,16,18. Nevertheless, few studies have applied these metrics to systematically compare multiple CGM-only MDAs under free-living conditions using a shared dataset, limiting insight into relative performance trade-offs7,13,19.

To address this gap, we systematically implemented and validated nine published CGM-only MDAs using CGM data from young, healthy adults in free-living conditions. We focused on CGM-only approaches because our primary interest was behavioral meal detection for nutrition monitoring and digital interventions, where methods based on a single input stream may offer greater practicality and scalability than multimodal approaches. This focus is also relevant in healthy, normoglycemic individuals, in whom smaller postprandial glucose excursions make meal detection more challenging. We evaluated performance using standardized metrics within a participant-level holdout design. By benchmarking diverse detection approaches on the same dataset and under identical evaluation conditions, we aimed to provide a reproducible comparison of algorithm performance and practical guidance for selecting appropriate MDAs depending on application priorities. Based on prior findings, we expected model-based and pattern-recognition classifiers to balance sensitivity and detection latency more effectively than ROC-based methods.

Results

Hyperparameter tuning

Across validation sets, F2-scores ranged from 0.60 to 0.86 (Table 1). The highest scores were observed for MDADassau-2of3 and MDASamadi (both 0.86), whereas MDADassau-3of4 showed the lowest score (0.60). Most algorithms (MDADassau-2of3, MDADassau-3of4, MDAFaccioli, MDAHarvey, MDASamadi, and MDATurksoy) were tuned on a per-participant basis, while MDAKölle-Ra, MDAKölle-CGM, and MDAPopp were tuned using global hyperparameters. All final hyperparameter configurations derived from the validation phase were fixed prior to performance testing.
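The tuning criterion can be illustrated with a minimal Python sketch (the original implementation was in MATLAB and is not reproduced here). It computes the F2-score from counts and applies the tie-breaking order described in the Methods (lower FP/day, then shorter Δt). The function names, dictionary keys, and candidate values are hypothetical and chosen for illustration only.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from counts; beta > 1 weights sensitivity (recall) over precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def pick_best(candidates):
    """Highest F2 wins; ties go to lower FP/day, then to shorter detection delay."""
    return min(
        candidates,
        key=lambda c: (-f_beta(c["tp"], c["fp"], c["fn"]), c["fp_per_day"], c["delay_min"]),
    )


# Hypothetical validation results for three hyperparameter settings
grid = [
    {"name": "A", "tp": 40, "fp": 10, "fn": 10, "fp_per_day": 0.5, "delay_min": 40.0},
    {"name": "B", "tp": 40, "fp": 10, "fn": 10, "fp_per_day": 0.5, "delay_min": 35.0},
    {"name": "C", "tp": 30, "fp": 2, "fn": 20, "fp_per_day": 0.1, "delay_min": 30.0},
]
best = pick_best(grid)  # A and B tie on F2 and FP/day; B has the shorter delay
```

With β = 2, a missed meal costs four times as much as a false detection in the denominator, which is why setting C, despite its few false positives, scores below A and B.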

Table 1 Tuned hyperparameters and best F2 scores.

Performance on the test set

Table 2 summarizes algorithm-level performance across 216 meals in the test set. Sensitivity was highest for MDASamadi (89.8%), followed by MDAPopp (82.9%) and MDAKölle-CGM (82.4%). Intermediate performance was observed for MDAKölle-Ra (77.3%), MDATurksoy (76.9%), MDADassau-2of3 (72.2%), MDAHarvey (70.4%), and MDAFaccioli (64.4%), whereas MDADassau-3of4 showed the lowest sensitivity (49.1%).

Table 2 Performance metrics of meal detection algorithms.

FP/day were lowest for MDADassau-3of4 (0.12), MDATurksoy (0.22), and MDAHarvey (0.26), moderate for MDAKölle-Ra (0.33), MDAKölle-CGM (0.39), and MDADassau-2of3 (0.60), and highest for MDAPopp (1.28), MDAFaccioli (1.39), and MDASamadi (2.42).

Δt was shortest for MDADassau-3of4 (36.8 min), MDAHarvey (37.3 min), and MDADassau-2of3 (37.6 min), with intermediate times for MDAFaccioli (39.6 min), MDATurksoy (40.7 min), MDAKölle-Ra (41.7 min), and MDAKölle-CGM (43.8 min). MDASamadi (58.5 min) and MDAPopp (60.5 min) showed the longest detection delays.

Figure 1 illustrates participant-level trade-offs between sensitivity and FP/day, with point size reflecting detection time. Supplementary Table 2 shows a comparison of our observed performance values with those originally reported for each algorithm.

Fig. 1

Participant-level sensitivity versus false positives per day by meal detection algorithm. Each data point represents one participant.

Mixed-effects model results

The estimated marginal means (EMMs) with 95% CI for each algorithm are displayed in Table 3. Complete pairwise comparisons across all algorithms are provided in Supplementary Table 3.

Table 3 Estimated marginal means for each meal detection algorithm.

Sensitivity model (binomial GLMM)

Fixed and random effects explained 30.7% and 20.3% of the variance, respectively. Compared with MDAKölle-CGM (85.2% [CI 79.1–89.8]), MDADassau-3of4 (48.9% [CI 40.3–57.6], OR = 0.166, p < 0.001), MDAFaccioli (66.4% [CI 57.9–73.9], OR = 0.342, p < 0.001), and MDAHarvey (73.0% [CI 65.0–79.6], OR = 0.468, p = 0.050) exhibited significantly lower sensitivity. No significant differences were observed for MDADassau-2of3 (74.9% [CI 67.2–81.3]), MDAKölle-Ra (80.2% [CI 73.2–85.7]), MDAPopp (85.7% [CI 79.6–90.1]), MDASamadi (92.0% [CI 87.4–95.0]), or MDATurksoy (79.7% [CI 72.7–85.3]). Participants in the StandardDiet group had higher detection odds than those in the LCD group (β = 0.607, p = 0.020; OR = 1.84), indicating that reduced postprandial glucose excursions under carbohydrate restriction impaired meal detectability across algorithms. Weight was negatively associated with sensitivity (β = − 0.262, p = 0.045; OR = 0.77 per 1 SD increase; mean weight = 67.4 kg, SD = 10.0 kg). No significant effects were observed for phase, sex, height, or daily carbohydrate intake. No interactions were significant.

False positive model (Poisson GLMM)

Fixed and random effects explained 51.3% and 0.9% of the variance, respectively. Relative to MDAKölle-CGM (0.38 FP/day [CI 0.26–0.56]), higher FP/day were observed for MDAFaccioli (1.37 [CI 1.11–1.69], IRR = 3.57, p < 0.001), MDAPopp (1.26 [CI 1.02–1.57], IRR = 3.29, p < 0.001), and MDASamadi (2.39 [CI 2.02–2.82], IRR = 6.21, p < 0.001). Daily carbohydrate intake was positively associated with FP/day (β = 0.205, p = 0.005; IRR ≈ 1.23 per 1 SD [≈ 99 g] increase in carbs). No other fixed effects or interactions were significant.
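The reported effect sizes can be related to the model coefficients with a short Python sketch. It assumes the canonical links (logit for the binomial GLMM, log for the Poisson GLMM), which are not stated explicitly in the text, so that exponentiating a coefficient gives an odds ratio or incidence rate ratio. The function name is hypothetical.

```python
import math


def ratio_from_coef(beta):
    """Exponentiate a coefficient from a logit- or log-link model (assumed links)."""
    return math.exp(beta)


irr_carbs = ratio_from_coef(0.205)    # about 1.23 per 1 SD of daily carbohydrate intake
or_weight = ratio_from_coef(-0.262)   # about 0.77 per 1 SD of body weight
or_diet = ratio_from_coef(0.607)      # about 1.83; the reported 1.84 is consistent
                                      # with the coefficient having been rounded
```

Because predictors were z-standardized, each ratio refers to a one-SD change in the predictor (for example, 10.0 kg for body weight).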

Detection time model (LMM)

Fixed and random effects accounted for 19.8% and 17.6% of the variance, respectively. Compared with MDAKölle-CGM (43.8 min [CI 38.9–48.8]), faster detection was observed for MDADassau-2of3 (37.4 min [CI 32.4–42.4], β = − 0.324 SD, absolute difference − 6.4 min, p = 0.017), MDADassau-3of4 (37.2 min [CI 31.9–42.4], β = − 0.338 SD, absolute difference − 6.6 min, p = 0.017), and MDAHarvey (37.6 min [CI 32.6–42.7], β = − 0.313 SD, absolute difference − 6.2 min, p = 0.022). Conversely, MDAPopp (59.1 min [CI 54.2–64.1], β = + 0.771 SD, absolute difference + 15.3 min, p < 0.001) and MDASamadi (59.1 min [CI 54.1–64.0], β = + 0.767 SD, absolute difference + 15.3 min, p < 0.001) were significantly slower. No significant effects or interactions were observed for other fixed covariates.

Discussion

This study systematically evaluated nine MDAs that rely solely on CGM signals, using free-living CGM data from young, healthy adults. The algorithms demonstrated distinct performance profiles, and no single approach performed best across all three metrics (sensitivity, FP/day, and Δt). Sensitivity ranged from 49 to 90%, FP/day from 0.12 to 2.42, and Δt from 37 to 61 min. Pattern-recognition classifiers by Kölle et al.13 and the glucose-insulin-model-based algorithm by Turksoy et al.14 provided the most balanced trade-offs, combining high sensitivity, low FP/day, and moderate Δt. In contrast, ROC-based detectors, as proposed by Dassau et al.3 and Harvey et al.15, yielded the shortest detection times at the expense of reduced sensitivity.

Compared to their original publications, most algorithms required more permissive hyperparameters to perform well in this cohort. This likely reflects the smaller and earlier postprandial glucose excursions reported in healthy individuals compared with people with diabetes4,20,21. ROC-based detectors selected lower glucose thresholds and ROC criteria than originally proposed (e.g., MDADassau-2of3 selected a glucose threshold of 103.8 mg/dL vs. 150–220 mg/dL in the original work)3. MDAHarvey similarly adopted lower ROC limits and glucose minima15. Glucose-insulin-model-based methods also adapted toward more permissive decision thresholds, such as a lower estimated rate of appearance threshold in MDATurksoy14. MDAPopp selected a higher error tolerance during simulation-based fitting12. Algorithms that selected hyperparameters at the boundary of the predefined grid, such as MDADassau-3of4, may have performed better with an expanded tuning range.

When evaluated on the test set, MDASamadi and MDAPopp achieved the highest sensitivity (estimated marginal means of 92.0% and 85.7%, respectively), but at the cost of the slowest detection times (both 59.1 min) and substantially elevated FP/day (2.39 and 1.26). In contrast, MDAFaccioli and MDADassau-3of4 demonstrated poor sensitivity, suggesting limited applicability in practice, particularly in clinical settings where missed detections could lead to hyperglycemia14. Balanced performance was seen in the pattern-recognition MDAs by Kölle et al.13 and the glucose-insulin-model-based MDATurksoy14, all of which combined high sensitivity (> 79%), low FP/day (< 0.40), and moderate Δt (41–44 min). ROC-based approaches (MDAHarvey and MDADassau-2of3) achieved the fastest detection times (37–38 min) with moderate sensitivity (~ 73–75%) and relatively low FP/day (< 0.60).

According to thresholds proposed by Brummer et al.2 (≥ 90% sensitivity and < 1 FP/day as excellent), MDASamadi, MDAPopp, and MDAKölle-CGM met or approached the sensitivity benchmark, while MDAKölle-CGM, MDAKölle-Ra, MDATurksoy, MDADassau-2of3, MDADassau-3of4, and MDAHarvey met the FP/day criterion. However, all algorithms exhibited detection delays of 37 min or more, which remain suboptimal for real-time clinical decision support (e.g., artificial pancreas systems), for which a Δt below 20 min is considered desirable2. Despite previous reports suggesting earlier postprandial glucose peaks in healthy adults20,21,22, earlier detection was not observed in this cohort.

Algorithm selection should therefore depend on application-specific priorities. When prioritizing sensitivity, MDAKölle-CGM offers a strong choice. For contexts requiring low FP/day while preserving moderate-to-high sensitivity, MDAKölle-Ra and MDATurksoy are preferable, with the latter further advantageous when training data are unavailable. If shorter detection latency is prioritized and moderate sensitivity is acceptable, MDAHarvey and MDADassau-2of3 are suitable options. In contrast, MDAPopp, MDASamadi, MDAFaccioli, and particularly MDADassau-3of4 appear unsuitable for use in young, healthy adults under free-living conditions.

Algorithm performance varied with dietary and anthropometric characteristics. Participants in the StandardDiet group had 84% higher odds of true-positive detection than those in the LCD group, likely reflecting the larger postprandial glucose excursions observed under higher carbohydrate intake23. Increased daily carbohydrate intake was also associated with higher FP/day (IRR ≈ 1.23 per 99 g increase), suggesting that prolonged glucose excursions may increase the likelihood of misclassification. In addition, higher body weight was associated with slightly reduced odds of detection (OR ≈ 0.77 per 10 kg increase in body weight). Although all participants were within a normal BMI range, this observation may be partly explained by the reported CGM bias in Abbott devices, which underestimates plasma glucose in overweight individuals24. These findings highlight that meal size, macronutrient composition, and interindividual physiological differences may influence CGM signal characteristics and, consequently, algorithm performance.

This study has several strengths. It is, to our knowledge, the first to directly compare nine CGM-only MDAs using a standardized holdout design under free-living conditions. Algorithm performance was evaluated across multiple metrics (sensitivity, FP/day, Δt), and mixed-effects models enabled robust comparisons accounting for day- and participant-level variability. This standardized evaluation framework supports reproducible benchmarking of CGM-only meal detection algorithms across studies. However, several limitations should be acknowledged. Meal logging adherence was inconsistent, with nearly half of all recorded days containing at least one unlogged meal, and afternoon snacks often missing. To avoid penalizing algorithms for detecting unlogged meals, afternoon periods were blinded and a generous true-positive window was applied, likely inflating sensitivity and reducing FP/day. A wider true-positive window also increases the likelihood of matching delayed detections to logged meals. Although the wide window may affect absolute performance estimates, it was applied identically to all algorithms and therefore does not alter the validity of the relative comparisons that were the primary focus of this study. Afternoon blinding reduced the risk of classifying likely true but unlogged eating events as false positives, but it may also have suppressed valid detections. These evaluation constraints also reduce real-world applicability, because practical implementations would not usually suppress detections during predefined daytime periods. CGM quality control allowed files with gaps of up to 120 min to maximize sample retention. Although files with longer continuous gaps were excluded, retaining shorter gaps may have increased the likelihood of FN and prolonged Δt. Algorithms were reimplemented without access to the original code, and certain logic (e.g., activation/deactivation in MDASamadi) was simplified. Hyperparameter grids were restricted, and several algorithms selected values near boundary limits. The modest number of participants also limits generalizability, although repeated observations across 201 valid CGM files and 603 meal events partly strengthened the robustness of the comparative analyses. Results are limited to young, healthy adults and may not generalize to populations with diabetes, obesity, or altered glucose dynamics. Only CGM-only MDAs were included, whereas multimodal or more integrative approaches may offer performance advantages by incorporating additional physiological or behavioral signals. We nevertheless focused on CGM-only methods because they are less burdensome, less complex, and potentially more scalable for behavioral monitoring and digital nutrition applications in free-living settings.

Emerging developments in CGM and machine learning suggest that automated detection may evolve beyond meal events toward broader metabolic pattern recognition25. Integrating glucose-derived features with data-driven and physiological models may improve detection accuracy and support more personalized, real-time monitoring. Future work should therefore examine unified frameworks that can inform both behavioral monitoring and earlier identification of metabolic changes26.

This study provides a standardized comparison of CGM-only meal detection algorithms under free-living conditions and demonstrates that algorithm performance depends strongly on the relative importance of sensitivity, false detections, and detection latency. No single MDA performed best across all metrics, underscoring the need to select algorithms based on application priorities rather than expecting a universal solution. The acceptable balance between false-positive and false-negative detections is highly context dependent. In behavioral monitoring, occasional false detections can be resolved through simple participant confirmation, making higher sensitivity more valuable than strict specificity. In clinical contexts such as automated insulin delivery, however, FP must be minimized because erroneous meal detections may trigger inappropriate insulin dosing. FN also carry context-dependent implications: in behavioral or nutritional monitoring they primarily reduce completeness, whereas in clinical or intervention settings they may delay appropriate glycemic responses or impair decision support. Algorithm choice should therefore weigh the relative costs of missed versus incorrect detections. Future research should extend validation of CGM-based MDAs to more diverse populations, including individuals with type 1 or type 2 diabetes, obesity, or altered glucose dynamics. Given the observed effects of diet group and daily carbohydrate intake, meal-level macronutrient composition should be examined as a moderator of detection accuracy. More reliable food-logging protocols are needed to reduce uncertainty in ground-truth labeling. Algorithm development should prioritize reducing detection latency while maintaining moderate to high sensitivity to support just-in-time adaptive interventions. Finally, multimodal approaches that incorporate additional physiological or wearable signals may offer further gains beyond CGM-only methods.

Methods

Study design and participants

The present work is a secondary analysis of anonymized data derived from a 21-day randomized, controlled dietary intervention, conducted independently of the original trial objectives (ClinicalTrials.gov identifier: NCT07429058). Using these data, we evaluated CGM-based MDAs under free-living conditions across three 21-day study waves at the Technical University of Munich between late 2023 and early 2024. The study protocol was approved by the ethics committee of the Technical University of Munich (approval number: 447/21 S-KH), and all participants provided written informed consent prior to participation. All methods were performed in accordance with the Declaration of Helsinki. At the explicit request of the industry sponsor, the original intervention study was not registered in a clinical trial registry.

Participants were randomized to either a low-carbohydrate diet (LCD) or a standard diet (StandardDiet), with all procedures conducted within predefined eating windows (Fig. 2). Healthy adults aged 18–40 years, without diabetes mellitus and reporting at least one day of vigorous physical activity per week, were eligible. Participants with a BMI greater than 27 kg/m² or with medical conditions affecting metabolism were excluded. Of the 22 enrolled participants, six were excluded due to insufficient valid CGM data, resulting in a final sample of 16 participants (6 males, 10 females; mean age, 25.9 years; BMI, 23.4 kg/m²). Participant characteristics and daily carbohydrate intake by group and study phase are presented in Table 4.

Fig. 2

Study design of the dietary intervention.

Table 4 Participant characteristics.

Dietary intervention and schedule

The intervention included four dietary phases with standardized eating windows. All participants consumed their habitual (standard) diet during Baseline (Days 1–3). During the Diet Only period (Days 4–10), the LCD group consumed 75–125 g of carbohydrates/day, while the StandardDiet group continued the standard diet. During the Diet and Caloric Restriction period (Days 11–17), both groups adhered to their assigned diets with a daily energy deficit of ~ 500 kcal. During Washout (Days 18–21), participants returned to their habitual diet.

CGM measurements and meal logging

Interstitial glucose was recorded at 1-min intervals using FreeStyle Libre 2 sensors (Abbott Laboratories, USA), worn on the upper arm and paired with the FreeStyle LibreLink application. Each participant used one 14-day sensor followed by one 7-day sensor to cover all 21 study days. Participants were instructed to log all meals using the Supersapiens app (TT1 Products, Inc., USA).

Data quality control and final dataset

Each CGM file corresponded to one calendar day. Files were excluded if they contained fewer than three logged meals, had continuous gaps > 120 min, or belonged to participants with < 9 valid days. Files with shorter gaps were retained to preserve sample size and were processed together with the remaining CGM data during algorithm-specific resampling. After exclusions, 16 participants remained, contributing 201 valid CGM files and 603 meal events. The number of valid CGM files per participant ranged from 9 to 18. Further details on quality control, including file exclusion criteria, dataset composition, and the flow of participants and file inclusion and exclusion, are provided in the Supplementary Methods and Supplementary Fig. 1.
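The file-level exclusion rules above can be sketched in Python as follows. This is an illustrative reading of the stated criteria, not the authors' code: the record layout (`participant`, `timestamps_min`, `n_meals`) and function names are hypothetical, and gaps at the very start or end of a day are ignored for simplicity.

```python
from collections import defaultdict

MIN_MEALS = 3        # files with fewer logged meals are excluded
MAX_GAP_MIN = 120    # files with a longer continuous gap are excluded
MIN_VALID_DAYS = 9   # participants with fewer valid days are excluded


def longest_gap_min(timestamps_min):
    """Longest interval between consecutive readings, in minutes."""
    ts = sorted(timestamps_min)
    return max((b - a for a, b in zip(ts, ts[1:])), default=0)


def filter_files(files):
    """Apply day-level rules first, then drop participants with too few valid days."""
    valid = [
        f for f in files
        if f["n_meals"] >= MIN_MEALS and longest_gap_min(f["timestamps_min"]) <= MAX_GAP_MIN
    ]
    by_participant = defaultdict(list)
    for f in valid:
        by_participant[f["participant"]].append(f)
    return [f for fs in by_participant.values() if len(fs) >= MIN_VALID_DAYS for f in fs]


# Hypothetical data: P1 has 9 clean days; P2 has 8 clean days plus one with a 122-min gap
full_day = list(range(0, 1440))
gappy_day = list(range(0, 600)) + list(range(721, 1440))
files = (
    [{"participant": "P1", "timestamps_min": full_day, "n_meals": 3} for _ in range(9)]
    + [{"participant": "P2", "timestamps_min": full_day, "n_meals": 3} for _ in range(8)]
    + [{"participant": "P2", "timestamps_min": gappy_day, "n_meals": 4}]
)
kept = filter_files(files)  # only P1's nine files remain
```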

Selection of algorithms

Candidate MDAs were identified based on Brummer et al.’s2 scoping review and an updated literature search in PubMed and Google Scholar (January–June 2025). From 25 CGM-based MDA candidates, we excluded those requiring additional physiological inputs5,6,19,27,28,29,30,31,32,33,34,35, lacking implementation detail8,11,17, or not producing explicit meal onset times36,37. The final set included nine reproducible and CGM-only MDAs: two voting-based detectors by Dassau et al. (MDADassau-2of3 and MDADassau-3of4)3, a super-twisting observer by Faccioli et al. (MDAFaccioli)16, the Glucose Rate Increase Detector (GRID) approach by Harvey et al. (MDAHarvey)15, two classifiers by Kölle et al. (MDAKölle-Ra and MDAKölle-CGM)13, a simulation-based detector by Popp et al. (MDAPopp)12, a fuzzy-logic method by Samadi et al. (MDASamadi)18, and a model-based algorithm by Turksoy et al. (MDATurksoy)14. Full selection criteria and decision logic, including a detailed description of each MDA, are provided in the Supplementary Methods.

Algorithm implementation

All MDAs were implemented in MATLAB R2025a (The MathWorks, Inc., USA). Each algorithm processed the preprocessed CGM time series and returned detected meal onset times. CGM files were resampled to 1-min intervals for MDADassau-2of3 and MDADassau-3of43, MDAPopp12, and MDATurksoy14, and to 5-min intervals for MDAFaccioli16, MDAKölle-Ra and MDAKölle-CGM13, MDAHarvey15, and MDASamadi18. This algorithm-specific preprocessing was chosen to preserve the intended operating conditions of each published method rather than imposing a uniform resampling scheme across all algorithms. Detailed algorithm logic, preprocessing workflows, decision rules, and mathematical formulations are provided in the Supplementary Methods, with all equations consolidated in Supplementary Table 1.
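As a rough illustration of algorithm-specific resampling, the Python sketch below bins a 1-min series into 5-min intervals. The resampling method itself is not described in this section, so averaging within bins and marking empty bins as missing are assumptions, and the function name is hypothetical.

```python
def resample_mean(samples, step_min=5, day_len_min=1440):
    """Average (minute_of_day, glucose_mg_dl) pairs into fixed bins; empty bins are None."""
    n_bins = day_len_min // step_min
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for minute, value in samples:
        idx = int(minute) // step_min
        if 0 <= idx < n_bins:
            sums[idx] += value
            counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical readings: three in the first bin, one each in the second and third
bins = resample_mean([(0, 100), (1, 102), (4, 104), (5, 110), (12, 90)])
```

Leaving empty bins as missing rather than interpolating keeps sensor gaps visible to downstream logic; a real pipeline would need to decide how each algorithm should handle them.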

Training, validation, testing, and performance evaluation

A per-participant holdout design was selected because the primary goal was to evaluate algorithm performance under within-individual conditions, which also enabled a fair comparison of methods that require or allow participant-specific calibration. For each participant, CGM files were pseudo-randomly split into training (31.34%; 189 meals), validation (32.84%; 198 meals), and test sets (35.82%; 216 meals). Only the classifiers by Kölle et al.13 were trained. Hyperparameters were tuned on the validation set by maximizing the sensitivity-focused F2-score (β = 2)38, consistent with de Carvalho et al.11. Ties were resolved by preferring lower FP/day and then shorter Δt. Final performance was assessed on the unseen test set. Detections were matched to logged meal times using a 120-min true-positive (TP) window7,19. Detections within this window were counted as TPs, and logged meals without a matching detection were counted as false negatives (FN). Δt was defined as the time from logged meal start to detection. Only detections occurring after the logged meal start were included; negative values were not considered. Unmatched detections were counted as FP. A 120-min lockout followed each TP to prevent repeated detections. Detections between 22:00 and 07:00 were excluded due to nocturnal CGM noise3,7. Afternoon detections (from post-lunch to 15 min before dinner) were suppressed because afternoon snacks were generally not logged, leaving the ground truth uncertain during this period. This evaluation constraint was introduced to avoid penalizing algorithms for detecting potentially true but unlogged eating events.
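The matching rules can be sketched for a single day in Python. Several details are assumptions made for illustration, since they are not specified above: each detection is matched to the earliest still-unmatched meal, the lockout starts at the detection time, detections inside the lockout are ignored rather than counted as FP, and the afternoon blinded interval is passed in explicitly. All names are hypothetical.

```python
TP_WINDOW_MIN = 120
LOCKOUT_MIN = 120
NIGHT_START, NIGHT_END = 22 * 60, 7 * 60  # minutes since midnight


def match_detections(meals, detections, blinded=()):
    """Return (delays_of_TPs, n_FN, n_FP) for one day; all times in minutes since midnight."""
    def masked(t):
        night = t >= NIGHT_START or t < NIGHT_END
        return night or any(start <= t < end for start, end in blinded)

    open_meals = sorted(meals)
    delays, fp = [], 0
    lockout_until = float("-inf")
    for det in sorted(detections):
        if masked(det) or det < lockout_until:
            continue  # excluded period or lockout after a TP (assumed to be ignored)
        # Earliest unmatched meal that started 0 to 120 min before this detection
        hit = next((m for m in open_meals if 0 <= det - m <= TP_WINDOW_MIN), None)
        if hit is None:
            fp += 1
        else:
            open_meals.remove(hit)
            delays.append(det - hit)
            lockout_until = det + LOCKOUT_MIN
    return delays, len(open_meals), fp


# Hypothetical day: meals at 08:00, 12:30, 19:00; afternoon blinded from 14:00 to 18:45
result = match_detections(
    meals=[480, 750, 1140],
    detections=[300, 520, 560, 700, 800, 900, 1300],
    blinded=[(840, 1125)],
)
```

In this example the detections at 05:00 and 15:00 are masked, the one at 09:20 falls in the lockout, those at 11:40 and 21:40 count as FP, and the 19:00 meal is an FN, giving two TPs with delays of 40 and 50 min.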

Statistical analyses

For each algorithm, we calculated true positives (TP), FP, FN, sensitivity (= TP/(TP + FN)), FP/day, and Δt. Mixed-effects models were used to compare MDAs using day-level outcomes: sensitivity was modeled with a binomial generalized linear mixed model (GLMM), FP/day with a Poisson GLMM, and Δt with a linear mixed-effects model (LMM). Fixed effects included algorithm, diet group (LCD vs. StandardDiet), phase, sex, daily carbohydrate intake, height, and weight. Participant and day were included as random effects. Continuous predictors were standardized (z-scores). Estimated marginal means (EMMs) with 95% confidence intervals (CI) were reported. Tukey-adjusted pairwise comparisons used MDAKölle-CGM (Supplementary Methods and Supplementary Table 1) as the reference, with p < 0.05 considered significant. Effect sizes were expressed as odds ratios (OR), incidence rate ratios (IRR), or standardized β coefficients. Analyses were conducted in RStudio 2025.05.1 (Posit Software, USA).
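The descriptive metrics can be aggregated from day-level counts as in the Python sketch below. It assumes that FP/day is the total number of FP divided by the number of evaluated days and that Δt is summarized as the mean delay over all TPs; the summary statistic is not stated above, so both are assumptions. The input layout and names are hypothetical.

```python
def summarize(day_results):
    """day_results: one (tp_delays_min, n_fn, n_fp) tuple per evaluated CGM day."""
    delays = [d for tp_delays, _, _ in day_results for d in tp_delays]
    tp = len(delays)
    fn = sum(n_fn for _, n_fn, _ in day_results)
    fp = sum(n_fp for _, _, n_fp in day_results)
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,       # TP / (TP + FN)
        "fp_per_day": fp / len(day_results) if day_results else nan,
        "mean_delay_min": sum(delays) / tp if tp else nan,       # assumed mean over TPs
    }


# Hypothetical two-day example: 3 TPs, 1 FN, 2 FP in total
metrics = summarize([([40, 50], 1, 2), ([30], 0, 0)])
```

For scale, a sensitivity of 89.8% over the 216 test meals corresponds to roughly 194 detected meals.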