Fig. 5: Analytical pipeline, a step-by-step process of the study’s analytical approach.
From: Unveiling overeating patterns within digital longitudinal data on eating behaviors and contexts

a Conceptual framework for an integrated ML pipeline aimed at identifying and addressing overeating phenotypes in practice. The pipeline begins with data collection, integrating multimodal sources such as sensor data, EMA data, and dietary recalls to capture a comprehensive view of eating behaviors. This is followed by preprocessing to clean and harmonize the data, ensuring consistency across modalities, and feature extraction to derive key indicators such as chew rate, bite frequency, and emotional states associated with eating episodes. The next stage involves the development of ML models for overeating detection and phenotype ideation. Supervised learning models identify key features predictive of overeating episodes, leveraging behavioral and psychological features, while clustering techniques group individuals into distinct overeating phenotypes based on shared behavioral and contextual patterns. Once phenotypes are characterized, they can be integrated into personalized treatment strategies, tailoring interventions to address specific overeating patterns (context- and behavior-driven overeating). These treatments may include real-time feedback systems to prompt users to reflect on their behaviors, along with recommendations for sustainable behavioral changes. The framework culminates in system deployment, where real-time feedback and monitoring enable continuous assessment of eating behaviors and treatment efficacy. Data from deployed systems can feed back into the pipeline, enabling refinement and validation of models and interventions over time. This iterative process supports the practical application of overeating phenotype identification and management in real-world settings, creating a closed-loop system for adaptive and effective health interventions.
b Methodological approach used in this study. The process begins with data preparation, including the labeling and validation of meal times from sensor data and the integration of psychological and contextual factors from EMA data and 24-hour dietary recalls. Overeating detection is performed using supervised models, incorporating SMOTE for imbalanced data and Bayesian optimization for fine-tuning. A semi-supervised clustering approach identifies overeating phenotypes, leveraging a non-linear encoder, UMAP for dimensionality reduction, and K-means clustering with z-score analysis for phenotype characterization. Evaluation metrics include AUROC, AUPRC, and Brier score loss for model performance, SHAP for interpretability, and clustering metrics such as silhouette score, homogeneity, and entropy.
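
As a rough illustration of two of the model-performance metrics named in panel b (Brier score loss and AUROC) and of z-score-based cluster characterization, the sketch below uses only the Python standard library. It is not the study's code: the function names, the toy labels and probabilities, and the "chew rate" values are hypothetical, and treating the z-score analysis as "mean standardized feature value per cluster" is an assumption about how such a characterization is commonly done. SMOTE, Bayesian optimization, the encoder, UMAP, K-means, AUPRC, and SHAP are deliberately left out.

```python
from statistics import mean, pstdev


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return mean([(p - y) ** 2 for y, p in zip(y_true, y_prob)])


def auroc(y_true, y_prob):
    """Probability that a random positive scores above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cluster_z_profile(values, labels):
    """Mean z-score of one feature within each cluster, relative to all episodes."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("feature is constant; z-scores are undefined")
    profile = {}
    for cluster in sorted(set(labels)):
        z = [(v - mu) / sigma for v, c in zip(values, labels) if c == cluster]
        profile[cluster] = mean(z)
    return profile


# Hypothetical toy data: 1 = overeating episode, 0 = non-overeating episode.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# Hypothetical chew-rate values for four episodes assigned to two clusters.
chew_rate = [1.0, 1.2, 2.0, 2.2]
cluster_labels = [0, 0, 1, 1]

brier = brier_score(y_true, y_prob)
auc = auroc(y_true, y_prob)
profile = cluster_z_profile(chew_rate, cluster_labels)
```

A lower Brier score indicates better-calibrated probabilities, and an AUROC of 0.5 corresponds to chance-level ranking. In the cluster profile, a positive mean z-score marks a feature that is elevated in that cluster relative to the whole sample, which is one way a phenotype could be described.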