Introduction

Celiac disease (CD) is an immune-mediated disease characterized by small bowel enteropathy triggered by dietary exposure to gluten. The classic clinical presentation of CD includes signs of malabsorption and gastrointestinal symptoms such as abdominal pain, bloating and diarrhea1,2. However, many patients experience predominantly non-specific extra-intestinal symptomatology3,4, or are asymptomatic5. The diverse manifestations of CD can make it challenging for primary care providers (PCPs) to identify and diagnose CD in the general population6. Indeed, a majority of adult patients with CD today are likely undiagnosed5,6,7,8. Patients who eventually receive a diagnosis experience a mean delay of eleven years from symptom onset to diagnosis, with more than half reporting a delay of five or more years until the diagnosis is established6. CD has an estimated global prevalence of over 1%9, and a rising incidence in recent years6,10,11,12. Diagnostic delays and underdiagnosis of CD therefore represent an important healthcare problem13.

Screening for CD is typically performed by highly sensitive and widely available serum tests for CD autoimmunity (CDA): the most accurate and commonly used being the assay for antibodies to tissue-transglutaminase (tTG-IgA)14. Seropositivity is suggestive of underlying CD, and such patients typically undergo endoscopic evaluation to establish or rule out the diagnosis of CD via biopsy of intestinal mucosa15. High seropositivity (tTG-IgA > 10X ULN) is associated with a high (> 95%) positive predictive value (PPV) for villous atrophy and, in the proper clinical setting, is considered sufficient for diagnosis without a biopsy in children and possibly in adults7,14,15,16.

Clinical guidelines do not provide clear and consistent recommendations on when to screen for CDA. General population screening is not currently recommended17, although screening high-risk patients may be warranted18,19. What defines high-risk groups for CD varies somewhat between reports, although the focus has typically been on a family history of CD and medical comorbidities with established associations with CD20. Beyond these factors, laboratory abnormalities such as iron-deficiency anemia are common among patients with CD21. While PCPs may be aware of specific risk factors, signs and symptoms of undiagnosed CD, more subtle combinations of clinical features may go unrecognized.

Machine learning (ML) algorithms have the potential to use existing data within a patient’s electronic medical record (EMR) to provide risk assessments to providers22. ML models have been developed to alert intensive care unit physicians to patients at risk of circulatory failure23, to identify clinically significant portal hypertension in non-alcoholic steatohepatitis patients from pathology reports24, to predict incident hypertension25 and hypertension outcomes26, to identify patients with undiagnosed psoriatic arthritis27 or hepatitis C28, to predict dementia onset29, IgA nephropathy30 and future Parkinson’s disease diagnoses31, and to flag patients at risk of advanced colorectal cancer32. One previous study that attempted to develop ML models to identify patients with incident CD using a variety of modeling methods found that the models were not consistently better than chance33. Another study showed positive results, but the study size was small and the models relied on symptoms extracted from unstructured clinical documents and diagnostic codes34. Neither study included objective laboratory test results as predictive input features, which may have hindered performance and limited generalizability.

The goal of the current study was to develop and assess a prescreening EMR-based tool to classify adult and adolescent patients by risk of having unidentified CD autoimmunity using commonly available clinical features. Five algorithms were trained and tested: logistic regression, decision tree, random forest, XGBoost and multilayer perceptron. Each algorithm was then assessed on discriminative ability as measured by the area under the receiver operating characteristic curve (AUC). Input features included age, biological sex and results from commonly available laboratory tests performed as part of complete blood counts and comprehensive metabolic, iron and lipid panels. Incident cases were identified from a large retrospective community-based dataset using results from tTG-IgA testing. Highly seropositive cases (tTG-IgA > 10X ULN) with probable underlying CD were used for model training and evaluation against cohorts of controls with no evidence of disease. Performance was additionally assessed for the highest-performing model in a test set consisting of a cohort of seropositive cases (tTG-IgA > 2X ULN) who may require endoscopic evaluation for CD and a cohort of controls. In both test sets, AUC was assessed at multiple time points before the first documented evidence of CD autoimmunity.

Methods

Dataset

The dataset for this retrospective study consisted of deidentified EMR data from Maccabi Health Services (MHS), Israel’s second-largest health maintenance organization (HMO)35. Data were accessed via the Kahn Sagol Maccabi Research and Innovation Centre (KSM) and extracted using the MDClone platform (version 5.5.0.4; https://www.mdclone.com/), a proprietary software. The dataset was de-identified by KSM, and no personal identifying information was made available to the researchers. The dataset contains records from 2,963,864 unique patients with longitudinal data, as members rarely change HMOs. Unstructured data, including progress notes and pathology or procedure reports, were not made available by KSM. The study therefore relied entirely on the structured data, specifically patient demographics and laboratory results. Approval for use of the dataset and the retrospective analysis was obtained from the Maccabi institutional review board (approval #0052-20-MHS), and the study was conducted in accordance with the Declaration of Helsinki and all relevant guidelines and regulations. Informed consent was waived as all identifying information had been removed by KSM.

Study cohort definitions

Patients were eligible for inclusion if they (1) were MHS members during the study period (2005–2021), and (2) joined MHS before 2005. The eligible population was randomly split into train (80%) and test (20%) sets. The split was performed by assigning patients whose randomly generated hash ID had 2 or 6 as its third-to-last digit to the test set, and all others to the train set.
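For illustration, a minimal sketch of this hash-based split logic is given below; the column name "hash_id" and its string format are assumptions, as the actual field names in the MDClone extract were not described.

```python
# Minimal sketch of the 80/20 split described above (assumed column name "hash_id").
import pandas as pd

def assign_split(patients: pd.DataFrame) -> pd.DataFrame:
    """Assign 'test' if the third-to-last digit of the hash ID is 2 or 6, else 'train'."""
    third_to_last = patients["hash_id"].astype(str).str[-3]
    patients = patients.copy()
    patients["split"] = third_to_last.isin(["2", "6"]).map({True: "test", False: "train"})
    return patients

# Toy example: two of ten possible digits (2 and 6) map to test, giving the ~20% share.
print(assign_split(pd.DataFrame({"hash_id": ["0012345", "0098261", "0000167"]})))
```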

In this study we distinguish between seropositive cases in general and highly seropositive cases with likely underlying CD. Cohorts of highly seropositive cases were selected in both train and test sets, and an additional cohort of seropositive cases was identified using case identification criteria (CIC) described below.

Highly seropositive cases were defined as patients having at least one documented tTG-IgA test ≥ 10X ULN, which has an extremely high (> 95%) PPV for duodenal biopsy proven CD7,36,37,38. CIC for seropositive cases included all patients with at least one documented tTG-IgA > 2X ULN. This definition is more sensitive for underlying CD, but also includes a higher proportion of cases that would not have pathologic evidence of disease7. Reports of procedures including endoscopic evaluations were not available to researchers, so pathologic evidence of CD could not be confirmed or ruled out.

To reduce the possibility of including previously diagnosed cases, cases were excluded if they had a history of tTG-IgA levels within normal limits before their first positive tTG-IgA, or a CD diagnosis code more than one year before their first positive tTG-IgA. Providers may order serology on patients with known CD to check compliance with a gluten-free diet (GFD), so such cases may have been previously diagnosed and therefore not incident cases relevant to this study. For each seropositive case, the earlier of the first positive serology or the first diagnosis of CD was defined as that patient’s index date.

Screening serology was performed with one of two commercial kits used during the study period: Celiakey (Thermo Fisher Scientific, USA) for the years 2005–2011 and Elia (Thermo Fisher Scientific, USA) for the years 2012–2021, with manufacturer-established ULN values of 5 U/mL and 7 U/mL, respectively.

Controls were identified as eligible MHS patients with no documented CD diagnosis code and no serologic evidence of CD autoimmunity. One cohort of controls was selected to match the train-set cases, and two cohorts of controls in the test set were selected to match the two cohorts of cases (highly seropositive and seropositive). Controls were matched to cases by years of data availability at the maximal possible ratio of controls to cases. Controls were assigned the same index dates as the cases to which they were matched. Controls were not matched by demographic characteristics, to allow the model to learn the relationship between these features and incident CD seropositivity.

Each patient included in the train or test sets was additionally assigned a run date, defined as the first of July of the calendar year preceding the patient’s index date. This gap between run date and index date is referred to as the one-year gap, reflecting the mean time between run dates and index dates across the cohort. The gap between the run date and the index date was added to account for potential clinical suspicion of CD immediately preceding the index date. Analyses were also conducted at further time gaps of up to four years prior to index dates to test the ability of the model to identify patients years before initial suspicion of disease. This methodology was described in detail in a previous report on patients with psoriatic arthritis27. Patients were additionally excluded if on their run date they did not meet the following criteria: (i) age ≥ 12 and ≤ 85 years, (ii) MHS membership for at least four years, and (iii) at least one complete blood count (CBC) during the four years prior to their index date. The patient selection process is depicted in Fig. 1.
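A minimal sketch of the run-date assignment is shown below, assuming the index date is available as a Python date; extending the calculation to the wider two- to four-year gaps by shifting the run date back additional whole years is an assumption about the mechanics, shown only for illustration.

```python
# Sketch of run-date assignment: 1 July of the calendar year preceding the index date.
# Extension to wider gaps (shifting back additional whole years) is an assumption.
from datetime import date

def run_date(index_date: date, gap_years: int = 1) -> date:
    return date(index_date.year - gap_years, 7, 1)

print(run_date(date(2018, 3, 15)))               # 2017-07-01 (one-year gap)
print(run_date(date(2018, 3, 15), gap_years=4))  # 2014-07-01 (four-year gap)
```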

Fig. 1
figure 1

Flow chart of patient selection. Highly seropositive (tTG-IgA > 10X ULN) cases with no previous evidence of celiac disease seropositivity were identified and excluded if at their index date they (i) were younger than 12 or older than 85 years, (ii) had been members of MHS for fewer than four years, or (iii) had no documented complete blood count (CBC) during the four years prior to their index date. Controls were matched to cases by years of data availability at the assigned run date where they met eligibility criteria.

Model development

Five candidate models were developed to classify patients as at-risk or not at-risk for having undiagnosed CD autoimmunity. Each model was fit to the training data using one of the following algorithms: logistic regression, decision tree, random forest, XGBoost and multilayer perceptron. Logistic regression estimates the log-odds of an outcome as a linear combination of weighted input features. The estimate can then be converted into a probability using the logistic (inverse logit) function, which can be used for binary classification given a predefined cutoff. A decision tree performs classification by recursively partitioning the data based on the input features. Random forest is an ensemble method that fits multiple decision trees and performs classification based on the majority vote of the trees39,40. XGBoost builds multiple decision trees sequentially, using gradient boosting to minimize the errors made by previous trees26,39,40,41. A multilayer perceptron is a type of artificial neural network that consists of an input layer, one or more hidden layers and an output layer40. Neurons of the input layer represent the input features, which activate neurons in the hidden layers to produce an output from the output layer. This output can then be converted into a classification. The models were trained and evaluated using data available during the three years prior to the run date. Training was performed at a one-year time gap between run dates and index dates. All models were implemented using the scikit-learn library v1.342 and the scikit-learn compatible XGBoost library.
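A minimal sketch of instantiating the five candidate classifiers with scikit-learn and the scikit-learn compatible XGBoost API is shown below; the hyperparameter values are placeholders, since the actual values were selected by cross-validation (Supplementary Table 2).

```python
# Sketch of the five candidate classifiers named above (placeholder hyperparameters).
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

candidate_models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=200),
    "xgboost": XGBClassifier(eval_metric="logloss"),
    "multilayer_perceptron": MLPClassifier(hidden_layer_sizes=(64,)),
}

# X_train, y_train are the feature matrix and case/control labels (not shown here).
# for name, model in candidate_models.items():
#     model.fit(X_train, y_train)
```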

Candidate independent predictors consisted of basic demographic information (biological sex and age at run date) and laboratory test results extracted from the structured data (Supplementary Table 1). Laboratory features included all individual components of the comprehensive metabolic panel and complete blood count with differential, as well as ferritin and high-density lipoprotein (HDL), as patients with celiac disease often have iron-deficiency anemia4 and low HDL43. The most recent available laboratory result of each type was log-transformed by sex and age group. Missing data were given a special indication and the models treated these as null values. For the logistic regression, null values were replaced by median values for the feature, calculated by biological sex and age group. SHapley Additive exPlanations (SHAP) values were used both for model selection and to explain the output of the XGBoost model44.
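The sketch below illustrates this feature preparation, under the assumption of a per-patient table with one column per most-recent lab value plus "sex" and "age_group" columns; the exact form of the sex- and age-stratified log transformation was not fully specified, so a simple log1p is used as a stand-in.

```python
# Sketch of feature preparation: log-transform the most recent lab values and, for the
# logistic regression only, impute missing values with medians computed within each
# sex and age group. Column names ("sex", "age_group") are illustrative assumptions.
import numpy as np
import pandas as pd

def prepare_features(df: pd.DataFrame, lab_cols: list, impute_for_lr: bool) -> pd.DataFrame:
    out = df.copy()
    out[lab_cols] = np.log1p(out[lab_cols])          # placeholder log transform
    if impute_for_lr:                                 # logistic regression cannot use NaN directly
        for col in lab_cols:
            group_median = out.groupby(["sex", "age_group"])[col].transform("median")
            out[col] = out[col].fillna(group_median)
    return out                                        # other models receive NaN as the missing indication
```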

Models were developed to provide each patient with an output score given the patient’s input features. At a given threshold, the score can be converted into a binary classification of positive (at-risk for undiagnosed CD) or negative (not at-risk for undiagnosed CD). The predicted classes assigned to each patient are then evaluated against the ground-truth labels established by the CIC as described above (i.e., incident case of CD autoimmunity or control with no documented evidence of CD).
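A minimal sketch of this thresholding step is shown below; the threshold value itself is a deployment choice and is not fixed by the study.

```python
# Convert continuous risk scores into binary at-risk / not-at-risk flags at a threshold.
import numpy as np

def classify(scores: np.ndarray, threshold: float) -> np.ndarray:
    """1 = at-risk for undiagnosed CD autoimmunity, 0 = not at-risk."""
    return (scores >= threshold).astype(int)

print(classify(np.array([0.05, 0.40, 0.92]), threshold=0.5))  # [0 0 1]
```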

The hyperparameters for each algorithm were selected by performing five-fold cross-validation within the train set and optimizing for average precision (Supplementary Table 2). Each algorithm was then retrained on the entire train set with its selected hyperparameters, yielding one final model per algorithm for evaluation on the test set.
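A sketch of this tuning step for the XGBoost model, using scikit-learn's GridSearchCV with five folds and average precision as the scoring metric, is given below; the parameter grid shown is an illustrative assumption, not the grid actually searched (see Supplementary Table 2).

```python
# Five-fold cross-validated hyperparameter search optimizing average precision.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {                         # illustrative grid only
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    estimator=XGBClassifier(eval_metric="logloss"),
    param_grid=param_grid,
    scoring="average_precision",       # optimization target described above
    cv=5,                              # five-fold cross-validation within the train set
)
# search.fit(X_train, y_train)         # X_train/y_train not shown
# final_model = search.best_estimator_ # refit on the full train set with the selected values
```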

Model selection and evaluation

To test the ability of the models to identify incident CD seropositivity prior to the first documented evidence of disease, the performance of each model was assessed in the test set of highly seropositive cases and controls at run dates one year prior to patients’ index dates. The model with the highest AUC was then additionally evaluated at run dates with two-, three- and four-year time gaps prior to each patient’s index date, as well as at all four time gaps in the test set consisting of seropositive patients and controls. Tests at each time gap were performed on the same base cohort selected according to the criteria described above. Patients from each test cohort who did not meet eligibility criteria at the earlier run dates were removed from those analyses.
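A minimal sketch of this AUC assessment on a held-out test cohort is shown below; the variable names and the helper that rebuilds the feature matrix at each time gap are assumptions for illustration.

```python
# Discrimination assessment: AUC of the positive-class score on the held-out test cohort.
from sklearn.metrics import roc_auc_score

def evaluate_auc(model, X_test, y_test) -> float:
    scores = model.predict_proba(X_test)[:, 1]    # probability of the at-risk class
    return roc_auc_score(y_test, scores)

# for gap_years in (1, 2, 3, 4):                             # time gaps described above
#     X_gap, y_gap = build_features(test_cohort, gap_years)  # hypothetical helper
#     print(gap_years, evaluate_auc(best_model, X_gap, y_gap))
```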

Results

Cohorts of cases of CDA and controls in the train and test sets are described in Table 1.

Table 1 Descriptive characteristics of the train and test cohorts. The first test cohort includes cases of celiac disease autoimmunity with high seropositivity (tTG-IgA > 10X ULN) and matched controls. The second test cohort includes all cases of seropositivity with tTG-IgA > 2X ULN.

Performance of each model on the high-seropositivity test cohort at a gap of one year is shown in Fig. 2. At the one-year gap, discriminatory AUC was highest for the XGBoost model at 0.86, followed by the models using logistic regression (AUC = 0.85), random forest (AUC = 0.83), multilayer perceptron (AUC = 0.80) and decision tree (AUC = 0.77). Feature contributions for the XGBoost model in this test are depicted by SHAP analysis (Fig. 3). The XGBoost model also had the best performance at the two-, three- and four-year gaps, with AUC values of 0.83, 0.82 and 0.81, respectively (Supplementary Table 3).

Fig. 2
figure 2

Receiver operating characteristic (ROC) plot for identification of patients with highly seropositive (tTG-IgA > 10X ULN) celiac disease autoimmunity versus controls one year prior to first documented evidence of disease. The performance of five modeling modalities is compared: XGBoost (XGB), logistic regression, random forest, multilayer perceptron (MLP) and decision tree. Figure prepared with Matplotlib v3.8 (https://matplotlib.org/).

Fig. 3
figure 3

SHAP plot depicting the contribution of the input features at the one-year time gap prior to first documentation of celiac disease (CD) autoimmunity for the XGBoost model. Positive SHAP values contribute positively to classification of a patient as at-risk for undiagnosed CD autoimmunity. A higher value for a given feature is indicated in red, with lower values in blue. Biological sex was coded as a binary variable: 1 = Female; 0 = Male. Figure prepared with SHAP v0.42 (https://shap.readthedocs.io).
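For reference, a plot like Fig. 3 can be generated with the SHAP library roughly as sketched below, assuming a fitted XGBoost model and the test-set feature matrix (variable names are illustrative).

```python
# Sketch of producing a SHAP summary (beeswarm) plot for a fitted XGBoost model.
import shap

explainer = shap.TreeExplainer(xgb_model)        # xgb_model: fitted XGBClassifier (assumed)
shap_values = explainer.shap_values(X_test)      # one value per feature per patient
shap.summary_plot(shap_values, X_test)           # red = high feature value, blue = low
```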

The XGBoost model was then tested on the test set consisting of seropositive cases and controls. Its ability to distinguish between cases and controls in this test set, assessed by AUC, was 0.79, 0.77, 0.75 and 0.75 at gaps of one, two, three and four years, respectively (Supplementary Table 4).

Discussion

In the current study, we describe the development and comparative evaluation of five ML models for identifying adult and adolescent patients with incident CD autoimmunity prior to the first documented evidence of disease. Based on AUC, XGBoost exhibited the strongest ability to distinguish between highly seropositive cases of CD autoimmunity and controls one year prior to initial documentation, followed by the models using logistic regression, random forest, multilayer perceptron and decision tree, respectively. XGBoost is a particularly robust ML method for tabular data and often produces the best performance compared with other ML methods26,30,39,40,41. The model showed excellent discriminatory ability (AUC > 0.80) between highly seropositive cases with likely CD and controls at gaps of one, two, three and four years. The model also showed good discriminatory ability (AUC > 0.7) at all time gaps for more broadly defined seropositive cases compared to controls. These findings suggest the potential utility of this model as a prescreening tool to identify patients at risk of having CDA for eventual evaluation for CD. The final model achieved these results using commonly available laboratory results and demographic features.

The relationships between the demographic and laboratory features of the XGBoost model, as depicted in the SHAP plot, are consistent with established phenomena among patients with untreated CD. The model identified decreased hemoglobin, ferritin, mean cell hemoglobin (MCH), mean cell hemoglobin concentration (MCHC) and mean cell volume (MCV) as predictive of undiagnosed CD autoimmunity. These laboratory findings are characteristic of anemia secondary to malabsorption of dietary iron and chronic inflammation in the setting of CD1. Increased liver function tests (LFTs), specifically alanine aminotransferase (ALT), aspartate aminotransferase (AST) and alkaline phosphatase, contributed positively to classification as at-risk for the model. Among patients with newly diagnosed CD, 20–50% have elevated LFTs, and undetected CD accounts for an estimated 4% of cases of unexplained transaminitis45,46. Low high-density lipoprotein (HDL) has frequently been associated with untreated CD47, and the combination of unexplained iron-deficiency anemia and low HDL may be particularly suggestive of CD48. For the model, low HDL contributed to a positive classification as at-risk for undiagnosed CD autoimmunity. Longitudinal studies have found that these laboratory abnormalities typically resolve after initiation of a GFD, including anemia49, transaminitis50 and low HDL51,52, further highlighting the importance of identifying undiagnosed cases of CD early.

Earlier identification and treatment of CD have been shown to have clinical benefits: screen-detected asymptomatic and mildly symptomatic adult patients typically show improved intestinal histology and reduced serum autoantibody levels after adherence to a GFD53,54. Late diagnosis, in contrast, has been associated with unfavorable clinical outcomes. Patients diagnosed after 40 years of age are more likely than younger patients to show persistent signs of intestinal mucosal injury, including villous blunting and increased intraepithelial lymphocytes in the duodenum, despite following a strict GFD55. Among symptomatic patients, diagnostic delays are associated with poorer long-term outcomes even after initiation of a GFD, including persistent gastrointestinal and extra-gastrointestinal symptoms3,4,56, increased utilization of healthcare resources, and lower reported quality-of-life measures8,57,58. Nevertheless, CD is a highly heterogeneous disease, and the long-term benefits of early identification by screening should be established in future studies59.

To our knowledge, this is the first report of an ML model able to distinguish cases of incident CDA from controls within a large community-based setting using only commonly available laboratory results, biological sex and age in adults and adolescents. The model may have clinical value as a prescreening tool to identify patients who should be evaluated for CDA. Patients who are found to be seropositive can then undergo further evaluation for CD according to clinical guidelines, including additional serologic testing or endoscopic evaluation with multiple biopsies60.

A major strength of the study was the size and continuity of follow-up in the longitudinal dataset, which allowed for large patient cohorts with rich historical data. An additional strength is that, by using commonly available laboratory results and demographic information, the model can be portably implemented in most EMR systems. Follow-up studies are needed to evaluate the robustness of the model across other datasets with different population demographics and EMR systems. Additionally, prospective studies should explore the PPV of a model flag for CDA seropositivity and for undiagnosed CD. This model, if validated, may assist PCPs in identifying patients with CDA at the point of care, or assist health systems in identifying patients in their populations who may benefit from screening for CDA.

The study also had limitations. Due to lack of access to unstructured data such as clinical documentation, patients were selected by the proxy of serologic test results rather than by endoscopy results. Some patients selected as cases with CDA may not have had CD, as diagnostic status reported in clinical documents was not made available to establish CD status; highly seropositive patients are, however, very likely to have CD7. Some controls may also have had CD that was undocumented or undiagnosed at the time the study was conducted. Additionally, there may be heterogeneity in the full population that was not captured in the cohorts of cases and controls used for model training and evaluation. Future studies should examine the robustness of the model and its ability to perform on populations and health care systems that may differ from MHS, through retrospective and prospective validation studies.

In conclusion, this study presents a ML model based on a large population dataset that can identify adults and adolescents at risk of undiagnosed CD autoimmunity using commonly available structured clinical and demographic data.