Introduction

The U.S. FDA’s Sentinel System forms a critical component of the national active post-marketing surveillance of medical products1. Historically, Sentinel’s reliance on insurance claims data has limited its ability to address some emerging safety questions that require more granular clinical information2. The FDA Sentinel Real-World Evidence Data Enterprise (RWE-DE), a data infrastructure linking large volumes of electronic health records (EHRs) with claims data, was created to help the FDA address emerging safety questions for which claims data may be insufficient3,4.

A common scenario where EHR linkage can be particularly helpful is when certain outcomes of interest may not be captured reliably in administrative claims. For instance, when assessing the suitability of evaluating the potential risk of acute pancreatitis in Sentinel, the FDA considered that claims-based diagnosis codes for acute pancreatitis may have poor positive predictive value (PPV)5, which was confirmed in a later validation study to be in the range of 55–66%6. Due to the potential for outcome misclassification that may have led to an underestimation of the effect, outcome ascertainment in linked EHR-claims data was proposed as an alternative. In this report, we describe results from a demonstration project that aimed to assess the applicability of the RWE-DE in a use case of the risk of acute pancreatitis following initiation of sodium-glucose cotransporter-2 inhibitors (SGLT-2i) compared to dipeptidyl peptidase-4 inhibitors (DPP-4i) in patients with type 2 diabetes mellitus (T2DM). Specifically, we address the limitation of low PPV for pancreatitis in claims-based diagnosis codes by deploying a previously developed computable phenotyping algorithm that uses elements from EHRs in addition to claims diagnosis codes and is reported to have a PPV of >90%7. We also outline a typical workflow for inferential studies using EHR-claims linked data in Sentinel and provide readily usable analytic code as a turnkey solution for conducting rigorous and timely analyses for future needs of the program.

Results

Study cohorts

After applying all eligibility criteria, the final cohorts included 72,429 patients from HealthVerity (30,174 SGLT-2i initiators; 42,155 DPP-4i initiators) and 24,690 patients from TriNetX (11,943 SGLT-2i initiators; 12,747 DPP-4i initiators).

Table 1 summarizes the key patient characteristics of the included participants. For both data sources, the average age was lower in the SGLT-2i group than in the DPP-4i group (57 years versus 60 years in HealthVerity; 55 years versus 56 years in TriNetX). Overall indicators of health, including CFI and CCI, were comparable between treatment groups in both data sources. Metformin was the most common comedication in both treatment groups across both data sources. The mean count of antidiabetic drug classes was generally comparable (HealthVerity: 1.4 ± 0.8 in the SGLT-2i group and 1.3 ± 0.8 in the DPP-4i group; TriNetX: 1.2 ± 0.8 in the SGLT-2i group and 1.1 ± 0.8 in the DPP-4i group). In HealthVerity, the proportions of patients with myocardial infarction (1.7%, 1.1%), stable angina (4.3%, 3.7%), and heart failure (6%, 5.9%) were similar between the two treatment groups, while in TriNetX the proportions were generally higher in the SGLT-2i group for myocardial infarction (3.7%, 1.2%), stable angina (4.8%, 2.3%), and heart failure (13.3%, 5.7%), likely reflecting increasing use of SGLT-2i in more recent time periods after knowledge of their cardiovascular benefits accumulated.

Table 1 Select patient characteristics before propensity score adjustment in the study cohort of patients with Type 2 diabetes mellitus initiating SGLT-2 inhibitors or DPP-4 inhibitors

Structural missing data investigation

As expected, missingness in EHR-based variables was common (Table 1). In HealthVerity, HbA1c results were available for 29.4% of SGLT-2i initiators and 27.3% of DPP-4i initiators (mean HbA1c: 8.7 ± 1.9, 8.6 ± 1.9). Serum creatinine and triglyceride levels were recorded for around 20–25% of the patients (mean serum creatinine: 0.9 ± 0.5, 0.9 ± 0.5; mean triglycerides: 176 ± 90.6, 171.4 ± 87.4). BMI was recorded for >60% of the patients in both groups (mean BMI: 32.4 ± 5.5, 31.6 ± 5.7). Blood pressure was recorded for >80% of the patients in both groups (mean systolic: 131.3 ± 16.5, 131.3 ± 16.8; mean diastolic: 79 ± 10.2, 78.3 ± 10.3). The total number of EHR encounters was comparable across both groups (mean EHR encounters: 3.4 ± 2.8, 3.5 ± 2.9). In TriNetX, HbA1c results were available for around 50% of the patients in both groups (mean HbA1c: 8.6 ± 2.0, 8.5 ± 1.9). Serum creatinine and triglyceride levels were recorded for 35–60% of the patients (mean serum creatinine: 0.9 ± 0.3, 0.9 ± 0.4; mean triglycerides: 174.2 ± 93.6, 174.3 ± 92.7). BMI was recorded for approximately half of the patients in both groups (mean BMI: 34.8 ± 8.0, 34.5 ± 7.9). Blood pressure was recorded for >60% of the patients in both groups (mean systolic: 134.6 ± 19.6, 134.2 ± 19.0; mean diastolic: 79.8 ± 12.2, 79.5 ± 11.6).

We observed a monotonic missingness pattern in EHR-based variables for both data sources, as patients with missing data for these key variables are likely to exhibit consistent gaps in other EHR-based measurements (Supplementary Figs. 1 and 2). In the missingness diagnostics (Supplementary Table 3), we observed differences in measured variables between those with and without missing data for EHR-based variables, as seen by absolute standardized mean differences with medians of around 0.02–0.05. Next, for models predicting missingness, we observed relatively high areas under the curve (AUCs) for each of these variables, especially in TriNetX. High AUCs suggest that missing at random (MAR) conditional on measured information may be a likely missingness mechanism. Finally, we evaluated associations between the missingness indicator for each of these EHR variables and the outcome of interest (acute pancreatitis). After adjustment for other measured variables, no significant association was observed between the missingness indicators and the outcome. This observation provides some reassurance against a missing not at random (MNAR) mechanism. Overall, we concluded that MAR may be a reasonable assumption regarding the missingness mechanism for these variables and that multiple imputation is therefore likely to provide the best bias-variance trade-off8.
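The first two diagnostics described above can be illustrated with a minimal sketch. This is not the smdi implementation used in the study; the function names and the toy data are hypothetical, and the AUC is computed from an arbitrary score rather than from a fitted prediction model.

```python
from statistics import mean, variance

def abs_standardized_mean_difference(values, is_missing):
    """ASMD of a fully observed covariate between patients with and
    without a missing value on a partially observed EHR variable."""
    miss = [v for v, m in zip(values, is_missing) if m]
    obs = [v for v, m in zip(values, is_missing) if not m]
    pooled_sd = ((variance(miss) + variance(obs)) / 2) ** 0.5
    return abs(mean(miss) - mean(obs)) / pooled_sd

def auc(scores, is_missing):
    """Probability that a patient with a missing value has a higher
    predicted-missingness score than one without (ties count 0.5)."""
    pos = [s for s, m in zip(scores, is_missing) if m]
    neg = [s for s, m in zip(scores, is_missing) if not m]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: age, and whether HbA1c is missing for each patient.
age = [50, 62, 58, 71, 45, 66]
hba1c_missing = [True, False, True, False, True, False]

asmd_age = abs_standardized_mean_difference(age, hba1c_missing)
# Use "younger age" as a stand-in score for the probability of missingness.
auc_age = auc([-a for a in age], hba1c_missing)
```

A large ASMD or a high AUC indicates that missingness can be explained by measured characteristics, which is the pattern interpreted in the text as compatible with MAR.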

Acute pancreatitis risk

Table 2 shows a comparison of incidence rates (IRs) of acute pancreatitis in new users of SGLT-2i and DPP-4i in HealthVerity and TriNetX. The total event count was 236 and 138 in HealthVerity and 107 and 41 in TriNetX, for the ITT and PP causal contrasts, respectively. Overall, we observed event rates in the range of two to three per 1000 person-years across both data sources in the two follow-up schemes. Figure 1 compares the unadjusted cumulative incidence of acute pancreatitis in new users of SGLT-2i versus DPP-4i with T2DM, including both ITT and PP analyses in both data sources. Overall, the plots suggested that the cumulative incidence of acute pancreatitis was comparable between the two treatment groups for both the ITT and PP approaches in both data sources.

Fig. 1

Cumulative incidence of acute pancreatitis before propensity score adjustment in the study cohort of patients with Type 2 diabetes mellitus initiating SGLT-2 inhibitors or DPP-4 inhibitors.

Table 2 Crude incidence rates of acute pancreatitis among patients with Type 2 diabetes mellitus initiating SGLT-2 inhibitors or DPP-4 inhibitors

Supplementary Figs. 3 and 4 show high covariate balance across all covariates after the PS weighting procedure, reported as the range of mean differences between the two treatment groups across 20 imputations. Figure 2 provides the hazard ratios (HRs) for acute pancreatitis in SGLT-2i initiators compared to DPP-4i initiators with T2DM in HealthVerity and TriNetX. In the claims-only analysis, the pooled adjusted HRs were 0.99 (95% CI: 0.84–1.16) for ITT and 0.94 (0.73–1.20) for PP. These estimates moved downward in the claims + EHR augmented analysis [pooled HR 0.85 (95% CI: 0.67–1.07) for ITT and 0.84 (0.58–1.22) for PP].

Fig. 2

Relative risk of acute pancreatitis before and after propensity score adjustment in the study cohort of patients with Type 2 diabetes mellitus initiating SGLT-2 inhibitors compared to DPP-4 inhibitors. *Claims only analysis defined acute pancreatitis based on ICD codes alone and adjusted for >130 claims-based variables. **Claims + EHR augmented analysis defined acute pancreatitis based on a phenotyping algorithm using EHR data and added 6 additional variables for confounding adjustment with missing data imputed using multiple imputation.

Subgroup analyses and robustness evaluation

For all subgroups considered (age <65 and ≥65, males, females, and history of acute pancreatitis risk factors), we found results consistent with the overall population (Fig. 3). In the two sensitivity analyses where we attempted to reduce missingness proportions for EHR-based covariates by increasing the lookback period and by restricting to those with ≥3 EHR encounters, we noted that the capture increased for all EHR-based covariates (Supplementary Fig. 5) and results did not change meaningfully from the primary analysis. The analysis of the ischemic stroke endpoint as a negative control outcome showed no evidence of a difference in risk between SGLT-2i and DPP-4i (pooled HR 0.97, 95% CI 0.82–1.14).

Fig. 3

Results from subgroup and sensitivity analyses (propensity score adjusted estimates from claims + EHR augmented analyses). *Acute pancreatitis risk factors included gallstones, tobacco use, or alcohol abuse. † EHR loyalty cohort analysis was restricted to subjects with ≥3 EHR encounters in the 6 months before cohort entry.

Discussion

The primary contribution of this study is the insight it provides into workflows and salient challenges for complex inferential analyses in the Sentinel System moving forward, where diverse data types including insurance claims and EHRs from numerous data sources are expected to be routinely used. First and foremost, as soon as a safety concern arises regarding a medical product, it will be crucial to identify the data elements that are essential for a given study to ensure fitness-for-purpose of the data sources. For instance, in this study, the need for additional clinical data including amylase and lipase test results to reliably identify acute pancreatitis drove the choice of the claims-EHR linked RWE-DE, which is relatively smaller with approximately 25 million total covered lives between the two sources used in this study, over the insurance claims-based Sentinel Distributed Database, which is much larger with 128.7 million members currently accruing new data. Next, validated algorithms for complex outcomes that are insufficiently defined by diagnosis codes alone, but can be identified with computable phenotyping, will need to be available for deployment in a scalable and timely way. Sentinel has made steady progress in this area with the development and validation of numerous computable phenotypes that are likely to be of high interest as safety outcomes in the future, including anaphylaxis, suicidal ideation, and sleep-related behaviors9,10. Finally, use of EHR data opens the possibility of more robust confounding adjustment for elements that are traditionally not captured by claims data such as BMI, vital signs, and laboratory test results. However, pervasive missingness in these variables in real-world data sources remains a challenge, and appropriate methods to diagnose likely mechanisms and analytically correct for missingness will continue to be of vital importance. Sentinel has made substantial advances in this area as well with the development of reusable analytic tools and methods designed specifically to handle data missingness in EHR-based variables8,11,12. In this study, we were able to demonstrate efficient use of these tools in a routine analytic workflow.

In this use case of a large study involving more than 97,000 patients from the FDA Sentinel RWE-DE commercial network, the incidence of acute pancreatitis was low, and we did not observe evidence for a statistically significant difference in acute pancreatitis risk following initiation of SGLT-2i compared to DPP-4i in patients with T2DM. We noted that when using claims-only diagnosis codes for acute pancreatitis that are known to have poor PPV (0.55–0.65), the estimates were closer to the null than when using Sentinel’s phenotyping algorithm for pancreatitis with a PPV of 0.90. These observations are consistent with the general expectation that non-differential misclassification of the outcome, as likely observed here with the claims-based definition, results in a bias towards the null and could mask important differences in the outcome risk between exposures13. Future investigations focused on the outcome of acute pancreatitis should be wary of this potential bias when using a claims-based definition. Our findings were consistent across two data sources and across subgroups of age, sex, and acute pancreatitis risk with and without additional adjustment for EHR-derived covariates. This study was conducted in collaboration with the FDA as a methodological demonstration, and its findings should be assessed considering the totality of the evidence and not this individual result. Further, direct interpretation of this comparison is inherently challenging as the comparator (DPP-4i) carries a label for acute pancreatitis.

As with any observational investigation, our study has important limitations. First, despite adjusting for many covariates, residual confounding may still be present as treatment decisions are made non-randomly by treating physicians and the factors that influence treatment decisions cannot be readily assessed even using the RWE-DE. While EHR data allowed us to capture laboratory test results and offered enhanced confounding adjustment, we also observed large missingness in the capture of some of these results, such as triglycerides, which are an important risk factor for acute pancreatitis. As a result, residual confounding by factors that are either not recorded or incompletely recorded is possible. Second, while the RWE-DE is one of the largest claims-EHR networks constructed for post-marketing safety surveillance purposes in the U.S., richer data from the claims-EHR linkages are obtained at the expense of a substantially smaller total available population size than in claims-only networks. Therefore, when investigating rare adverse events such as acute pancreatitis, lack of precision may present challenges in drawing conclusions. As such, it should be noted that the conclusion of no difference in acute pancreatitis risk between SGLT-2i and DPP-4i in the primary and subgroup analyses in this study may reflect insufficient power for detecting small effect sizes rather than definitive equivalence. Future work integrating more data sources may be helpful in increasing the population size for the RWE-DE. Finally, validated computable phenotyping algorithms may show performance degradation at different sites. However, it is often infeasible to develop and validate algorithms separately in each data source in near real time when investigating the safety of medications in a large-scale national post-marketing surveillance program like Sentinel. When resources permit, a smaller-scale validation study using manual review of data from newer sites where the algorithm was applied may be considered.

In conclusion, the successful completion of this study in FDA Sentinel’s RWE-DE commercial network serves as a proof-of-concept for future protocol-based assessments in Sentinel requiring EHR data. Analytic pipelines and packages developed by the FDA Sentinel System provide key building blocks to achieve scalable and timely execution of complex analyses using claims-EHR linked data assets.

Methods

Data sources

We used data from the FDA Sentinel RWE-DE commercial network comprising two data partners, HealthVerity and TriNetX. HealthVerity included ambulatory care EHRs from three sources linked to closed medical claims and pharmacy data from 2018 through 2020, while TriNetX included inpatient and ambulatory care EHRs from 20 unique health care organizations (HCOs) linked to closed claims data for the period of 2013–2024 (ref. 3).

Specification and emulation of the target trial

We leveraged the “PRocess guide for INferential studies using healthcare data from routine ClinIcal Practice to evaLuate causal Effects of Drugs (PRINCIPLED)” framework14, established specifically to guide the conduct of inferential studies in Sentinel, for the proposed study. First, we developed the causal question by specifying a target trial protocol comparing the risk of acute pancreatitis in SGLT-2i initiators to DPP-4i initiators (Table 3). We identified DPP-4i as a comparator group as they are also frequently used as second-line treatment for T2DM and may serve as realistic clinical comparators to SGLT-2i. Of note, the DPP-4i prescribing information includes a Warnings and Precautions statement for acute pancreatitis based upon post-marketing data and imbalances in clinical trials15. Next, the emulation of each component of the target trial protocol was described using fit-for-purpose linked claims-EHR data, with exposure information coming from pharmacy claims and outcome and confounder information from both claims and EHR data as described below. The study protocol was publicly posted on the Sentinel website before the analysis began16. Key components of the target trial protocol are described below.

Table 3 Target trial specification and emulation

Eligibility criteria

Cohort entry was the day of first dispensing of either an SGLT-2i or a DPP-4i. The eligibility criteria for the target trial included presence of a T2DM diagnosis, no use of study medications, no history of end-stage renal disease (ESRD), HIV, or acute pancreatitis, and no use of glucagon-like peptide-1 receptor agonists (GLP-1 RAs) within six months before study medication start. We excluded users of GLP-1 RAs because they share a similar mechanism with DPP-4i and there remains uncertainty regarding the risk of pancreatitis after their use, with some studies suggesting increased risk17,18. Patients with a history of pancreatitis, ESRD, or HIV were excluded as these patients may have an elevated risk of future acute pancreatitis events that may not be attributable to the treatment19,20. We further required 6 months of medical and prescription coverage before cohort entry, with an allowable enrollment gap of up to 30 days, as well as at least one EHR encounter to ensure that patients had observable time in the data. This requirement ensured that patients had contact with the healthcare system to allow for adequate capture of clinical codes to measure eligibility criteria and baseline covariates.
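As an illustration of the coverage requirement, the following minimal sketch checks whether a patient was enrolled throughout the 180 days before cohort entry while tolerating gaps of up to 30 days. The function name, the input layout, and the handling of edge cases (a gap at the very start of the window is tolerated, and coverage must extend through the cohort entry date) are assumptions for illustration, not the Sentinel implementation.

```python
from datetime import date, timedelta

def continuously_enrolled(spans, cohort_entry, lookback_days=180, max_gap_days=30):
    """spans: list of (start, end) enrollment dates, both inclusive.

    Returns True if coverage reaches from the start of the lookback
    window through cohort entry with no gap longer than max_gap_days.
    """
    cursor = cohort_entry - timedelta(days=lookback_days)  # first day still needing coverage
    for start, end in sorted(spans):
        if end < cursor:
            continue  # span ends before the part of the window we still need
        if (start - cursor).days > max_gap_days:
            return False  # uncovered stretch exceeds the allowable gap
        cursor = max(cursor, end + timedelta(days=1))
        if cursor > cohort_entry:
            return True  # coverage extends through the cohort entry date
    return False

# Hypothetical patient with a 19-day gap in April 2019.
entry = date(2019, 7, 15)
short_gap = [(date(2019, 1, 1), date(2019, 3, 31)), (date(2019, 4, 20), date(2019, 12, 31))]
long_gap = [(date(2019, 1, 1), date(2019, 3, 31)), (date(2019, 5, 15), date(2019, 12, 31))]

eligible_short = continuously_enrolled(short_gap, entry)
eligible_long = continuously_enrolled(long_gap, entry)
```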

Treatment strategies and follow-up

The two treatment strategies comprised initiation of SGLT-2i (canagliflozin, dapagliflozin, empagliflozin, ertugliflozin, bexagliflozin) or DPP-4i (alogliptin, linagliptin, saxagliptin, sitagliptin), assessed based on pharmacy dispensing data. We estimated observational analogues of the intent-to-treat (ITT or as-started) and per-protocol (PP or on-treatment) effects. Accordingly, follow-up began on the day after exposure initiation and continued until the first occurrence of any of the following: (1) outcome occurrence (acute pancreatitis); (2) health plan disenrollment; (3) recorded death; (4) end of available data; (5) discontinuation of or switching from the initiated treatment (only for PP analysis).
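The follow-up rules above amount to taking the earliest of several candidate dates, with the discontinuation date considered only in the PP analysis. A minimal sketch is given below; the function and argument names are hypothetical, the treatment end date is assumed to be precomputed (for example, from days' supply plus any grace period), and ties between the outcome and a censoring event on the same day are resolved in favor of the outcome as an illustrative choice.

```python
from datetime import date, timedelta
from typing import Optional, Tuple

def follow_up_window(
    cohort_entry: date,
    data_end: date,
    outcome: Optional[date] = None,
    disenrollment: Optional[date] = None,
    death: Optional[date] = None,
    treatment_end: Optional[date] = None,
    per_protocol: bool = False,
) -> Tuple[date, date, bool]:
    """Return (start, end, had_event) for one patient."""
    start = cohort_entry + timedelta(days=1)  # follow-up begins the day after initiation
    # "outcome" is listed first so that it wins ties with a same-day censoring event.
    stops = {
        "outcome": outcome,
        "disenrollment": disenrollment,
        "death": death,
        "data_end": data_end,
    }
    if per_protocol:
        stops["discontinuation_or_switch"] = treatment_end
    observed = [(when, reason) for reason, when in stops.items() if when is not None]
    end, reason = min(observed, key=lambda pair: pair[0])
    return start, end, reason == "outcome"

# Hypothetical patient: stops treatment in March, has the outcome in June.
entry = date(2019, 1, 1)
itt = follow_up_window(entry, date(2020, 12, 31), outcome=date(2019, 6, 1),
                       treatment_end=date(2019, 3, 1))
pp = follow_up_window(entry, date(2020, 12, 31), outcome=date(2019, 6, 1),
                      treatment_end=date(2019, 3, 1), per_protocol=True)
```

In this example the event counts in the ITT analysis but is censored in the PP analysis, which is one reason the PP event counts reported in the Results are lower.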

Outcome

The primary outcome measure was acute pancreatitis assessed using a validated phenotyping algorithm developed for use in Sentinel studies7. Briefly, the outcome was defined probabilistically conditional on information recorded in diagnosis codes and laboratory findings (amylase, lipase, triglycerides). Additionally, for TriNetX, features extracted from clinical notes using natural language processing (NLP) were used in the phenotyping model. In a validation study, when the PPV was fixed at 0.90, the phenotyping model achieved a sensitivity of 0.88 with structured features only and 0.92 when NLP features were added. A detailed list of structured diagnosis codes, lab tests, and NLP features used in the phenotyping model can be found in Supplementary Table 1.
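To make the idea of "fixing the PPV" concrete, the sketch below scans candidate probability cutoffs on a labeled validation set and returns the lowest cutoff whose PPV meets the target, together with the sensitivity at that cutoff. The function name and the toy data are hypothetical, and the published algorithm may select its operating point differently.

```python
def threshold_for_ppv(probs, labels, target_ppv=0.90):
    """Return (cutoff, ppv, sensitivity) for the lowest cutoff with PPV >= target.

    probs: predicted outcome probabilities; labels: 1 for a chart-confirmed
    case, 0 otherwise. Returns None if no cutoff reaches the target PPV.
    """
    total_cases = sum(labels)
    for cutoff in sorted(set(probs)):
        flagged = [y for p, y in zip(probs, labels) if p >= cutoff]
        true_pos = sum(flagged)
        ppv = true_pos / len(flagged)  # flagged is never empty: cutoff is an observed value
        if ppv >= target_ppv:
            # Sensitivity can only fall as the cutoff rises, so the first
            # qualifying cutoff is the most sensitive one.
            return cutoff, ppv, true_pos / total_cases
    return None

# Hypothetical validation sample of ten patients.
probs = [0.95, 0.90, 0.85, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
cutoff, ppv, sensitivity = threshold_for_ppv(probs, labels)
```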

Covariates

Patient characteristics were assessed during the 180 days before and including the cohort entry date. These included several claims-based characteristics such as demographics, medications, comorbidities, health service utilization metrics, and indices for general health including a Claims-based Frailty Index (CFI)21 and Combined Comorbidity Index (CCI)22. EHR-based characteristics such as laboratory test results (HbA1c, serum creatinine, triglycerides), vitals, and lifestyle factors (body mass index, blood pressure, tobacco use) were also assessed. Supplementary Table 2 contains a detailed description of all covariates.

Statistical analysis

To investigate the added value of EHR data in this study, we first conducted a claims-only analysis that defined acute pancreatitis based on ICD codes alone and adjusted for >130 claims-based variables and then conducted a claims + EHR augmented analysis that defined acute pancreatitis based on the phenotyping algorithm and added 6 additional variables from EHRs for confounding adjustment, both described above.

In all analyses, we used a propensity score (PS) based fine-stratification weighting method with 50 strata for confounding adjustment by measured factors23. We estimated PS as the predicted probability of initiating SGLT-2i compared to DPP-4i given the baseline patient characteristics from fitting a multivariable logistic regression model separately by database. Fifty strata were created based on the distribution of PS in SGLT-2i initiators, and DPP-4i initiators were assigned into these strata based on their PS resulting in 50 unequally sized strata. In the weighting step, DPP-4i initiators in each stratum were weighted proportional to the number of SGLT-2i initiators to account for stratum membership and achieve balance. The PS fine-stratification weighting approach, as implemented in this study, targets the average treatment effect in the treated (ATT), which is considered to be a highly relevant estimand for drug safety investigations24,25. Notably, other PS-based approaches that target different estimands such as average treatment effect in the whole population or average treatment effect in an overlapping population are available and can also be considered depending on the research question of interest.
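The weighting step can be sketched as follows. This is a simplified illustration of a commonly used formulation of ATT fine-stratification weights, in which treated patients receive a weight of 1 and comparators are weighted by the ratio of the treated share to the comparator share in their stratum; it is not the MatchThem code used in the study. The function name and toy data are hypothetical, and the sketch omits trimming of non-overlapping PS regions.

```python
from bisect import bisect_right
from collections import Counter

def fine_stratification_att_weights(ps, treated, n_strata=50):
    """Return (stratum, weight) lists for ATT fine-stratification weighting.

    ps: propensity scores; treated: True for SGLT-2i initiators,
    False for DPP-4i initiators.
    """
    treated_ps = sorted(p for p, t in zip(ps, treated) if t)
    n_t = len(treated_ps)
    n_c = len(ps) - n_t
    # Interior cut points at quantiles of the PS distribution among the treated.
    cuts = [treated_ps[(k * n_t) // n_strata] for k in range(1, n_strata)]
    stratum = [bisect_right(cuts, p) for p in ps]
    t_count = Counter(s for s, t in zip(stratum, treated) if t)
    c_count = Counter(s for s, t in zip(stratum, treated) if not t)
    weights = []
    for s, t in zip(stratum, treated):
        if t:
            weights.append(1.0)  # the treated define the target population
        elif t_count[s] == 0:
            weights.append(0.0)  # comparator in a stratum with no treated patients
        else:
            weights.append((t_count[s] / n_t) / (c_count[s] / n_c))
    return stratum, weights

# Hypothetical toy cohort with two strata instead of fifty.
ps = [0.2, 0.4, 0.6, 0.8, 0.1, 0.3, 0.5, 0.55, 0.7, 0.9]
treated = [True] * 4 + [False] * 6
stratum, weights = fine_stratification_att_weights(ps, treated, n_strata=2)
```

In the toy cohort, comparators in the crowded lower stratum are down-weighted and those in the sparse upper stratum are up-weighted, so that the weighted comparator distribution across strata matches that of the treated.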

As diagnostics for PS models, we evaluated distributional overlap, weight distribution, and individual covariate balance using standardized differences post-weighting. In the weighted population, we estimated the hazard ratio for acute pancreatitis among initiators of SGLT-2i versus DPP-4i using a Cox proportional hazards model. Cumulative incidence was calculated using cumulative incidence functions and reported stratified by treatment group26.

Anticipating missing data in EHR-derived variables, we identified and described possible missingness patterns and mechanisms among partially observed covariates based on the observed data using the smdi R package11. After diagnosing the likely missingness mechanisms, we applied the corresponding multiple imputation methods to analytically address missingness in all EHR-based covariates. We created 20 imputed datasets in which missing covariates were imputed using random forest algorithms. In each of the imputed datasets, we fit the PS models and conducted PS fine stratification to calculate adjusted treatment effect estimates using the MatchThem R package27. The estimates were then pooled using Rubin’s rules to account for variance both within and across the imputed datasets28.
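The pooling step can be illustrated with a minimal sketch of Rubin’s rules applied to log hazard ratios. The function name and the input values are hypothetical. For simplicity the confidence interval uses a normal approximation rather than the t reference distribution with adjusted degrees of freedom that full implementations typically use.

```python
import math
from statistics import mean, variance

def pool_rubin(log_hrs, std_errors, z=1.96):
    """Pool per-imputation log hazard ratios; return (HR, lower, upper)."""
    m = len(log_hrs)
    pooled = mean(log_hrs)                          # pooled point estimate
    within = mean([se ** 2 for se in std_errors])   # average within-imputation variance
    between = variance(log_hrs)                     # between-imputation variance
    total = within + (1 + 1 / m) * between          # total variance
    se = math.sqrt(total)
    return math.exp(pooled), math.exp(pooled - z * se), math.exp(pooled + z * se)

# Hypothetical estimates from three imputed datasets (the study used 20).
hr, lower, upper = pool_rubin([-0.20, -0.10, -0.15], [0.10, 0.10, 0.10])
```

Because the between-imputation variance is added to the average within-imputation variance, the pooled interval is wider than the interval from any single imputed dataset whenever the estimates differ across imputations.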

Subgroup analyses and robustness evaluations

All subgroup and robustness evaluations were conducted with the claims + EHR augmented analysis. We performed subgroup analyses in the following prespecified strata: age (<65 versus ≥65), sex (male versus female), and history of risk factors for acute pancreatitis (gallstones, tobacco use, alcohol abuse). We conducted two sensitivity analyses aimed at reducing missingness in EHR-based confounders: (1) increasing the baseline window to 12 months before cohort entry, and (2) restricting the analysis to subjects with ≥3 EHR encounters in the 6 months before cohort entry. Finally, we evaluated ischemic stroke as a negative control outcome to detect net bias in the primary analysis29.