Introduction

Opioids are among the top 10 most-prescribed prescription medications in the U.S., and about 80% of surgical patients are treated with opioids for acute post-surgical pain1,2.

Opioids are also commonly prescribed for patients with moderate or severe chronic pain that is not managed well by non-opioid drugs3. Starting in the early 1990s, opioid prescriptions increased significantly for pain management, leading to surges in overdoses, opioid use disorder (OUD), and the so-called “opioid crisis”4,5. While opioid drugs are very effective for controlling pain, they are highly addictive6.

Side effects of opioid use include respiratory depression and excessive sedation7. Further, patients who take opioids for longer than 90 days have an increased risk of developing OUD8. In the U.S., up to 3 million people have current or past OUD9. It also has been estimated that 80,816 deaths were related to opioid overdose in the United States in 202110. In recent years, opioid prescription rates have dropped precipitously and most deaths are due to illicit fentanyl, but prescription opioids are still associated with about 12,000 overdose deaths in the U.S. each year11. Additionally, opioid-related adverse drug events (ORADEs) can cause harmful patient outcomes, including inpatient costs, readmissions, and mortality12.

Genome-wide association studies (GWAS) have suggested that both OUD and opioid-related patient responses have strong genetic underpinnings13,14,15,16,17,18,19. GWAS have identified significant genomic loci and related genes that can affect the efficacy, metabolism, and adverse effects of opioids, which can in turn cause heterogeneous individual responses to these drugs, including both pain levels and development of addiction20,21,22. This is particularly relevant to codeine: polymorphisms alter the function and expression of CYP2D6, the gene responsible for its metabolism, and these can vary significantly between individuals23.

Although promising studies continue to improve our understanding of the genetic architecture of opioid use disorder24, phenotyping opioid-related conditions in large patient populations remains a significant barrier to exploring its genetics. Collecting OUD-related diagnostic information from patients can be time-consuming, complicating the assembly of large sample sizes for GWAS25. One recent genomic study used medication use as a surrogate phenotype to explore disease etiology26. The results suggest that the genetic signature of taking disease-relevant medication could be used to predict future risk of disease. Electronic health record (EHR) datasets contain a large volume of prescription information with high fidelity, which can serve as a useful source for medication use-based phenotypes27,28,29. This phenotyping method could be particularly useful for diseases like OUD30.

In this proof-of-principle study, we utilized matched EHR and genotyping data in the Mass General Brigham (MGB) Biobank, a large clinical data depository with patient records from multiple hospitals, to develop opioid prescription-based phenotypes. We selected codeine, one of the most commonly prescribed opioids worldwide31,32, as a test case and used the number of prescriptions, an easily generalized trait, for phenotype development. We constructed multiple prescription count-dependent pattern measures for genetic analysis. We then used both GWAS and polygenic risk score methods to investigate the genetic basis of these prescription patterns.

Methods

Data source

The clinical and genetic data in this study were obtained from the MGB Biobank, a large integrated database that includes high-quality clinical data from multiple Harvard-affiliated hospitals33. For our genome-phenome association study, we extracted matched genetic and clinical phenotype information from 36,239 subjects of European ancestry, as determined from patient self-reported records. The present analysis includes only individuals of European ancestry to minimize the risk of confounding due to ancestry differences. The study’s protocol was reviewed and approved by the Mass General Brigham Human Research Committee (study design summarized in Fig. 1).

Fig. 1. Summary of study design.

Codeine count granularity measurement phenotypes

We screened ~16 million medication records from 2010 to 2020 in the study population and identified codeine prescription records using a keyword search. Three categories of codeine prescription count measures were used to develop 8 phenotypes reflecting different levels of information granularity:

  (1) Three low-count prescription phenotypes: patients with 1, 2 or 3 codeine prescriptions.

  (2) Four high-count prescription phenotypes: patients with 4 or more, 5 or more, 6 or more, or 7 or more codeine prescriptions. For both low- and high-count prescription groups, the control group was defined as patients with no opioid prescriptions.

  (3) All-count prescription phenotype: codeine prescription count was coded as integers and winsorized at 8 prescriptions to reduce the influence of outliers.
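The three phenotype categories above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's pipeline: the record layout, the function names and the phenotype labels are hypothetical. It also assumes that each low-count phenotype means exactly 1, 2 or 3 prescriptions, and that the all-count phenotype covers only patients with at least one codeine prescription, neither of which the text states explicitly.

```python
from collections import Counter

WINSOR_CAP = 8  # all-count phenotype is winsorized at 8 prescriptions (from the text)


def count_codeine(records, keyword="codeine"):
    """Count codeine prescriptions per patient with a case-insensitive keyword match.

    `records` is an iterable of (patient_id, medication_name) pairs; this layout
    is hypothetical and is not the MGB Biobank schema.
    """
    counts = Counter()
    for patient_id, med_name in records:
        if keyword in med_name.lower():
            counts[patient_id] += 1
    return counts


def build_phenotypes(codeine_counts, opioid_free_ids):
    """Return {phenotype_name: {patient_id: value}} for the 8 phenotypes."""
    controls = {p: 0 for p in opioid_free_ids}  # patients with no opioid prescriptions
    phenos = {}
    # Low-count phenotypes (assumed to mean exactly k prescriptions) vs. controls.
    for k in (1, 2, 3):
        cases = {p: 1 for p, c in codeine_counts.items() if c == k}
        phenos[f"low_eq_{k}"] = {**controls, **cases}
    # High-count phenotypes: k or more prescriptions vs. the same controls.
    for k in (4, 5, 6, 7):
        cases = {p: 1 for p, c in codeine_counts.items() if c >= k}
        phenos[f"high_ge_{k}"] = {**controls, **cases}
    # All-count phenotype: integer count capped at WINSOR_CAP (assumed cases only).
    phenos["all_count"] = {p: min(c, WINSOR_CAP) for p, c in codeine_counts.items()}
    return phenos
```

The binary phenotypes would feed a logistic model and the capped count a linear model, matching the regression choices described later in the Methods.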

Genotyping data and quality control

Genotyping was performed by the MGB Biobank team. Prior to imputation, standard GWAS quality control (QC) procedures were carried out. These included: (1) sample-level QC, in which samples with discrepant reported and predicted sex or high missing rates were excluded; and (2) variant-level QC, in which variants with invalid alleles, variants with allele mismatches relative to the reference panel, SNPs not found in the reference panel, duplicated or monomorphic variants, indels (insertions and deletions), and variants with a low call rate (less than 90%) were excluded. Imputation was performed using the Michigan Imputation Server with the 1000 Genomes panel, and haplotype phasing was performed using SHAPEIT34,35,36.
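As a rough illustration of the variant-level call-rate filter described above, the sketch below computes each variant's call rate from a small genotype table and drops variants below 90%. The data layout (`None` for a missing call) and the function names are assumptions made for illustration; the actual QC was carried out with standard GWAS tooling, not with this code.

```python
MIN_VARIANT_CALL_RATE = 0.90  # variants below this call rate were excluded (from the text)


def call_rate(calls):
    """Fraction of non-missing genotype calls; None marks a missing call (assumed encoding)."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def filter_variants(genotypes, min_rate=MIN_VARIANT_CALL_RATE):
    """Keep variants whose call rate is at least `min_rate`.

    `genotypes` maps a variant ID to its list of per-sample calls (0, 1, 2 or None).
    """
    return {vid: calls for vid, calls in genotypes.items() if call_rate(calls) >= min_rate}
```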

Post-imputation quality control was conducted to select high-quality SNPs and control for population stratification and family structure. Relatedness within the cohort was assessed by pairwise IBD estimation, filtered by pi-hat (1 for 100% identical by descent [IBD], 0.5 for 50%, 0.25 for 25%), using PLINK to estimate the probability of sharing 0, 1, or 2 alleles IBD for any two individuals from the study population. Only autosomal biallelic SNPs with minor allele frequencies (MAF) of at least 1%, an info score above 0.8 and call rates above 98% were retained, yielding ~5 million SNPs. A principal components analysis was applied to a linkage-disequilibrium-pruned set of genotyped SNPs to characterize population structure within samples from included individuals.
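The post-imputation filters and the pi-hat statistic can be written down compactly. The sketch below is illustrative only: the `Variant` record and function names are hypothetical, and the thresholds are the ones stated above. The pi-hat formula, P(IBD=2) + 0.5 × P(IBD=1), is the definition PLINK uses; the text does not give the pi-hat cutoff used for exclusion, so none is applied here.

```python
from dataclasses import dataclass

AUTOSOMES = {str(i) for i in range(1, 23)}
BASES = {"A", "C", "G", "T"}


@dataclass
class Variant:
    """Hypothetical per-variant summary record."""
    rsid: str
    chrom: str
    ref: str
    alt: str
    maf: float
    info: float
    call_rate: float


def passes_post_imputation_qc(v, min_maf=0.01, min_info=0.8, min_call_rate=0.98):
    """Autosomal biallelic SNP with MAF >= 1%, info > 0.8 and call rate > 98%."""
    is_snp = v.ref in BASES and v.alt in BASES  # single-base alleles only, so indels fail
    return (
        v.chrom in AUTOSOMES
        and is_snp
        and v.maf >= min_maf
        and v.info > min_info
        and v.call_rate > min_call_rate
    )


def pi_hat(z1, z2):
    """Proportion of the genome shared IBD, from P(IBD=1)=z1 and P(IBD=2)=z2."""
    return z2 + 0.5 * z1
```

For example, a parent-child pair shares exactly one allele IBD everywhere (z1 = 1), giving a pi-hat of 0.5, while identical samples (z2 = 1) give 1.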

Genome-wide association and gene-level analysis

We used PLINK 2.0 to conduct the genome-wide association analysis for each codeine prescription phenotype, using linear regression for continuous phenotypes and logistic regression for binary phenotypes37. All association analyses were adjusted for age, sex and the top 5 principal components. We used functional mapping and annotation (FUMA) and multi-marker analysis of genomic annotation (MAGMA) to conduct gene-based tests and pathway analysis38,39. A standard genome-wide significance threshold of p < 5 × 10⁻⁸ was chosen for SNP identification and r² = 0.6 was set as the cutoff for independent significant SNPs. The maximum distance of linkage disequilibrium (LD) blocks to merge was 250 kb. All Manhattan plots were generated by FUMA.
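To make the locus-definition step more concrete, the sketch below applies the genome-wide threshold and then merges nearby hits into loci using the 250 kb distance. It is a deliberate simplification and not FUMA's algorithm: FUMA first defines LD blocks from a reference panel (using the r² cutoff) and merges those blocks, whereas this sketch treats each significant SNP position as its own block. The input tuple layout and function names are hypothetical.

```python
from collections import defaultdict

GENOME_WIDE_P = 5e-8          # significance threshold (from the text)
MERGE_DISTANCE_BP = 250_000   # maximum distance between blocks to merge (from the text)


def _summarise(chrom, snps):
    """Describe one locus; `snps` is a position-sorted list of (pos, p, rsid)."""
    lead = min(snps, key=lambda s: s[1])  # smallest p-value is the lead SNP
    return {
        "chrom": chrom,
        "start": snps[0][0],
        "end": snps[-1][0],
        "lead_snp": lead[2],
        "lead_p": lead[1],
        "n_snps": len(snps),
    }


def merge_into_loci(hits, p_threshold=GENOME_WIDE_P, max_gap=MERGE_DISTANCE_BP):
    """Group significant SNPs into loci.

    `hits` is an iterable of (rsid, chrom, pos, p). SNPs on the same chromosome
    whose positions are within `max_gap` of the previous significant SNP are
    placed in the same locus (position-based stand-in for LD-block merging).
    """
    by_chrom = defaultdict(list)
    for rsid, chrom, pos, p in hits:
        if p < p_threshold:
            by_chrom[chrom].append((pos, p, rsid))

    loci = []
    for chrom, snps in by_chrom.items():
        snps.sort()
        current = [snps[0]]
        for snp in snps[1:]:
            if snp[0] - current[-1][0] <= max_gap:
                current.append(snp)
            else:
                loci.append(_summarise(chrom, current))
                current = [snp]
        loci.append(_summarise(chrom, current))
    return loci
```

Counting the returned loci per phenotype is one simple way to compare how many independent signals each prescription-count definition yields.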

Disease polygenic risk score and correlation analysis

Summary statistics for multiple disease traits were obtained from two external data resources: (1) the Psychiatric Genomics Consortium (PGC)40; and (2) the United Kingdom Biobank, using the Pan-UK Biobank resource developed by a team from the Analytical and Translational Genetic Unit (ATGU) of Massachusetts General Hospital and the Broad Institute of the Massachusetts Institute of Technology and Harvard41,42. We selected three categories of phenotypes for PRS development based on clinicians’ suggestions: (1) opioid use disorder and alcohol dependence; (2) brain and mental health phenotypes (Alzheimer’s dementia and attention deficit hyperactivity disorder); and (3) other phenotypes (hyperhidrosis, standing height, ECG heart rate, glaucoma and diabetic hypoglycemia), which serve as negative controls for PRS. With these external summary statistic datasets, we used PRS-CS43, a Python tool that uses a Bayesian regression framework to output optimized SNP effect sizes representing these diseases. The default parameters of PRS-CS were used for the analysis, and 830,461 SNPs from the 1000 Genomes reference panel were used for PRS construction. We then developed patient-level polygenic risk scores among MGB patients for the nine conditions, including positive (i.e., OUD) and negative (e.g., hyperhidrosis) controls. Finally, we calculated Kendall correlations between disease polygenic risk scores and codeine prescription count in the MGB patient population.
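The scoring and correlation steps can be illustrated as follows. A patient-level polygenic score is conventionally the sum of each SNP's effect size multiplied by the patient's allele dosage; the sketch assumes the effect sizes have already been estimated (for example by PRS-CS) and does not reproduce that estimation. The Kendall function implements the tau-b variant, which handles the many ties in prescription counts; the text does not say which variant was used, so this is an assumption. All names are hypothetical.

```python
import math


def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages.

    `dosages` maps rsID -> effect-allele dosage (0 to 2) for one patient;
    `weights` maps rsID -> effect size. SNPs missing from `dosages` are skipped.
    """
    return sum(w * dosages[rsid] for rsid, w in weights.items() if rsid in dosages)


def kendall_tau_b(x, y):
    """Kendall's tau-b by direct pair counting (O(n^2), for illustration only)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: contributes to neither term
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x_only) * (concordant + discordant + ties_y_only)
    )
    return (concordant - discordant) / denom if denom else float("nan")
```

Pair counting is quadratic in the number of patients, so for a cohort of tens of thousands a library routine such as `scipy.stats.kendalltau` would be the practical choice.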

Results

EHR-derived codeine prescription count phenotypes

Using ~16 million medication records in the MGB clinical database, we identified 8,639 patients with codeine prescriptions from 2010 to 2020, with approximately 700 to 1500 patients per year (Supplementary Fig. 1). We developed multiple codeine count measures based on the number of separate codeine prescriptions per patient (summarized in Table 1). We then used these measurements to develop 8 phenotypes for genome-wide association analyses. We used linear regression to capture all count distribution patterns with a continuous measure, while logistic regression was applied to the low-count and high-count patterns. We observed a relatively older mean age in the high-count group (patients with four or more codeine prescriptions) than in the low-count group. High-count patients also had more diagnoses and clinical encounters in their EHR records (Supplementary Table 1).

Table 1 Summary of codeine prescription count phenotypes.

Genome-wide association analysis

Setting the p-value threshold at 5 × 10⁻⁸, we identified 9 significant genomic risk loci from the all-count phenotype (Fig. 2; Table 2 and Supplementary Fig. 2). The most significant lead SNP was rs2902921 (p = 6.44 × 10⁻¹⁹), an intergenic SNP on chromosome 4. The nine loci comprised two loci on chromosome 1 (rs709286 and rs11164801; THRAP3, SH3D21, EVA1B, RP11-268J15.5, STK40, LSM10, EVI5, RPL5 and FAM69A), one locus on chromosome 2 (rs11680325; CYP1B1), one locus on chromosome 3 (rs375170584), one locus on chromosome 4 (rs2902921), one locus on chromosome 5 (rs55905691; TSLP and WDR36), one locus on chromosome 11 (rs364139), one locus on chromosome 14 (rs2093210; C14orf39) and one locus on chromosome 17 (rs12453884; TAOK1, ABHD15 and TP53I13).

Fig. 2. Manhattan plots for GWAS of all prescription count phenotypes.

Table 2 Summary of identified significant SNPs.

Varying numbers of significant genomic loci were identified from the low- and high-count phenotypes (Fig. 2; Table 2, and Supplementary Fig. 3). Two significant lead SNPs identified by the three low-count measures (rs13103207 and rs78121242) were in the same LD region as rs2902921 (both r² > 0.1), which was identified from the all-count phenotype. High-count phenotypes generally showed genetic associations more similar to those of the all-count phenotype. Thresholds of 4 or more and 5 or more prescriptions identified 7 and 8 significant genomic loci, respectively, all of which were shared with the all-count phenotype. Fewer loci were identified with 6 or more and 7 or more prescriptions (5 loci and 3 loci, respectively), although they still overlapped with loci from the all-count phenotype.

Mapped genes and related functions

Using a window of ±10 kb around the identified genomic loci, we identified genes that could be related to the regulatory functions of these variants (summarized in Table 3). Sixteen related genes were found for the all-count phenotype, while 2 to 14 genes were found for the high-count prescription phenotypes. As more extreme prescription ranges were applied, fewer mapped genes were found, corresponding to fewer significant loci from GWAS. Two genes remained across all phenotypes: CYP1B1 and C14orf39. CYP1B1 is a member of the cytochrome P450 superfamily of enzymes, one of the major enzyme families for drug metabolism44. C14orf39, also known as Six6os1, has been related to primary ovarian insufficiency45.
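The positional mapping described above amounts to an interval-overlap test: a gene is mapped to a locus when it lies on the same chromosome and overlaps the locus extended by 10 kb on each side. The sketch below shows that test under stated assumptions; the tuple layouts and function name are hypothetical, and the mapping in the study was produced by FUMA rather than by this code.

```python
WINDOW_BP = 10_000  # two-sided mapping window (from the text)


def map_genes(loci, genes, window=WINDOW_BP):
    """Map genes to loci by positional overlap.

    `loci` is a list of (locus_id, chrom, start, end) and `genes` a list of
    (symbol, chrom, start, end), all in the same genome build (assumed).
    Returns {locus_id: sorted list of gene symbols}.
    """
    mapped = {}
    for locus_id, l_chrom, l_start, l_end in loci:
        lo, hi = l_start - window, l_end + window
        mapped[locus_id] = sorted(
            symbol
            for symbol, g_chrom, g_start, g_end in genes
            # Two intervals overlap when each starts before the other ends.
            if g_chrom == l_chrom and g_start <= hi and g_end >= lo
        )
    return mapped
```

Intersecting the mapped gene sets across phenotypes is then a plain set operation, which is how a shared core such as the two genes retained across all phenotypes could be read off.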

Table 3 Summary of mapped genes.

Comparison between our study and previous opioid genetic studies

We compared our results with previously published opioid-related GWAS (Supplementary Table 2)14,15,16. Of the SNPs previously reported, rs9291211 was associated with opioid use in patients of European ancestry. In our sample, rs9291211 showed weak associations of varying strength with the different codeine prescription phenotypes, with the strongest signal for 6 or more prescriptions (p = 0.00017677), followed by 7 or more prescriptions (p = 0.00035105). Three other reported SNPs, rs1989903 (opioid use disorder), rs12130499 (opioid dependence), and rs7188250 (opioid use disorder), also showed weak association p-values in our samples.

Polygenic risk score correlation analysis

We downloaded summary statistics for nine separate conditions from the Psychiatric Genomics Consortium (PGC) and the Pan-UK Biobank and developed patient-level disease polygenic risk scores for these conditions in the MGB study cohort15,46,47,48,49. Among them, codeine prescription count was significantly correlated (Tau = 0.67, p = 0.0127) only with the polygenic risk score for OUD (Table 4).

Table 4 Correlations of disease polygenic risk score and codeine prescription count.

Discussion

The availability of genomic and clinical data in large data repositories, including Electronic Medical Records and Genomics (eMERGE) and UK Biobank42,50, has enabled researchers to perform more powerful genome-phenome association studies. The All of Us Research Program initiated by the NIH51, with clinical and genomic data expected from 1 million individuals, represents a new era of integrated big data consortia that has the potential to advance precision medicine research to a higher level. Through these studies, information from both the genomic and clinical perspectives can be fully integrated into association models to generate more comprehensive descriptions of disease status. We applied these methods to a large clinical biobank to assess relationships with codeine prescription number, as a test case for opioids, and found 9 loci with strong associations with a high count of codeine prescriptions.

Clinically meaningful phenotypes are critical for disease-oriented genetic research, especially for complex clinical conditions, such as chronic diseases or diseases with complicated prescriptions52,53. An accurate and generalizable phenotyping approach could improve the chance of identifying related genetic markers54,55. This is particularly true for disease phenotypes that are challenging to develop, such as phenotypes related to diagnoses of substance dependence and substance use disorder. Because these conditions are sensitive and complex, with no simple diagnostic test, this type of diagnostic information is generally difficult to obtain and hence missing for large numbers of subjects56. Reliance on administrative codes is also problematic; early cases will tend to be missed and diagnoses may be biased by physician factors. Lack of documentation of substance use disorder in patient records also creates a significant limitation for conducting large-scale genetic studies and replications.

Recent studies have suggested that medication use can serve as a useful phenotyping method for exploring the genetic basis of medication-related diseases and conditions. Genetic susceptibility to common diseases can be associated with traits of taking relevant medications26. This reverse causality approach provides a useful way to examine disease etiologies by investigating the genetic basis of patients who receive certain medications.

Prescription records can be easily retrieved from EHR databases in large patient populations with high fidelity because prescribing is invariably a core function of EHRs. Using prescription data, related disease traits can be developed. This approach provides phenotypes that can supplement diagnosis-based phenotypes with several unique advantages: (1) for diseases with diagnostic records that are difficult to obtain (e.g., time-consuming or hard to access), relevant medication use can serve as a much easier indirect phenotyping method; (2) for chronic diseases with multiple progression stages, diagnosis-based traits might miss patients with early or subclinical conditions, while medication-based traits could capture a broad range of patients at early stages with less extreme conditions; (3) medication-based traits can be developed in either a continuous or a categorical manner; for example, prescription numbers can serve as a numerical measurement that could reflect different risk levels in patient populations; and (4) large phenotype groups can be created based on prescription records to gain more power for genome-wide association analyses. During the phenotyping process, multiple prescription-related variables (medication type, count, dosage, duration, etc.) can be used to assemble phenotypes with different levels of granularity.

In this study, we explored the feasibility of conducting opioid-related genetic research using patients’ prescription records. We selected codeine, an opiate with known heterogeneous metabolism between individuals, to capture a patient population with different levels of risk of adverse opioid-related outcomes. We also used prescription count to develop target phenotypes, which requires no granular prescription information such as dosage. With this design, we aimed to test our phenotyping pipeline in a baseline setting with high generalizability.

Multiple prescription count measures were used to capture different patterns of codeine exposure. Previous studies have demonstrated that patients with a high count of opioid prescriptions tend to have long-term use and addiction57, suggesting an association between opioid prescription pattern/intensity and levels of future opioid use disorder risk. Considering this finding, we aimed to explore the potential genetic components of this association. Since this association might not be linear, we developed various prescription pattern measures to guide the genetic analysis.

Based on these prescription measures, we observed a count-dependent genotype-phenotype pattern, with higher prescription number phenotypes associated with stronger genetic signals. Substantial overlap was also identified across all phenotypes, suggesting a common genetic component among all prescribing counts. In our findings, lower numbers of prescriptions (1, 2, and 3) showed much weaker signals than higher numbers. When patient populations with a greater number of prescriptions (> 6) were selected, we observed a potentially more specific genetic association, with a smaller number of significant SNPs. This pattern is consistent with the gene-level analysis, in which only two genes remained in the 7 or more phenotype. Both genes were concordant with disease mechanisms from previous studies. CYP1B1, a gene coding for a major drug-metabolizing enzyme, could be particularly relevant to opioid drug responses. Consistent with our finding, a recent study showed an association between CYP1B1 and EHR-derived opioid response58.

We used two approaches to validate our findings. First, we checked previously reported opioid use-associated SNPs in our results and identified weak association p-values for several SNPs. Second, we examined the correlations between codeine prescription number and multiple clinical diseases/conditions using polygenic risk scores derived from independent summary statistics. The polygenic risk of opioid use disorder was significantly correlated with the observed number of codeine prescriptions, validating that this risk score is specifically associated with an expected phenotype. Accordingly, a higher mean PRS was observed in the high-count population (four or more codeine prescriptions) than in the low-count population (three or fewer codeine prescriptions). According to the previous literature, mental disorders are common among patients with opioid use disorder59. Furthermore, there is a positive association between mental disorders and opioid prescriptions60. Opioid use can also be related to other substance use disorders61, suggesting that a broad range of addiction and psychiatric conditions could also be linked to opioid prescription patterns using PRS methods in clinical practice.

Limitations

This study has several limitations. First, we selected only one common opioid, codeine, for phenotype development. Other major opioids were not included in the current study, which limits the patient population we investigated. Compared with more potent opioids (e.g., hydrocodone and oxycodone), codeine is considered a relatively weak opioid62, with a morphine milligram equivalent 10% of that of oxycodone. However, the population we captured might better reflect an early-stage at-risk population. Another reason for choosing codeine is its metabolism: codeine is one of the opioids with clinically actionable gene variants supported by international guidelines on drug dosing alterations, making it an interesting research target63. As a proof-of-concept study, we showed the feasibility of our phenotyping pipeline in this population for opioid-related genetic studies and validated our findings.

Second, the medication use phenotype, compared with diagnosis-based traits, is an indirect approach to identifying the at-risk population. The patient population captured using this approach could be more heterogeneous, with a broader spectrum of disease progression status, which can create heterogeneity in the associated genetic signals. At the same time, the specificity and sensitivity of this phenotyping system can be adjusted by using different cut-off thresholds. In our study, by testing phenotyping criteria of differing stringency, i.e., the number of prescriptions, we observed a codeine count-dependent pattern of genetic hits. This provides the potential to calibrate optimal phenotyping thresholds to serve genetic studies with different purposes. For example, researchers can use this method to investigate phenotypes with different sensitivities or specificities for targeted diseases or conditions.

Third, the phenotypes we created in the current study focused only on prescription count (the number of records in the EHR database). We did not include dosage or quantity information, which is another important component of prescription decision-making, and we were unable to incorporate prescriptions received outside of the MGB hospital system or prescriptions that were written but not filled. Further, the duration of codeine prescriptions was not incorporated in the current phenotyping pipeline because of the lack of fine-grained prescription time/duration information and more complete medical history records. Because these prescription variables require more complete EHR datasets, which could be lacking in many current biobank data depositories, prescription count may be a more generalizable phenotyping method across databases.

As a next step, we will incorporate other opioids, standardized opioid dosage, and prescription duration information in future studies to build a more advanced phenotyping pipeline. We will further incorporate other medical records, including prescription records for the outpatient setting (drug monitoring program), patient medical history (e.g., psychiatric comorbidities) and co-prescription records (e.g., stimulant prescriptions). We will also explore and identify optimal risk thresholds and the uncertainties or confidence intervals of the PRS. With that, we will develop a PRS tool to predict patient-level or population-level risk of opioid use disorder.

Conclusion

We utilized patient-level medication data from a large clinical biobank to develop codeine prescription number phenotypes for genetic research. We observed a pattern of prescription count-dependent genomic signals, suggesting that medication prescription-based phenotypes could be used to capture populations at various levels of opioid-related risk in genetic studies. Our results provide a novel and generalizable phenotyping framework for opioid-related genetic research.