Introduction

Sickle cell disease (SCD) is a significant public health problem in the United States as well as globally. An estimated 100,000 patients with sickle cell disease live in the United States. These patients face a life expectancy reduced by about 30 years and decreased quality of life as a result of the multisystem and long-term effects of this devastating disease1. Sickle cell anemia (SCA) is caused by homozygosity of hemoglobin S, a result of a missense mutation in the β-globin gene that substitutes valine for glutamic acid at the sixth amino acid in the β-globin hemoglobin chain1. Among children with the same disease-causing mutation, clinical phenotypes vary1. Common presentations include acute pain events such as vaso-occlusive episodes (VOE) and dactylitis, which confer significant morbidity and are associated with lower quality of life across several domains2. Another acute presentation of SCD is acute chest syndrome (ACS), defined as fever and respiratory symptoms with new infiltrate on chest radiograph, which is associated with high risk of morbidity and mortality3. Some patients will never have an episode of ACS, while others will have frequent episodes4. This variability is likely due to genetic and non-genetic factors, of which current knowledge is limited.

A few protective factors have been identified, including certain subtypes of SCD and higher levels of fetal hemoglobin (HbF)5. Higher HbF levels in erythrocytes decrease the hemoglobin polymerization that leads to erythrocyte “sickling” and complications of SCD6. Higher HbF levels have been associated with decreased rates of ACS and VOE, although rates of priapism, stroke and other complications are minimally affected7. To investigate the genetic mechanism of HbF levels, multiple genome-wide association studies (GWAS) have been conducted, which are instrumental in discovering genetic associations with a wide variety of clinical conditions, including metabolic, autoimmune, and psychiatric diseases8. Uda et al. discovered that BCL11A is associated with persistent HbF levels and amelioration of the phenotype of β-thalassemia9. Galarneau et al. further reported that polymorphisms at three loci, BCL11A, HBS1L-MYB, and β-globin, account for approximately 50% of the variation in HbF levels10. However, studies to identify underlying genetic drivers of other clinical features such as ACS and vaso-occlusive crises have had limited results11,12.

In this study, a GWAS was performed on a cohort of patients with hemoglobin SS (HbSS) followed by the Children’s Hospital of Philadelphia Comprehensive Sickle Cell Center. To better understand the genetic basis of this disease, we planned to study the entire genome in search of new candidate genes conferring increased risk of clinical features of SCA. Our aims were to 1) determine whether variations in the human genome associate with clinical features including ACS, VOE and HbF in children with HbSS, and 2) explore the potential biological mechanisms underlying ACS by identification of genomic variants.

Methods

Study participants

Our study population consists exclusively of patients with HbSS. We conducted a genome-wide association study (GWAS) to investigate the associations between genotype and various clinical phenotypes, including acute chest syndrome (ACS), vaso-occlusive episodes (VOE), hemoglobin levels, pain, and hospital admissions.

Subjects with HbSS were identified from the biorepository at the Center for Applied Genomics (CAG) at the Children’s Hospital of Philadelphia. DNA samples of participants were collected at recruitment. The research was performed in accordance with the Declaration of Helsinki. The Children’s Hospital of Philadelphia Institutional Review Board approved the study (IRB 16-013,278). Written consent was obtained from the legal guardian(s) of each participant and assent from any child 7 years and older.

Inclusion criteria were at least one encounter at the Children’s Hospital of Philadelphia and documentation of HbSS. Phenotyping was performed by manual chart review and analysis of electronic health record (EHR) data from January 2007 to December 2020, since the current EHR system was not routinely used prior to 2007. Analysis of clinical data was performed using R. Clinical factors are listed in Table 1. ACS was defined as fever and/or respiratory symptoms with a new infiltrate on chest radiograph. Vaso-occlusive episodes were defined as acute pain without alternative identified cause, e.g., trauma or constipation. Manual chart review identified episodes of ACS, VOE, and dactylitis requiring an emergency room (ER) visit or hospital admission. The number of ACS episodes per patient was compared to the number of EHR codes for ACS and pneumonia, using International Classification of Diseases 9th (ICD9) and 10th (ICD10) revision codes, and outliers were manually reviewed. Hemoglobin F (HbF) levels measured within 120 days following blood transfusion or following the initiation of hydroxyurea, which is known to increase HbF expression5, were excluded. HbF levels stabilize between the ages of 3 and 5 years13; thus, the most recent HbF level was used for analysis to best represent baseline HbF.
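The HbF selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function name, the input layout, and the reading that every value measured on or after hydroxyurea initiation is excluded (rather than only values within 120 days of initiation) are all assumptions.

```python
from datetime import date, timedelta

TRANSFUSION_WASHOUT = timedelta(days=120)  # exclusion window stated in the text


def baseline_hbf(measurements, transfusion_dates, hydroxyurea_start=None):
    """Return the most recent eligible HbF value (%), or None if none qualify.

    measurements: list of (date, hbf_percent) tuples (hypothetical layout).
    transfusion_dates: list of dates on which a transfusion was given.
    hydroxyurea_start: date of hydroxyurea initiation, or None if never started.
    """
    eligible = []
    for measured_on, hbf in measurements:
        # Exclude values drawn within 120 days after any transfusion.
        recently_transfused = any(
            timedelta(0) <= measured_on - given_on <= TRANSFUSION_WASHOUT
            for given_on in transfusion_dates
        )
        # Assumption: exclude every value drawn on or after hydroxyurea start.
        on_hydroxyurea = (
            hydroxyurea_start is not None and measured_on >= hydroxyurea_start
        )
        if not recently_transfused and not on_hydroxyurea:
            eligible.append((measured_on, hbf))
    if not eligible:
        return None
    # The most recent eligible measurement represents baseline HbF.
    return max(eligible, key=lambda pair: pair[0])[1]
```

A subject with no eligible measurement returns None and would be left out of the HbF analysis, which is consistent with HbF being available for only a subset of the cohort.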

Table 1 Demographic and clinical characteristics of the study cohort.

Genotyping, imputation, and association analysis

The discovery and replication cohorts were genotyped using the Illumina Infinium Global Screening Array (GSA) and HumanHap550/610 single nucleotide polymorphism (SNP) arrays, respectively. EIGENSTRAT was used to detect potential substructures and outliers14. Self-reported African ancestry was further confirmed by comparing principal component analysis results of participants and reference populations from HapMap3 (Figure S1). Samples with a chip-wide genotyping failure rate greater than 5% were excluded. SNP markers with minor allele frequencies less than 1%, genotyping failure rates greater than 2%, or Hardy–Weinberg P values less than 1 × 10^-6 were removed before genotype imputation. Pairwise identity-by-descent values were calculated by PLINK to remove cryptic relatedness and duplicated samples15. Genotype imputation was performed with the TOPMed Imputation Server using the minimac4 imputation algorithm16. The whole genome sequencing data from the Trans-Omics for Precision Medicine (TOPMed) program were used as the imputation reference panel, which achieved a significant improvement in imputation quality and accuracy for variants from African populations17. Common variants (minor allele frequencies > 1%) with high imputation confidence (Rsq > 0.3) were retained for association analysis. Association analyses were performed using logistic regression with an additive model on the imputed dosage of the effect allele while adjusting for sex and the first five PCs. Meta-analysis was performed by PLINK. Fixed-effects P values were reported. Although no significant genomic inflation was detected (Figure S2), we calculated the genomic inflation factor (λ) to ensure the accuracy of our findings. Genomic control correction was applied by dividing each observed chi-squared value by λ and recalculating the P-values from these corrected chi-squared values. This adjustment helps to reduce false-positive associations by correcting for any systematic bias in the data.
The corrected P-values are included in Table 2 and were used in all subsequent analyses. To enhance the reproducibility and transparency of our research, we have deposited the summary statistics of our results in a public database. The data can be accessed at the following link: https://zenodo.org/records/11948788.
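The genomic control step described above can be illustrated with a short Python sketch. It assumes 1-degree-of-freedom chi-squared statistics and the usual definition of λ as the observed median statistic divided by the median of the null distribution (about 0.4549); the function names are hypothetical and this is not the pipeline used in the study.

```python
import math
from statistics import median

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def chi2_1df_sf(x):
    """Upper-tail P value of a 1-df chi-squared statistic."""
    # For 1 df, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


def genomic_control(chi2_values):
    """Return (lambda, corrected P values) for 1-df chi-squared statistics."""
    # Assumed definition: lambda = observed median / expected null median.
    lam = median(chi2_values) / CHI2_1DF_MEDIAN
    # Divide each statistic by lambda, then recompute its P value.
    corrected = [chi2_1df_sf(x / lam) for x in chi2_values]
    return lam, corrected
```

When λ is close to 1, as reported here, the corrected P-values differ only slightly from the uncorrected ones.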

Gene-based association and enrichment analysis

We performed gene-based association analysis using MAGMA (Multi-marker Analysis of GenoMic Annotation), a tool designed for integrating GWAS data to identify genes associated with disease traits18. This approach accounts for gene size, linkage disequilibrium patterns, and SNP localization within genes, providing a comprehensive assessment of gene-level associations from our GWAS summary statistics. Subsequently, genes with significant associations (P < 0.01) were selected for further analysis. To elucidate the biological implications of these significant genes, we conducted enrichment analysis using DAVID (Database for Annotation, Visualization and Integrated Discovery)19. This step allowed us to identify enriched biological processes, pathways, and functional categories, offering deeper insights into the biological mechanisms potentially driving the disease pathogenesis and highlighting possible therapeutic targets. This combined analysis ensures a robust exploration of genetic findings and their functional relevance to the disease.

Results

Cohort characteristics are shown in Table 1. Complete data are available for 520 subjects. The median age at first EHR encounter is 1 year. EHR data are available for each patient over a median period of 11 years (min 1 year, max 14 years). Among all subjects, 38.6% have a diagnosis of asthma at any point, and 26.5% have been prescribed an inhaled corticosteroid. Hydroxyurea was prescribed at least once for 80.4% of the cohort, and 91.5% have received a blood transfusion. Six patients have undergone stem cell transplant, of whom one underwent in utero haploidentical stem cell transplantation with no evidence of engraftment.

Subjects have a mean of 0.16 episodes of ACS per year. In this regard, 306 patients have at least 1 episode of ACS. For pain episodes, subjects have a mean of 0.98 episodes per year of VOE requiring admission or ER visit. When including dactylitis with pain episodes, subjects have a mean of 1.01 episodes per year. Overall, subjects have a mean of 1.58 total admissions per year for any cause. HbF levels meeting the above criteria are available for 467 subjects. The mean HbF level is 15.25%. As previously described, there is a downtrend in HbF levels over the first 2 years of life, followed by stability.

The discovery cohort and the replication cohort included 391 and 129 individuals, respectively. GWAS analysis was performed for ACS, VOE, and pain (defined as VOE or dactylitis) episodes per year, total number of admissions per year, and HbF levels. Since no genome-wide significant signal was detected in either cohort alone, we then performed a meta-analysis combining the two studies. Table 2 lists the statistically significant loci for each clinical outcome of interest. Corresponding regional plots and Manhattan plots are shown in Fig. 1, Figure S3, and Figure S4.
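The fixed-effects combination of the two cohorts can be illustrated with an inverse-variance weighted sketch. This is a generic textbook formulation rather than the PLINK implementation itself; the function name and the (beta, standard error) input layout are assumptions made for illustration.

```python
import math


def fixed_effects_meta(estimates):
    """Combine per-cohort (beta, standard_error) pairs with inverse-variance weights.

    Returns (pooled_beta, pooled_se, two_sided_p).
    """
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Two-sided P value from the standard normal distribution.
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, p_value
```

Because the pooled standard error shrinks as cohorts are added, a variant with a consistent effect in both cohorts can reach genome-wide significance in the meta-analysis even when it does not in either cohort alone.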

Table 2 Association results of detected risk variants.
Fig. 1 Regional plots, generated using LocusZoom42, of the four genome-wide significant loci associated with clinical phenotypes of SCD that are listed in the GTEx portal. The purple diamond indicates the most significantly associated SNP, and circles represent the other SNPs in the region, with coloring from blue to red corresponding to r^2 values from 0 to 1 with the index SNP. (A) Hemoglobin F, 15q14; (B) acute chest syndrome, 15q26.1; (C) pain, 2p25.1; (D) pain, 15q26.3. All images in the figures were generated using a combination of R and PowerPoint. The data visualizations and analyses were conducted using R (version 4.3.1), which can be accessed at https://www.r-project.org/. The final assembly and labeling of the figures were completed using Microsoft PowerPoint (Microsoft 365 version).

To understand the possible functional role of significant SNPs, we queried the Broad Institute’s HaploReg v4.1 database (https://pubs.broadinstitute.org/mammals/haploreg/haploreg.php)20. HaploReg annotations indicated that the tested SNPs lie in regulatory regions of the genome, with annotations including enhancer activity, promoter histone marks, DNase I hypersensitive regions, and regulatory motifs (Table S1).

HbF levels showed a significant association with SNPs centered at two loci. As expected, the previously reported 2p16.1 locus21,22,23 demonstrated the highest significance. Two SNPs reach genome-wide significance at 2p16.1 (rs1427407, p = 8.58 × 10^-10, and rs766432, p = 6.72 × 10^-9, Table 2). Both are located within the BCL11A gene, and the linkage disequilibrium (LD) between these two SNPs is r^2 = 0.79. This strong LD suggests that these SNPs are not independent and likely represent a single signal. In addition, we found a new genome-wide significant locus at 15q14 (rs8182015, p = 2.07 × 10^-8, Table 2), which is close to the EMC7 gene. Interestingly, the GTEx Portal indicated that rs8182015 is an expression quantitative trait locus (eQTL) of other nearby genes including CHRM5 and GOLGA8A (Table 3).
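For readers less familiar with the LD measure quoted above, the squared correlation between two biallelic SNPs can be computed from haplotype and allele frequencies. The sketch below is a standard definition given for illustration only; the function name and inputs are hypothetical, and the study's own LD values were not necessarily derived this way.

```python
def ld_r2(p_ab, p_a, p_b):
    """Squared correlation (r^2) between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2.
    p_a, p_b: frequencies of allele A and allele B (each strictly between 0 and 1).
    """
    # D measures the departure of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
```

An r^2 of 1 means the two alleles always occur together, and an r^2 of 0 means they are inherited independently; a value such as 0.79 indicates that the two SNPs largely tag the same haplotype.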

Table 3 Variants associated with gene expression levels according to the GTEx portal.

For ACS, one significant locus was identified at 15q26.1. The lead SNP, rs79915189 (p = 3.70 × 10^-8, Table 2), is located 3649 bp upstream of the IDH2 gene, which encodes an isocitrate dehydrogenase expressed in mitochondria24. According to the GTEx Portal, rs79915189 is an eQTL of nearby genes including IDH2, CIB1, and ZNF774 (Table 3).

Episodes of VOE were most strongly associated with the 15q26.1 locus (rs62020555, p = 2.04 × 10^-9, Table 2). The lead SNP is located approximately 3 kb upstream of ZNF710, which encodes a zinc finger protein. Other significant loci associated with VOE rate were rs117797325 (p = 4.63 × 10^-8, Table 2), approximately 3 kb downstream of LRRK1, and rs62118798 (p = 4.27 × 10^-8, Table 2), within the ASAP2 gene. The inclusion of dactylitis episodes, a pain presentation more common in younger children, did not significantly change results. When including dactylitis in addition to VOE as a pain outcome, we found that one additional locus met the threshold for significance (rs71605708, p = 4.28 × 10^-8, Table 2). The rate of hospital admissions, a strong surrogate marker for overall morbidity, was associated only with locus 17q21.31. The lead SNP is 23 kb upstream of ETV4, a gene encoding a transcription factor.

Our gene-based association and enrichment analysis further illustrates a significant role for immune response-related pathways, particularly those involving MHC class II protein complexes, across different clinical traits in HbSS patients (Table S2). We found marked enrichment for processes associated with antigen processing and presentation, immune system regulation, and cellular transport mechanisms in traits such as HbF, ACS, pain, and VOE. For instance, traits like HbF and ACS showed a pronounced enrichment in MHC class II-related activities, which are critical for immune function. Additionally, pathways involved in the regulation of T cell activation and immune response were notably enriched, highlighting their potential contribution to the pathophysiology of these conditions.

Discussion

The clinical presentation varies significantly among patients with sickle cell disease, and the causes of this variation remain to be elucidated. Here we present a GWAS of a cohort of 520 pediatric patients with HbSS. Significant loci were identified for all outcomes of interest. HbF findings are consistent with prior studies demonstrating BCL11A as a driver of HbF levels21,22,23. BCL11A is known to decrease HbF production through direct repression of the γ-globin genes and thus has a role in the transition from fetal to adult hemoglobin production25. More recently, Esrick et al. reported on six patients treated with short hairpin RNA (shRNA) knockdown of BCL11A, none of whom had an episode of vaso-occlusive pain or ACS following treatment26. In addition, Frangoul et al. reported on CRISPR-Cas9 editing of the BCL11A enhancer27. This highlights the importance of identifying genetic drivers of severe phenotypes, which may have potential as therapeutic targets. In addition, we detected a new significant locus (15q14) associated with HbF levels. While it is not clear how this locus may influence HbF production, GTEx data indicate that genotypes of the lead SNP rs8182015 are correlated with the expression level of nearby genes, suggesting a regulatory role of this SNP. Consistent with this result, HaploReg shows that rs8182015 is located in a region of regulatory motifs, suggesting that this locus may warrant further investigation.

This study also identified a locus (15q26.1) of interest in ACS. The lead SNP rs79915189 is close to the IDH2 gene. It is also an eQTL of IDH2 in epithelial tissues (Table 3). Moreover, multiple lines of evidence suggest a potential role of IDH2 in respiratory function and lung injury28,29,30,31.

Pain episodes were most strongly correlated with a SNP on chromosome 15, located upstream of ZNF710. ZNF710 codes for a zinc finger protein that has a role in transcription regulation and has been implicated in clear cell renal cell carcinoma32, but the functional role of ZNF710 in pain episodes is not clear.

LRRK1 on chromosome 15 codes for leucine rich repeat kinase 1. Several studies have reported an association with bone disease. Homozygous mutations in the LRRK1 gene can lead to osteosclerotic metaphyseal dysplasia, a disease characterized by skeletal dysplasia and abnormal bone density distribution in the long bones, ribs, vertebrae and iliac crests33,34,35. Interestingly, vaso-occlusive pain most commonly localizes to the back, legs, or hips36,37,38. With the inclusion of dactylitis as a pain outcome, a locus associated with SLIT3 met the significance threshold. SLIT3 may also play a role in skeletal homeostasis. Studies in mice have found that SLIT3 knockout animals had decreased long bone ossification and length39 and impaired fracture healing due to reduced skeletal angiogenesis40.

Complications and complication rates vary significantly among patients with SCD, even among those with the same hemoglobin genotype. The identified genetic variants have significant potential to impact clinical management, risk stratification, and personalized treatment for SCD. As demonstrated in prior BCL11A studies, understanding the genetic regulation of HbF levels by genes such as BCL11A can inform targeted therapies to increase HbF production, thereby reducing the severity of SCD symptoms. Therapies that boost HbF levels could be particularly beneficial for patients with genetic variants associated with lower HbF production, offering a tailored approach to reducing disease complications. The association of ACS with loci near IDH2 suggests potential pathways for therapeutic intervention in respiratory complications of SCD. Targeting these pathways could mitigate the frequency and severity of ACS episodes, providing a focused strategy for managing this life-threatening complication. Furthermore, identifying loci linked to pain episodes offers valuable insights into the underlying mechanisms of pain in SCD, potentially leading to the development of new pain management and prevention strategies tailored to the genetic profiles of individual patients, enhancing their quality of life. Incorporating these genetic findings into clinical practice could significantly enhance risk assessment and prognosis for patients with SCD. Patients with genetic variants associated with higher HbF levels might have a more favorable prognosis. Conversely, those with variants linked to severe ACS or frequent pain episodes could be identified as high-risk and receive closer monitoring and more aggressive interventions. This personalized approach to treatment based on genetic profiling represents a significant advancement in the management of SCD, potentially improving outcomes and reducing morbidity.

This study has limitations. Being a single-center study conducted at an urban academic center, the generalizability of our findings is limited. To enhance generalizability and identify additional clinically relevant loci, future research should involve multicenter studies with larger and more diverse populations. Additionally, we lacked data on environmental factors, which may contribute to the risk of acute complications in sickle cell disease. Including such data in future studies will help provide a more comprehensive understanding of disease variability. Acute chest syndrome (ACS) and vaso-occlusive episodes (VOE) present further challenges due to the lack of objective diagnostic criteria and the involvement of multiple pathophysiologic processes, which may limit the ability to detect significant loci. Developing and validating more objective diagnostic criteria for ACS and VOE will standardize outcomes and improve the accuracy of future studies. In addition, approximately 9 months of the study period overlapped with the COVID-19 pandemic, during which respiratory viruses, a known trigger of ACS3, dropped precipitously among children41. However, this represented a minority of the study period, thus SCD complication rates were unlikely to be significantly affected. Another limitation of our study is the absence of colocalisation analysis between GWAS and eQTL data. This decision was due to the population differences: our GWAS data is from African populations, while the GTEx eQTL data is primarily European. Conducting colocalisation under these conditions could introduce biases and affect result validity. Future research should aim to use population-matched eQTL datasets to enable accurate colocalisation analyses.

In conclusion, our GWAS on over 500 pediatric patients with HbSS has provided valuable insights into HbF levels and identified several loci associated with various clinical phenotypes in SCD. These genetic variants have important clinical implications, offering opportunities to enhance clinical management, risk assessment, and therapeutic interventions. Understanding the genetic regulation of HbF levels can guide therapies to increase HbF production, reducing SCD severity. Loci associated with acute chest syndrome and pain episodes offer pathways for therapeutic interventions and personalized pain management strategies. Incorporating these findings into clinical practice can improve risk stratification and prognosis for SCD patients, advancing personalized medicine and enhancing patient outcomes. Our study also suggests that future GWAS with larger sample sizes will likely uncover more clinically relevant loci.