Abstract
Infections can lead to persistent symptoms and diseases such as shingles after varicella zoster or rheumatic fever after streptococcal infections. Similarly, severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2) infection can result in long coronavirus disease (COVID), typically manifesting as fatigue, pulmonary symptoms and cognitive dysfunction. The biological mechanisms behind long COVID remain unclear. We performed a genome-wide association study for long COVID including up to 6,450 long COVID cases and 1,093,995 population controls from 24 studies across 16 countries. We discovered an association of FOXP4 with long COVID, independent of its previously identified association with severe COVID-19. The signal was replicated in 9,500 long COVID cases and 798,835 population controls. Given the transcription factor FOXP4’s role in lung physiology and pathology, our findings highlight the importance of lung function in the pathophysiology of long COVID.
Main
The coronavirus disease 2019 (COVID-19) pandemic has led to the recognition of a new condition known as postacute sequelae of COVID-19 (PASC), post-COVID-19 condition or long COVID. The World Health Organization’s definition includes any symptoms that present typically within three months after COVID-19 and persist for at least two months1. Common symptoms include fatigue, pulmonary dysfunction, muscle and chest pain, dysautonomia and cognitive disturbances2,3,4,5,6. The incidence of long COVID varies widely, with estimates in severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2)-infected individuals ranging from 10% to 70% (ref. 7). Long COVID is more common in individuals who have been hospitalized or treated in an intensive care unit due to COVID-19 (refs. 7,8). However, long COVID can also occur in those with initially mild COVID-19 symptoms9. Moreover, several mechanisms may contribute to long COVID: alterations of the serotonin system may be related to cognitive changes10, mitochondrial mechanisms to fatigue11, and complement and platelet activation to the vascular disease observed in patients with long COVID12.
The COVID-19 Host Genetics Initiative (COVID-19 HGI) was launched to investigate host genetics in COVID-19 susceptibility, hospitalization and critical illness13,14,15,16. Its findings implicate canonical pathways involved in viral entry, mucosal airway defense and type I interferon response15,16,17,18.
To elucidate biological mechanisms behind long COVID, we conducted a genome-wide association study (GWAS) and replication in 33 cohorts across 19 countries, totaling 15,950 individuals with long COVID and 1,892,830 controls (Fig. 1).
The 24 studies contributing to the Long COVID HGI data freeze 4 served as the discovery cohorts for the GWAS meta-analyses. Each color represents a meta-analysis with specific case and control definitions. Strict case definition, long COVID after test-verified SARS-CoV-2 infection; broad case definition, long COVID after any SARS-CoV-2 infection; strict control definition, individuals that had SARS-CoV-2 but did not develop long COVID; broad control definition, population control, that is, all individuals in each study that did not meet the long COVID criteria. Effective sample sizes are shown as the size of each diamond shape, and locations of sample collection in (from left to right) North America, Europe, Middle East and Asia. For more detailed sample sizes, see Supplementary Table 11.
Results
Genetic variants in FOXP4 locus associated with long COVID
We performed a meta-analysis of 24 independent GWAS of long COVID using two case definitions and two control definitions. A strict long COVID case definition required having an earlier test-verified SARS-CoV-2 infection (strict case definition), while a broader long COVID case definition also included self-reported or clinician-diagnosed SARS-CoV-2 infection (broad case definition). The broad definition included all contributing studies, whereas the strict definition included 11 studies (Supplementary Tables 11 and 12). Controls were either population controls (broad control definition) or participants who had recovered from SARS-CoV-2 infection without long COVID (strict control definition; Fig. 1 and Supplementary Tables 11 and 12). Data were obtained from 16 countries, representing populations from six genetic ancestries. The most common symptoms in the questionnaire-based studies were fatigue, shortness of breath and problems with memory and concentration. However, there was some heterogeneity in the frequency of symptoms (Supplementary Fig. 1).
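The pooling step can be illustrated with a minimal sketch of inverse variance-weighted (IVW) fixed-effects meta-analysis, the approach named in the figure legends. The per-study estimates below are hypothetical and the function name `ivw_fixed_effects` is introduced here for illustration only; the actual pipeline is described in the Supplementary Note and may differ in detail.

```python
import math

def ivw_fixed_effects(betas, ses):
    """Pool per-study log-odds ratios with inverse variance weights."""
    weights = [1.0 / se**2 for se in ses]           # weight = 1 / variance
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    p = math.erfc(abs(z) / math.sqrt(2.0))          # two-sided normal P value
    return pooled_beta, pooled_se, z, p

# Hypothetical per-study effects for one variant (not values from this study)
study_betas = [0.50, 0.30, 0.45]
study_ses = [0.20, 0.15, 0.25]

pooled_beta, pooled_se, z, p = ivw_fixed_effects(study_betas, study_ses)
print(f"pooled OR = {math.exp(pooled_beta):.2f}, SE = {pooled_se:.3f}, P = {p:.2e}")
```

Studies with smaller standard errors dominate the pooled estimate, which is why a variant missing from the larger cohorts loses power in the meta-analysis.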
The GWAS meta-analysis using the strict case definition (n = 3,018) and the broad control definition (n = 994,582) identified a genome-wide significant association within the FOXP4 locus (chr6: 41,515,652 G > C, Genome Reference Consortium Human Build 38 (GRCh38), rs9367106, as the lead variant; P = 1.8 × 10⁻¹⁰; Fig. 2 and Supplementary Table 13). The C allele at rs9367106 was associated with an increased risk of long COVID (odds ratio (OR) = 1.63, 95% confidence interval (CI) = 1.40–1.89, risk allele frequency = 4.2%). The association replicated in an independent sample from eight additional contributing cohorts with 5,226 individuals with long COVID and 260,036 population controls (P = 0.025, OR = 1.13, 95% CI = 1.02–1.25; Supplementary Fig. 3d). Furthermore, the lead variants rs9367106 and rs12660421 replicated in the VA Million Veteran Program (MVP) in the strict case analyses with the broad control definition (P = 1 × 10⁻⁴, OR = 1.21, 95% CI = 1.10–1.34, long COVID cases, n = 4,274 and controls, n = 538,799; Supplementary Fig. 3e,f) and with the strict control definition (P = 0.0018, OR = 1.17, 95% CI = 1.06–1.29, long COVID cases, n = 4,274 and controls, n = 73,739; Supplementary Fig. 3g,h).
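As a consistency check on the reported statistics, an odds ratio and its 95% CI can be converted back to a log-odds effect, a standard error and a two-sided P value. The sketch below assumes a normal approximation on the log-odds scale and a symmetric Wald interval; the helper name `summary_from_or` is hypothetical. Because the inputs are rounded, the result should only approximate the reported discovery P value.

```python
import math

def summary_from_or(odds_ratio, ci_low, ci_high):
    """Recover beta, SE, z and two-sided P from an OR and its 95% Wald CI."""
    beta = math.log(odds_ratio)
    # A 95% Wald interval spans 2 * 1.96 standard errors on the log scale
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p

# Discovery estimate for rs9367106-C as reported in the text
beta, se, z, p = summary_from_or(1.63, 1.40, 1.89)
print(f"beta = {beta:.3f}, SE = {se:.3f}, z = {z:.2f}, P = {p:.1e}")
```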
a, Manhattan plot of long COVID after test-verified SARS-CoV-2 infection (strict case definition, n = 3,018) compared to all other individuals in each dataset (population controls, broad control definition, n = 994,582). A genome-wide significant association with long COVID was found on chromosome 6, upstream of the FOXP4 gene (chr6: 41,515,652 G:C, GRCh38, rs9367106, as the lead variant; P = 1.76 × 10⁻¹⁰, Bonferroni P = 7.06 × 10⁻¹⁰, increased risk with the C allele, OR = 1.63, 95% CI = 1.40–1.89). Horizontal lines indicate genome-wide significance thresholds for IVW meta-analysis before (P < 5 × 10⁻⁸, dashed line) and after (1.25 × 10⁻⁸) Bonferroni correction over the four long COVID meta-analyses (INCMNSZ = MexGen-COVID Initiative). b, Chromosome 6 lead variant across the contributing studies and ancestries in GWAS meta-analyses of long COVID with strict case definition and broad control definition. Lead variant rs9367106 (solid lines) and, if missing, imputed by the variant with the highest LD with the lead variant for illustrative purposes, that is, rs12660421 (r = 0.98 in Europeans in 1000G + HGDP samples55; dotted lines). For the imputed variants, β was weighted by multiplying by the LD correlation coefficient (r = 0.98). Centre, OR; error bar, 95% CI. Genetic ancestries are marked by colors. MAF varies across ancestries, ranging from 1% to 34% (Supplementary Fig. 4). AFR, African; AMR, Admixed American; EAS, East Asian; EUR, European; UKBB, UK Biobank. (Results for the other three GWAS meta-analyses are shown in Supplementary Figs. 2 and 3a–c.)
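The two adjustments mentioned in this legend are simple arithmetic and can be sketched as follows. The constant and function names are hypothetical. The corrected P value computed from the rounded lead P differs slightly from the value in the legend, presumably because the legend used the unrounded P value.

```python
import math

N_META_ANALYSES = 4        # four long COVID meta-analyses
GENOME_WIDE_ALPHA = 5e-8   # conventional genome-wide threshold

# Bonferroni-corrected significance threshold
bonferroni_threshold = GENOME_WIDE_ALPHA / N_META_ANALYSES

# Bonferroni-adjusted P value for the lead variant (rounded input)
p_lead = 1.76e-10
p_bonferroni = min(1.0, p_lead * N_META_ANALYSES)

def ld_weighted_beta(beta_proxy, r):
    """Scale a proxy variant's effect by its LD correlation with the lead variant."""
    return beta_proxy * r

print(bonferroni_threshold, p_bonferroni, ld_weighted_beta(0.5, 0.98))
```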
We observed an association, albeit not genome-wide significant, between rs9367106-C and long COVID in all three other meta-analyses as well, including our largest meta-analysis with the broad case definition (n = 6,450) and the broad control definition (n = 1,093,995) from 24 studies (OR = 1.34, 95% CI = 1.20–1.49, P = 1.1 × 10⁻⁷; Supplementary Figs. 2 and 3). Analyses with the strict case definition (n = 2,964) and strict control definition (n = 37,935; OR = 1.30, 95% CI = 1.09–1.56, P = 3.8 × 10⁻³), and with the broad case definition (n = 6,396) and strict control definition (n = 46,208; OR = 1.16, 95% CI = 1.02–1.32, P = 0.023), further supported our findings (Supplementary Fig. 3).
To examine the consistency of the FOXP4 signal across the contributing studies, we investigated the effect in each study (Fig. 2b). Genetic variants in the meta-analysis had varying statistical power due to missingness, genotyping and imputation quality, and differences in allele frequency between populations. Therefore, the genetic variant that was present in the majority of the studies was the most statistically significant variant, not necessarily because it is the causal variant but because it had the best statistical power. We, therefore, examined the effect size of variants within 30 kb of the lead variant (rs9367106, r² > 0.01 in individuals of European ancestry in the Human Genome Diversity Project19 and 1000 Genomes Project20,21) that had an effective sample size of at least one-third that of the lead variant. Through this analysis, we identified a haplotype spanning the genomic region chr6:41,512,355–41,537,458 located upstream of the FOXP4 gene (Fig. 3d), for which variants had P values less than 5 × 10⁻⁷ (Fig. 3a) and effect sizes similar to the lead variant across ancestries (Fig. 3b,c). This analysis identified 15 variants (Supplementary Table 14). Relying on linkage disequilibrium (LD) in the 1000 Genomes Project across African, East Asian, European, admixed American and South Asian populations, we found 18 variants cosegregating with the lead variant with tightest LD at the end of the haplotype (r² > 0.5; Supplementary Table 15). Nine variants overlapped between these two analyses.
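The effective sample size filter can be sketched as below. We assume the commonly used case-control formula N_eff = 4 / (1/N_cases + 1/N_controls); the text does not state which formula was applied, and in a real meta-analysis the effective sample size of each variant would be summed over the studies in which it was present. The variant names and their counts are hypothetical.

```python
def effective_n(n_cases, n_controls):
    """Effective sample size of a case-control comparison (assumed formula)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)

# Lead variant: strict cases and broad controls as reported in the text
lead_neff = effective_n(3018, 994582)

# Hypothetical variants with the case/control counts available for each
variant_neff = {
    "variant_a": effective_n(2500, 800000),  # present in most studies
    "variant_b": effective_n(600, 150000),   # missing from many studies
}

# Keep variants with at least one-third of the lead variant's effective N
kept = [name for name, neff in variant_neff.items() if neff >= lead_neff / 3]
print(round(lead_neff), kept)
```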
Long COVID meta-analysis with strict case (n = 3,018) and broad control (n = 994,582) definition (Fig. 2). X axis shows the position on chromosome 6 (GRCh38). The long COVID lead variant (rs9367106) is depicted with a triangle in each plot. a, Locus zoom plot with each variant colored by effective sample size and showing statistical significance (IVW GWAS meta-analysis −log10 P value) on y axis. b, Each variant colored by statistical significance and showing effect sizes (center, coefficients; error bar, 95% CI on y axis). c, Each variant colored by ancestry and showing LD correlation coefficient (r) with the long COVID lead variant on y axis. d, Ensembl genes in the region (FOXP4 not fully shown; www.ensembl.org)56.
Frequency of long COVID variants varies across ancestries
The allele frequency of rs9367106-C at the FOXP4 locus varied across the study populations ranging from 1.6% in non-Finnish Europeans to 7.1% in Finnish, 19% in admixed Americans and 36% in East Asians (Supplementary Fig. 4; https://gnomad.broadinstitute.org/variant/6-41515652-G-C?dataset=gnomad_r3). Most of the contributing studies comprised individuals of European ancestry (Supplementary Fig. 5). Despite smaller sample sizes, we observed significant associations in admixed American, East Asian and Finnish ancestries (Fig. 2b), owing to the higher allele frequency, and thus larger statistical power to detect an association with the rs9367106 variant in these cohorts.
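The link between allele frequency and power can be made concrete with a standard large-sample approximation: for an additive model, the information about a log-odds effect scales with 2p(1 − p) times N_cases × N_controls / (N_cases + N_controls). The sketch below applies this to the allele frequencies quoted above, using a hypothetical cohort of 500 cases and 50,000 controls and the discovery effect size; it ignores imputation quality and other real-world factors.

```python
import math

def expected_z(beta, freq, n_cases, n_controls):
    """Approximate expected z score for an additive log-odds effect."""
    info = 2.0 * freq * (1.0 - freq) * n_cases * n_controls / (n_cases + n_controls)
    se = math.sqrt(1.0 / info)
    return beta / se

beta = math.log(1.63)  # discovery OR for rs9367106-C
freqs = {
    "non-Finnish European": 0.016,
    "Finnish": 0.071,
    "admixed American": 0.19,
    "East Asian": 0.36,
}

# Same hypothetical cohort size for every ancestry, so only frequency varies
expected = {pop: expected_z(beta, f, 500, 50_000) for pop, f in freqs.items()}
for pop, z in expected.items():
    print(f"{pop}: expected z = {z:.2f}")
```

At equal sample size, the expected test statistic grows with the square root of p(1 − p), so the same effect is far easier to detect where the risk allele is common.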
Risk variants, FOXP4 expression and COVID-19 severity
We next investigated whether the long COVID variants were associated with differential expression of any of the surrounding genes within a 100-kb window (FOXP4, FOXP4-AS1, LINC01276 and MIR4641). We found that rs12660421-A is associated with an increase in FOXP4 expression in the lung (P = 5.3 × 10⁻⁹, normalized effect size (NES) = 0.56) and in the hypothalamus (P = 2.6 × 10⁻⁶, NES = 1.4; Fig. 4a and Supplementary Fig. 6; GTEx, https://gtexportal.org/home/snp/rs12660421). Furthermore, there were no additional expression quantitative trait loci (eQTL) or colocalization with the expression of FOXP4-AS1 (Supplementary Table 16). FOXP4 (HUGO Gene Nomenclature Committee ID: 20842) is a transcription factor gene that has a broad tissue expression pattern and is expressed in nearly all tissues, with the highest expression in the cervix, the thyroid, the vasculature, the stomach and the testis22. The expression also spans a broad set of cell types, including endothelial lung cells, immune cells and myocytes23. A colocalization analysis suggested that the association signal of long COVID is the same signal that associates with the differential expression of FOXP4 in the lung (posterior probability = 0.91; Supplementary Fig. 7a,b and Supplementary Table 17).
a, The lead variant rs9367106 was not found in the GTEx dataset, but a proxy variant (rs12660421, chr6: 41,520,640) in high LD (r² = 0.97, rs12660421-A allele is correlated with the long COVID risk allele rs9367106-C) showed a significant eQTL after multiple testing correction, increasing FOXP4 expression in the lung (P = 5.3 × 10⁻⁹; NES (expression with GA genotype compared to expression with GG, normalized to 0) = 0.56; GTEx v8 lung samples with GG genotype, n = 483, GA genotype, n = 32; https://gtexportal.org/home/snp/rs12660421). For other tissues, see multitissue eQTL plot in Supplementary Fig. 6. b, Colocalization analysis using eQTL data from GTEx v8 tissue type and long COVID GWAS meta-analysis association data (Supplementary Note). Plots illustrate −log10 P value for long COVID (x axis) and for FOXP4 expression in the lung (y axis), regional association of the FOXP4 locus variants with long COVID (top right) and regional association of the FOXP4 variants with RNA expression measured in the lung in GTEx (bottom right). Variants are colored by 1000 Genomes European-ancestry LD r² with the lead variant (rs12660421) for FOXP4 expression in lung tissue (the most significant long COVID variant overlapping the GTEx v8 dataset (rs9381074) also annotated). c, Human Protein Atlas RNA single-cell type tissue cluster data (transcript expression levels summarized per gene and cluster) of lung (GSE130148) showing FOXP4 expression in unaffected individuals. The values were visualized using log10 (pTPM + 1) values. Each annotation is taken from the clustering results performed in the Human Protein Atlas. pTPM, protein-coding transcripts per million.
Furthermore, variants in the FOXP4 region have also been identified as risk factors for COVID-19 hospitalization, colocalizing with FOXP4 expression eQTL in the COVID-19 HGI meta-analyses and follow-up studies16,24 (Supplementary Fig. 8 and Supplementary Table 18). Our colocalization analysis demonstrated the FOXP4 association identified here as the same association identified for COVID-19 severity (posterior probability > 0.97; Supplementary Fig. 7e,f and Supplementary Table 17).
FOXP4 expression in blood is associated with long COVID
To understand whether higher FOXP4 expression was seen in long COVID, we collected blood samples from participants with or without active SARS-CoV-2 infection. We found that higher FOXP4 levels in nonacute COVID-19 samples were associated with increased risk of long COVID (OR = 2.31 per 1 s.d. increase in FOXP4 expression, 95% CI = 1.27–4.22, P = 0.0063; Supplementary Fig. 9), while FOXP4 levels in acute COVID-19 samples were not associated with long COVID (P = 0.62). This provides evidence, orthogonal to the genetic signal, that higher FOXP4 levels may lead to long COVID.
FOXP4 expression in alveolar and immune cells in the lung
As lung tissue consists of several cell types, we wanted to elucidate the relevant cells that express FOXP4 and may contribute to long COVID. We analyzed single-cell sequencing data from the Tabula Sapiens, a previously published atlas of single-cell sequencing data in healthy individuals free of COVID-19 (ref. 25). FOXP4 expression was the highest in type 2 alveolar cells in individuals without SARS-CoV-2 infection (Fig. 4c) and during active infection (Supplementary Fig. 10), suggesting that SARS-CoV-2 infection was not required for FOXP4 expression. Furthermore, type 2 alveolar cells are capable of mounting robust innate immune responses, thus participating in the immune regulation in the lung. Additionally, type 2 alveolar cells secrete surfactant, keep the alveoli free from fluid, and serve as progenitor cells repopulating damaged epithelium after injury26. In addition, we observed nearly equally high expression of FOXP4 in granulocytes that similarly participate in the regulation of innate immune responses. Overall, the findings suggest a possible role of both immune and alveolar cells in the lung and higher expression of FOXP4 in long COVID.
FOXP4 variants located at active chromatin in the lung
To understand the possible causal variation at the FOXP4 locus, we performed statistical fine mapping using SLALOM27 (Supplementary Note). There were nine variants within the 95% credible set with the maximum posterior probability of 0.28 for rs9381074 (Supplementary Fig. 11). Given the strong LD pattern among the nine variants within the credible set, fine mapping alone might not be able to pinpoint a single causal variant in this locus. Therefore, to understand possible functional regulatory effects behind the variant association, we used the data from the Regulome database28,29, ENCODE30 and VannoPortal31. While the majority of the long COVID variants were at active enhancer or transcription factor binding sites, four variants had direct evidence of transcription factor binding based on chromatin immunoprecipitation sequencing experiments (Supplementary Tables 19 and 20). One of these variants (rs9381074) was directly located on a region that had histone modification marks across multiple tissues, including immune and lung cells (H3K27me3, H3K4me1, H3K4me2, H3K4me3 and H3K27ac), and had evidence of transcriptional activity from 49 different transcription factors, of which we saw the most consistent direct binding of FOXA1 across 55 experiments. Furthermore, we downloaded DNase sequencing data from the ENCODE project and observed that rs9381074 was directly positioned on a DNase hypersensitivity site in the lung (Supplementary Note). Finally, this variant is the same variant implicated by statistical fine mapping, suggesting the rs9381074 variant as the causal variant for association at the FOXP4 locus.
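How a 95% credible set arises from summary statistics can be sketched under a single-causal-variant assumption, using Wakefield approximate Bayes factors. This is a generic illustration rather than the SLALOM implementation: the prior variance of 0.04, the variant labels and the effect estimates are all assumptions made for the example.

```python
import math

def log_abf(beta, se, prior_var=0.04):
    """Log approximate Bayes factor for association (Wakefield); prior_var is assumed."""
    v = se**2
    z = beta / se
    shrink = prior_var / (v + prior_var)
    return 0.5 * math.log(v / (v + prior_var)) + 0.5 * z**2 * shrink

def credible_set(variants, coverage=0.95):
    """Posterior probabilities and the smallest set reaching the coverage."""
    logs = {name: log_abf(b, s) for name, (b, s) in variants.items()}
    top = max(logs.values())                       # subtract max for stability
    total = sum(math.exp(l - top) for l in logs.values())
    pips = {name: math.exp(l - top) / total for name, l in logs.items()}
    chosen, cumulative = [], 0.0
    for name, pip in sorted(pips.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(name)
        cumulative += pip
        if cumulative >= coverage:
            break
    return pips, chosen

# Hypothetical (beta, SE) pairs: three variants in tight LD and one weak variant
variants = {"v1": (0.45, 0.07), "v2": (0.44, 0.07), "v3": (0.43, 0.07), "v4": (0.10, 0.07)}
pips, chosen = credible_set(variants)
print({k: round(v, 3) for k, v in pips.items()}, chosen)
```

When several variants carry nearly identical evidence, as with variants in strong LD, the posterior mass is shared among them and no single variant dominates, which mirrors the situation described for this locus.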
FOXP4 variant associated with lung cancer
To understand the role of FOXP4 and its associations across diseases, we performed phenome-wide association analysis. We first focused on Biobank Japan32, as the long COVID risk allele frequency is highest in East Asia. Phenome-wide association study (PheWAS) between rs9367106 and all phenotypes in Biobank Japan (n = 262) revealed that the long COVID risk allele was associated with lung cancer (P = 1.2 × 10⁻⁶, Bonferroni P = 3.1 × 10⁻⁴, OR = 1.13, 95% CI = 1.07–1.18; Supplementary Fig. 8 and Supplementary Table 18). Furthermore, the long COVID risk allele is in LD with the known risk variants for non-small cell lung carcinoma in Chinese and European populations33 (rs1853837, r² = 0.88 in East Asians34) and for lung cancer in never-smoking Asian women35 (rs7741164, r² = 0.98 in East Asians34). Colocalization analysis supported that the associations in this locus (within 500 kb of rs9367106) for long COVID and lung cancer shared the same genetic signal (colocalization posterior probability = 0.98; Supplementary Fig. 7c,d). COVID-19 phenotypes and lung cancer traits were the only associations found with linked variants in the GWAS Catalog (Supplementary Table 21).
We then broadened the analysis to other cohorts. Using data from FinnGen and Open Targets, we observed a robust gene-level PheWAS association with prostate cancer, immune traits including reticulocytes and chronotype (Supplementary Tables 22–24). Moreover, colocalization analysis provided by Open Targets showed that FOXP4 expression and FOXP4 splice QTLs colocalized with blood count traits specifically in the blood and the thyroid, but the blood count traits did not colocalize with the expression in the lung (Supplementary Table 25). These findings suggest that separate regulatory variation may contribute to tissue-specific expression and the control of otherwise ubiquitously expressed FOXP4 and contribute to trait associations in a tissue-specific manner.
Long COVID and other phenotypes
We investigated the relationship between long COVID and cardiometabolic, behavioral and psychiatric traits36 (Fig. 5 and Supplementary Table 26). We found positive genetic correlations between long COVID and insomnia symptoms, depression, risk tolerance, asthma, diabetes and SARS-CoV-2 infection, while we saw negative correlations with red and white blood cell counts (Fig. 5a). However, identified correlations were only nominally significant without multiple testing correction (P < 0.05; Supplementary Table 27). The observed scale heritability estimates of long COVID ranged from 0.97% to 12.36% (s.e. = 0.0362), with the highest heritability in the strict case and strict control definitions (Supplementary Table 28).
a,b, LD score regression (a, LDSC, top; Supplementary Table 27) and IVW MR (b, fixed-effects model, bottom; Supplementary Table 29 and Supplementary Data) were used for calculating two-sided P values. The size of each colored square corresponds to statistical significance (***P < 0.0001, full-sized square; **P < 0.01, full-sized square; *P < 0.05, full-sized square; P < 0.1, large square; P < 0.5, medium square and P > 0.5, small square; not corrected for multiple comparisons). A full list of traits is provided in Supplementary Table 26. For sample sizes in each long COVID GWAS meta-analysis using strict (S) or broad (B) case and control definitions, see Supplementary Table 11. c, MR scatter plot with effect sizes (β ± s.e.) of each variant on COVID-19 susceptibility (reported SARS-CoV-2 infection) as exposure and long COVID (strict case, broad control definition) as outcome (P (IVW, fixed effects) = 1.8 × 10⁻⁷, pleiotropy P = 0.47; Supplementary Table 30). d, Similarly, MR with COVID-19 hospitalization as exposure and long COVID as outcome (P (IVW, fixed effects) = 4.8 × 10⁻⁸, pleiotropy P = 0.83; Supplementary Table 30). e, Analysis of shared and unique effects between SARS-CoV-2 infection susceptibility and long COVID using a Bayesian mixture model showed ABO and 3p21.31 rs73062389 as having shared effects (posterior probability > 0.99). FOXP4 variant association was discovered in the long COVID meta-analyses but showed also an effect on the susceptibility of the initial infection, though smaller than on long COVID (Supplementary Table 34). (Effects shown as β, error bars represent 95% confidence intervals.) f, Similarly, analysis of shared and unique effects between COVID-19 severity and long COVID using a Bayesian mixture model showed FOXP4 variant with a joint effect (posterior probability > 0.9), differing from the other severity variants due to its larger effect on long COVID (Supplementary Table 35).
BMI, body mass index; CRP, C-reactive protein; eGFR, estimated glomerular filtration rate; ADHD, attention-deficit hyperactivity disorder.
We used Mendelian randomization (MR) to estimate potential risk factors by analyzing the same traits mentioned above (Supplementary Table 26). Genetically predicted earlier smoking initiation (P = 0.022), more cigarettes consumed per day (P = 0.046), higher levels of high-density lipoproteins (P = 0.029) and higher body mass index (P = 0.046) were nominally significant causal risk factors of long COVID (Fig. 5b and Supplementary Table 29). However, none of these associations survived correction for multiple comparisons.
FOXP4 signal not explained simply by COVID-19 severity
Earlier research has suggested that COVID-19 severity is a risk factor for long COVID8,37,38,39 and FOXP4 variants have earlier been implicated in COVID-19 severity6. Our initial GWAS and robust replication across different cohorts show FOXP4 variants also associated with long COVID. However, the results pose an interesting question of whether the mechanism of FOXP4 association with long COVID is the same mechanism that contributes to COVID-19 severity. We thus investigated the relationship between COVID-19 hospitalization and long COVID by performing a two-sample MR (Supplementary Table 30). In terms of causality, we caution that COVID-19 hospitalization as causal exposure is difficult to interpret because long COVID and COVID-19 hospitalization are both outcomes of the same underlying infection. Nevertheless, the relationship between the effect size for long COVID versus the effect size for COVID-19 severity can shed some light on the role of COVID-19 severity in long COVID. To perform two-sample MR without overlapping samples, we excluded the studies that contributed to the current long COVID freeze 4 and computed a meta-analysis of SARS-CoV-2 infection susceptibility and COVID-19 hospitalization of the remaining cohorts in the COVID-19 HGI. We observed a causal relationship of susceptibility and hospitalization on long COVID (strict case and broad control definition; inverse variance-weighted (IVW) MR, P = 1.8 × 10⁻⁷ for infection and P = 4.8 × 10⁻⁸ for hospitalization) with no evidence of pleiotropy (MR–Egger intercept P = 0.47 and 0.83, respectively; Fig. 5c,d and Supplementary Table 30). Furthermore, sensitivity analyses leaving one variant out (Supplementary Table 31), or including long COVID cohorts with European ancestry only (Supplementary Table 32), both supported a robust causal association between COVID-19 hospitalization and long COVID.
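The fixed-effects IVW estimator used in two-sample MR is a weighted regression, through the origin, of variant effects on the outcome against variant effects on the exposure. The sketch below shows that calculation with hypothetical instrument effects constructed to have a true slope of 0.35; it omits the MR–Egger intercept test and the other sensitivity analyses described above.

```python
import math

def ivw_mr(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effects IVW MR slope (regression through the origin)."""
    num = sum(x * y / s**2 for x, y, s in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(x**2 / s**2 for x, s in zip(beta_exposure, se_outcome))
    slope = num / den
    se = math.sqrt(1.0 / den)
    p = math.erfc(abs(slope / se) / math.sqrt(2.0))  # two-sided normal P value
    return slope, se, p

# Hypothetical instruments: outcome effects are exactly 0.35 x exposure effects
exposure = [0.10, 0.20, 0.30, 0.15]
outcome = [0.35 * b for b in exposure]
outcome_se = [0.02, 0.03, 0.04, 0.02]

slope, slope_se, slope_p = ivw_mr(exposure, outcome, outcome_se)
print(f"IVW slope = {slope:.2f} (SE = {slope_se:.3f})")
```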
Nevertheless, the Wald ratio of long COVID to COVID-19 hospitalization for the FOXP4 variant is 1.97 (95% CI = 1.36–2.57), which is significantly greater than the slope of the MR-estimated relationship between COVID-19 hospitalization and long COVID (0.35, 95% CI = 0.12–0.57). Furthermore, adjusting or stratifying the long COVID GWAS for hospitalization did not explain the association between FOXP4 and long COVID (Supplementary Table 33a).
Thus, the FOXP4 signal demonstrates a stronger association with long COVID than expected, meaning that it cannot simply be explained by its association with either susceptibility or severity of the acute disease alone (Fig. 5c,d). A recent systematic review of epidemiological data found a positive association between COVID-19 hospitalization and long COVID with a relationship on a log-odds scale of 0.91 (95% CI = 0.68–1.14)40. Even assuming this stronger relationship between COVID-19 hospitalization and long COVID, the observed effect of the FOXP4 variant on long COVID still exceeds what would be expected based on the association with severity alone.
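The comparison between the FOXP4 Wald ratio and the two slopes quoted above can be approximated from the reported intervals alone. The sketch assumes that each estimate is normally distributed, that the 95% CIs are symmetric Wald intervals and that the estimates are independent; these are simplifying assumptions of this illustration, not the test used in the study.

```python
import math

def se_from_ci(low, high):
    """Standard error implied by a symmetric 95% Wald interval."""
    return (high - low) / (2 * 1.96)

def z_difference(est_a, se_a, est_b, se_b):
    """z test for the difference of two independent estimates."""
    z = (est_a - est_b) / math.sqrt(se_a**2 + se_b**2)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Values as reported in the text
wald, wald_se = 1.97, se_from_ci(1.36, 2.57)      # FOXP4 Wald ratio
mr_slope, mr_se = 0.35, se_from_ci(0.12, 0.57)    # MR-estimated slope
epi_slope, epi_se = 0.91, se_from_ci(0.68, 1.14)  # epidemiological log-odds slope

z_mr, p_mr = z_difference(wald, wald_se, mr_slope, mr_se)
z_epi, p_epi = z_difference(wald, wald_se, epi_slope, epi_se)
print(f"vs MR slope: z = {z_mr:.2f}; vs epidemiological slope: z = {z_epi:.2f}")
```

Under these assumptions the Wald ratio exceeds both slopes, with the larger gap against the MR-estimated slope.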
Given that SARS-CoV-2 infection is required for COVID-19 disease, and for severe COVID-19, an important question is whether all genetic variants that increase COVID-19 susceptibility or severity are equally large risk factors for long COVID. Bayesian methods provide an opportunity to estimate whether some variants that affect COVID-19 susceptibility or severity systematically contribute to the risk of long COVID more than the other variants. To answer this question, we estimated the posterior probabilities for all susceptibility and severity variants for long COVID using four models: susceptibility/severity only, long COVID only and two models for joint effects that differed in their slopes. We observed that for COVID-19 susceptibility, the 3p21.31 locus and the ABO locus contributed to both susceptibility and long COVID with a high posterior probability (Fig. 5e and Supplementary Table 34). Moreover, while many severity variants are also likely to contribute to long COVID, their slope between long COVID and severity effects was smaller than that of FOXP4 (Fig. 5f and Supplementary Table 35).
Finally, previous studies have shown a potential effect of vaccination, strain and severity on long COVID5,7,41,42,43,44. To clarify the role of these factors in long COVID, we used data from additional cohorts, including FinnGen. We observed that, while adjusting for severity or vaccination status did not remove the signal, the FOXP4 risk alleles possibly conferred a stronger risk before vaccination and with the wild-type and Alpha strains (Supplementary Table 33b,c). The association of the FOXP4 locus with long COVID was significant in individuals before vaccination. Although the effect remained positive postvaccination (OR = 1.3), the lack of significant association in these cases may be influenced by the relatively small sample size of individuals diagnosed with long COVID after vaccination (n = 40; Supplementary Table 33b). Earlier epidemiological studies have shown that immunization against COVID-19 is associated with a reduced risk of long COVID43,44,45. Our data are in line with these earlier observations. Furthermore, we sought replication for the strain association in the Estonian Biobank, where higher risk was also observed with earlier strains, particularly the Alpha strain (P = 0.0138).
The possible time-dependent association with strain prompted us to explore the temporal relationship between FOXP4 and long COVID from the start of 2020 until the spring of 2023. Using data from 3,684 individuals with long COVID from FinnGen, we observed a significant temporal association with the Cox proportional hazards model (HR = 1.3, 95% CI = 1.1–1.7, P = 0.005, n (population controls) = 496,664; Supplementary Fig. 12). In particular, homozygosity for the FOXP4 risk allele increased the risk for long COVID (recessive P = 2.3 × 10⁻⁴, OR = 5.64, 95% CI = 2.25–14.17). We also observed consistently higher risk allele homozygosity among long COVID cases in the Estonian Biobank and MexGen-COVID (Supplementary Note). Overall, these results indicate a temporal relationship between FOXP4 risk variants and long COVID, and higher risk with homozygosity and earlier viral strains. In all these analyses, FOXP4 stood out as an independent risk factor for long COVID.
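A recessive model contrasts risk allele homozygotes with all other genotypes. As a minimal, unadjusted sketch, the odds ratio and a Woolf-type 95% CI can be computed from a 2 × 2 table; the counts below are hypothetical, and the study estimates were presumably obtained from covariate-adjusted regression rather than from raw counts.

```python
import math

def recessive_odds_ratio(case_hom, case_other, control_hom, control_other):
    """Unadjusted OR for risk-allele homozygotes versus all other genotypes."""
    odds_ratio = (case_hom * control_other) / (case_other * control_hom)
    # Woolf standard error of the log odds ratio
    se = math.sqrt(1 / case_hom + 1 / case_other + 1 / control_hom + 1 / control_other)
    ci_low = math.exp(math.log(odds_ratio) - 1.96 * se)
    ci_high = math.exp(math.log(odds_ratio) + 1.96 * se)
    return odds_ratio, ci_low, ci_high

# Hypothetical genotype counts (not data from this study)
odds_ratio, ci_low, ci_high = recessive_odds_ratio(10, 990, 20, 9980)
print(f"OR = {odds_ratio:.2f}, 95% CI = {ci_low:.2f}-{ci_high:.2f}")
```

With few homozygous cases the interval is wide, which is consistent with the broad CI reported for the recessive estimate.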
FOXP4 associates with multiple symptoms of long COVID
We aimed to investigate the symptomatic associations between FOXP4 and long COVID. We focused on well-established components of long COVID as documented in earlier literature7. Using symptom data from the two largest cohorts, FinnGen and MVP, we re-examined the association of FOXP4 with long COVID, requiring lifetime symptoms from any of the previously identified subtypes. Our analysis revealed consistent associations across both MVP and FinnGen cohorts, with fatigue and asthma diagnoses, and β-adrenergic and proton pump inhibitor medication showing significant associations in the meta-analysis of the two cohorts (Supplementary Fig. 13 and Supplementary Table 36). The replication of these associations in datasets from two different countries, with distinct healthcare settings and patient populations, strengthens the robustness of the link between FOXP4 and the plethora of manifestations of long COVID.
Discussion
In this study, we aimed to understand the host genetic factors that contribute to long COVID, using data from 24 studies across 16 countries and replicating in independent cohorts. Our analysis identified genetic variants within the FOXP4 locus as a risk factor for long COVID. The FOXP4 gene is expressed in the lung and the genetic variants associated with long COVID are also associated with differential expression of FOXP4 and with lung cancer and COVID-19 severity. Additionally, using MR, we characterized COVID-19 severity as a causal risk factor for long COVID. Overall, our findings provide genomic evidence consistent with previous epidemiological and clinical reports of long COVID, indicating that long COVID, similarly to other postviral conditions, is a heterogeneous disease entity where likely both individual genetic variants and the environmental risk factors contribute to disease risk.
Our analysis revealed a connection between long COVID and pulmonary endpoints through both individual variants at FOXP4, a transcription factor-coding gene previously linked to lung cancer and COVID-19 severity24, and MR analysis identifying smoking and COVID-19 severity as risk factors. Furthermore, expression analysis of the lung, and cell type-specific single-cell sequencing analysis, showed FOXP4 expression in both alveolar cell types and immune cells of the lung.
FOXP4 belongs to the subfamily P of the forkhead box transcription factor family genes and is expressed in various tissues, including the lungs and the gut45,46. Moreover, it is highly expressed in mucus-secreting cells of the stomach and intestines47, as well as in naïve B, natural killer and memory Treg cells48, and is required for normal T cell memory function following infection49. FOXP1/FOXP2/FOXP4 are also required for promoting lung endoderm development by repressing expression of nonpulmonary transcription factors50, and the loss of FOXP1/FOXP4 adversely affects airway epithelial regeneration51. Furthermore, FOXP4 has been implicated in airway fibrosis52 and the promotion of lung cancer growth and invasion53. We find that the variants associated with long COVID are also associated with lung cancer in Biobank Japan32. These observations, together with the present study, suggest that the connection between FOXP4 and long COVID may be rooted in both lung function and immunology. Furthermore, FOXP4 expression in both alveolar and immune cells in the lung, and the association with severe COVID-19 and pulmonary diseases such as cancer, suggest that FOXP4 may participate in local immune responses in the lung.
Our functional analysis further implicated FOXP4 as a risk factor for long COVID, irrespective of the genotype status of the risk variant identified here. FOXP4 expression levels were higher in individuals with long COVID than in controls. Furthermore, we observed a consistent effect of FOXP4 risk variants across ancestries. Moreover, having multiple ancestries enabled us to fine-map a likely causal variant at rs9381074, which was further supported by functional methylation and expression data.
We also discovered a causal relationship between SARS-CoV-2 infection and long COVID, as expected, and an additional causal risk between severe, hospital treatment-requiring COVID-19 and long COVID. This finding is in agreement with earlier epidemiological observations8,37,38,39. The relationship between COVID-19 severity and long COVID raises an interesting question: when SARS-CoV-2 infection is required for both COVID-19 and severe COVID-19, are all genetic variants that increase COVID-19 susceptibility or severity equally large risk factors for long COVID? In the present study, we aimed to answer this question by examining variant effect sizes between SARS-CoV-2 infection susceptibility, COVID-19 severity and long COVID using stratified and adjusted analyses, and by Bayesian modeling. Among the known SARS-CoV-2 susceptibility loci, ABO and 3p21.31 had a high probability of also contributing to long COVID. Moreover, the FOXP4 variants had higher effect sizes for long COVID than expected based on the other severity variants, suggesting an independent role of FOXP4 for long COVID that was not observed among the other COVID-19 severity variants. Such an observation offers clues on biological mechanisms, such as FOXP4 affecting pulmonary function and immunity, which then contribute to the development of long COVID. Overall, our study elucidates genetic risk factors for long COVID, the relationship between long COVID and severe COVID-19, and finally possible mechanisms of how FOXP4 contributes to the risk of long COVID.
Moreover, while several lines of evidence, from the original GWAS association and replication to the stratified and Bayesian analyses and the significance of individual variants, suggest that FOXP4 contributes to long COVID more strongly than expected, the mechanism by which FOXP4 associates with long COVID may be the same mechanism that contributes to COVID-19 severity. Future studies and iterations of this work will likely grow the number of observed genetic variants and further clarify the biological mechanisms underlying long COVID. We also caution that the genetic predisposition to long COVID might be dependent on SARS-CoV-2 variation and vaccination status, and that a large portion of our data was collected before the omicron wave and widespread vaccination (Supplementary Table 12), which might have an impact on the genetic associations.
The contribution of genetic factors to COVID-19 phenotypes is intriguing. As heritability in general is defined as the proportion of phenotypic variation attributable to genetic differences within a specific environment, in a hypothetical world where every environmental factor would be similar, heritability would theoretically approach 100%. However, as the heritability in infections can be shaped by exposure, viral strain, prophylactics, earlier immunity (for example, through vaccination efforts), or differences in diagnostic criteria, reporting or local recommendations, estimating heritability requires relatively large samples for precise estimates. Similarly, heritability in earlier studies of COVID-19 phenotypes was initially less than 1% for COVID-19 susceptibility, severity and critical illness even with over 46,000 COVID-19 cases and 2 million controls16. However, all COVID-19 traits showed robust genetic correlations with the known COVID-19 epidemiological risk factors. In our study, we similarly see low heritability with long COVID, which is a limitation in the current study. Nonetheless, the estimate provides a tool to understand between-trait correlations and will likely become more precise with larger sample sizes.
We recognize that the symptomatology of long COVID is variable and includes, in addition to lung symptoms, other symptom domains such as fatigue and cognitive dysfunction7,37,54. In addition, the long-term effects of COVID-19 are still being studied, and more research is needed to understand the full extent of the long-term damage caused by SARS-CoV-2 and long COVID disease. We also recognize that the long COVID diagnosis is still evolving. Nevertheless, our study provides direct genetic evidence that lung pathophysiology can have an integral part in the development of long COVID.
Methods
Contributing studies
Participants of each of the contributing 33 studies provided written informed consent to participate in each respective study, with recruitment and ethics following study-specific protocols approved by their respective institutional review boards (details are provided in Supplementary Table 12).
For the initial discovery analysis, we used data from the following 24 studies: Avon Longitudinal Study of Parents and Children (ALSPAC), Bonn Study of COVID Genetics (BoSCO), Banque québécoise de la COVID-19 (BQC19), Danish Blood Donor Study (DBDS), Extended Cohort for E-health, Environment and DNA (EXCEED), FinnGen, GCAT | Genomes for life, Genetic Bases of COVID-19 Clinical Variability (GEN-COVID), Genotek, Genetics of Long COVID (GOLD), Helix Exome+ and Healthy Nevada Project COVID-19 Phenotypes (Helix), MexGen-COVID Initiative, COVID-19 Ioannina Biobank (Ioannina), Genome-wide assessment of the gene variants associated with severe COVID-19 phenotype in Iran (IrCovid), Japan COVID-19 Task Force (JapanTaskForce), Lifelines, Norwegian Mother, Father and Child Cohort Study (MoBa), Mount Sinai COVID Biobank (MSCIC), Penn Medicine BioBank (PMBB), Follow-UP study of patients with critical COVID-19/COVID-19 Cohort Study of the University Hospital of the Technical University Munich (SweCovid/COMRI), Tirschenreuth Study (TiKoCo), TwinsUK, UK Biobank and Understanding Society: UK Household Longitudinal Study. The total sample size of this Long COVID HGI data freeze 4 was 6,450 long COVID cases, 46,208 COVID-19-positive controls and 1,093,995 population controls (Supplementary Table 12). For the replication of the FOXP4 lead variants, we obtained data from the following nine additional studies: COVID-19 cohort at LGDB (LatviaGDB), COVID-19 Genomics Network (C19-GenoNet), COVID-19 Host Immune Response Pathogenesis Study (CHIRP), Estonian Biobank (EstBB), Fondazione Genomics SARS-CoV-2 Study (FoGS), GENCOV Study (GENCOV), Mass General Brigham Biobank (MGB), The Post-hospitalization COVID-19 study (PHOSP-COVID) and VA MVP. The replication datasets together comprised 9,500 individuals with long COVID and 798,835 population controls (Supplementary Fig. 3d,e and Supplementary Table 12).
The effective sample sizes for each study shown in Fig. 1 were calculated for display using the given formula: (4 × ncase × ncontrol)/(ncase + ncontrol). The Long COVID HGI is a global and ongoing collaboration, open to all studies around the world that have data to run long COVID GWAS using our phenotypic criteria described below.
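The effective sample size formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name and the example counts are hypothetical and are not taken from the study's code.

```python
def effective_sample_size(n_case: int, n_control: int) -> float:
    """Effective sample size: (4 * n_case * n_control) / (n_case + n_control)."""
    if n_case < 0 or n_control < 0:
        raise ValueError("counts must be non-negative")
    total = n_case + n_control
    if total == 0:
        return 0.0
    return 4.0 * n_case * n_control / total


# Hypothetical counts for illustration only.
balanced = effective_sample_size(1000, 1000)    # equals the total sample size
unbalanced = effective_sample_size(100, 9900)   # far below the total of 10,000
```

With a balanced design the effective sample size equals the total sample size, whereas a strongly unbalanced design is limited mainly by the number of cases.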
Phenotype definitions
We used the following criteria for assigning case–control status for long COVID aligning with the World Health Organization guidelines1 (Supplementary Note; https://github.com/long-covid-hg/LongCovidTools/blob/main/PhenotypeDefinitions_LongCOVID_v1.docx). Study participants were defined as long COVID cases if, at least three months since SARS-CoV-2 infection or COVID-19 onset, they met any of the following criteria:
1. Presence of one or more self-reported COVID-19 symptoms that cannot be explained by an alternative diagnosis
2. Report of ongoing substantial impact on day-to-day activities
3. Any diagnosis codes of long COVID (for example, post-COVID-19 condition, ICD-10 code U09(.9))
Criteria 1 and 2 were applied only to questionnaire-based cohorts, whereas 3 was used in studies with electronic health records (EHR). Detailed phenotyping criteria and diagnosis codes of each study are provided in Supplementary Table 12.
We used two long COVID case definitions, a strict definition requiring a test-verified SARS-CoV-2 infection and a broad definition including self-reported or clinician-diagnosed SARS-CoV-2 infection (any long COVID).
We applied two control definitions. First, we used population controls, that is, everybody who is not a case. Population controls were genetic ancestry-matched individuals who were not defined as long COVID cases using the above-mentioned questionnaire or EHR-based definition. In the second analysis, we compared long COVID cases to individuals who had had SARS-CoV-2 infection but who did not meet the criteria of long COVID, that is, had fully recovered within three months from the infection.
We used in total four different case–control definitions to generate four GWASs as below:
1. Long COVID cases after test-verified SARS-CoV-2 infection versus population controls (the strict case definition versus the broad control definition)
2. Long COVID within test-verified SARS-CoV-2 infection (the strict case definition versus the strict control definition)
3. Any long COVID cases versus population controls (the broad case definition versus the broad control definition)
4. Long COVID within any SARS-CoV-2 infection (the broad case definition versus the strict control definition)
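The four case–control definitions can be illustrated with a minimal Python sketch. The `Participant` fields, the `ANALYSES` mapping and the function name are hypothetical. The sketch also assumes that anyone flagged with long COVID who does not qualify as a case in a given analysis is excluded rather than used as a control; the text does not spell this out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Participant:
    infected: bool       # any SARS-CoV-2 infection (test-verified, self-reported or clinician-diagnosed)
    test_verified: bool  # infection confirmed by a test
    long_covid: bool     # meets the long COVID criteria


# analysis number -> (strict case definition?, strict control definition?)
ANALYSES = {1: (True, False), 2: (True, True), 3: (False, False), 4: (False, True)}


def assign_status(p: Participant, analysis: int) -> Optional[int]:
    """Return 1 for case, 0 for control, None for excluded."""
    strict_case, strict_control = ANALYSES[analysis]
    qualifying_infection = p.test_verified if strict_case else p.infected
    if p.long_covid:
        # Assumption: long COVID without a qualifying infection is excluded.
        return 1 if qualifying_infection else None
    if strict_control:
        # Strict controls: infected individuals who did not develop long COVID.
        return 0 if qualifying_infection else None
    # Broad (population) controls: everybody who is not a case.
    return 0
```

For example, a participant with self-reported infection and long COVID would be a case in analyses 3 and 4 but excluded from analyses 1 and 2 under these assumptions.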
To further investigate the effect of FOXP4 locus on the different manifestations of long COVID7 in the FinnGen and MVP datasets, we used combined criteria of any long COVID diagnosis (BB: ICD-10 diagnosis code: U09* (where * can be empty or any string, referring to subdiagnoses)) with lifetime occurrence of specific symptom diagnoses: diabetes (ICD-10: E10*, E11*, E12*, E13*, E14*), fatigue and malaise (ICD-10: R53*, G93.3), asthma (ICD-10: J45*), skin paresthesia (ICD-10: R20.2), β-adrenergic inhalants (Anatomical Therapeutic Chemical (ATC) drug code: R03AC*), headache (ICD-10: R51*), proton pump inhibitors (ATC: A02BC*) or cardiac arrhythmia/abnormalities of heartbeat (ICD-10: I49*, R00*; Supplementary Fig. 13 and Supplementary Table 36). The effect of the risk variant rs9367106-C on long COVID with each symptom or medication was estimated separately using logistic regression, adjusting for age, sex and ten principal components. Finnish ancestry from FinnGen and African, Admixed American and European ancestries from the MVP were first analyzed separately, followed by a meta-analysis and test for heterogeneity.
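The wildcard code matching described above (for example, U09*, where * can be empty or any string) can be sketched in Python as follows. The dictionary keys, function names and the choice of shell-style pattern matching are illustrative assumptions; the cohorts' actual phenotyping code may differ.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List

LONG_COVID_PATTERNS = ["U09*"]

# Code patterns as listed in the text (ICD-10 for diagnoses, ATC for medications).
FEATURE_PATTERNS: Dict[str, List[str]] = {
    "diabetes": ["E10*", "E11*", "E12*", "E13*", "E14*"],
    "fatigue_malaise": ["R53*", "G93.3"],
    "asthma": ["J45*"],
    "skin_paresthesia": ["R20.2"],
    "beta_adrenergic_inhalants": ["R03AC*"],
    "headache": ["R51*"],
    "proton_pump_inhibitors": ["A02BC*"],
    "arrhythmia": ["I49*", "R00*"],
}


def matches_any(codes: Iterable[str], patterns: Iterable[str]) -> bool:
    """True if any code matches any pattern; '*' matches an empty or arbitrary suffix."""
    patterns = list(patterns)
    return any(fnmatchcase(code, pat) for code in codes for pat in patterns)


def long_covid_with_feature(diagnoses: Iterable[str],
                            feature_codes: Iterable[str],
                            feature: str) -> bool:
    """Long COVID diagnosis combined with lifetime occurrence of one feature."""
    return (matches_any(diagnoses, LONG_COVID_PATTERNS)
            and matches_any(feature_codes, FEATURE_PATTERNS[feature]))
```

Diagnoses and medication purchases are passed separately here so that ICD-10 and ATC codes are never compared against each other's patterns.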
GWAS
We largely applied the GWAS analysis plans used in the COVID-19 HGI16. Each study performed its own sample collection, genotyping, genotype and sample quality control, imputation and association analyses independently, according to our central analysis plan (https://github.com/long-covid-hg/LongCovidTools/blob/main/COVID19HostGenetics_AnalysisPlan_LongCOVID_v1.docx), before submitting the GWAS summary statistic level results for meta-analysis (details are provided in Supplementary Table 12). We recommended that GWASs were run using REGENIE57 on chromosomes 1–22 and X, although a minority of the contributing studies used SAIGE58 or PLINK2 (ref. 59; Supplementary Table 12). The minimum set of covariates to be included at runtime were age, age², sex, age × sex and the first ten genetic principal components. We advised studies to include any additional study-specific covariates where needed, such as those related to genotype batches or other demographic and technical factors that could lead to stratification within the cohort. Studies (n = 2) performing the GWAS using software that does not account for sample relatedness (such as PLINK) were advised to exclude related individuals.
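As a rough illustration of the minimum covariate set (age, age squared, sex, the age-by-sex interaction and ten principal components), the following Python sketch builds one covariate row per participant. The function name and the 0/1 coding of sex are assumptions made for the example; each study followed its own conventions.

```python
from typing import List, Sequence

N_PCS = 10  # first ten genetic principal components


def covariate_row(age: float, sex: int, pcs: Sequence[float]) -> List[float]:
    """Return [age, age^2, sex, age*sex, PC1..PC10] for one participant.

    Assumption: sex is coded 0/1; the coding used by each study may differ.
    """
    if len(pcs) != N_PCS:
        raise ValueError(f"expected {N_PCS} principal components, got {len(pcs)}")
    return [float(age), float(age) ** 2, float(sex), float(age) * sex, *pcs]


# Hypothetical participant: 50 years old, sex coded as 1, PCs set to zero.
row = covariate_row(50.0, 1, [0.0] * N_PCS)
```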
GWAS meta-analyses
The meta-analysis pipeline was also adopted from the COVID-19 HGI flagship paper16. The code is available at Long COVID HGI GitHub (https://github.com/long-covid-hg/META_ANALYSIS/) and is a modified version of the pipeline developed for the COVID-19 HGI (https://github.com/covid19-hg/META_ANALYSIS). To ensure that individual study results did not suffer from excessive inflation, deflation and false positives, we manually investigated plots of the reported allele frequencies against aggregated gnomAD v3.0 (ref. 55) allele frequencies in the same population. We also evaluated whether the association standard errors were excessively small, given the calculated effective sample size, to identify studies deviating from the expected trend. Where these issues were detected, the studies were contacted to reperform the association analysis, if needed, and resubmit their results.
Before the meta-analysis itself, the summary statistics were standardized, filtered (excluding variants with allele frequency <0.1% or imputation INFO score <0.6), lifted over to reference genome build GRCh38 (in studies imputed to GRCh37) and harmonized to gnomAD v3.0 through matching by chromosome, position and alleles (Supplementary Note).
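The filtering and matching steps can be sketched as follows. This is a simplified Python illustration, not the pipeline itself: the function names are hypothetical, a missing INFO score is assumed to pass the filter, and allele flipping and lift-over are not handled.

```python
from typing import Optional, Tuple

MIN_ALLELE_FREQ = 0.001  # 0.1%, as quoted in the text
MIN_INFO = 0.6           # imputation INFO threshold quoted in the text


def passes_filters(allele_freq: float, info: Optional[float]) -> bool:
    """Keep a variant unless allele frequency < 0.1% or INFO < 0.6."""
    if allele_freq < MIN_ALLELE_FREQ:
        return False
    # Assumption: variants without an INFO score (e.g. directly genotyped) are kept.
    if info is not None and info < MIN_INFO:
        return False
    return True


def harmonization_key(chrom: str, pos: int, ref: str, alt: str) -> Tuple[str, int, str, str]:
    """Key for matching a variant to a reference panel by chromosome, position and alleles."""
    c = str(chrom)
    if c.lower().startswith("chr"):
        c = c[3:]
    return (c, int(pos), ref.upper(), alt.upper())
```

Two records would then be treated as the same variant when their keys are equal.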
The meta-analysis was performed using a fixed-effects IVW method on variants that were present in at least two studies contributing to the specific phenotype being analyzed. To assess whether one study was primarily driving any associations, we simultaneously ran a leave-most-significant-study-out (LMSSO) meta-analysis for each variant (based on the variant’s study-level P value). Heterogeneity between studies was estimated using Cochran’s Q test60. Each set of meta-analysis results was then filtered to exclude variants whose total effective sample size (in the non-LMSSO analysis) was less than one-third of the total effective sample size of all studies contributing to that meta-analysis. We report significant loci that pass the genome-wide significance threshold (P ≤ 5 × 10⁻⁸/4 = 1.25 × 10⁻⁸), accounting for the number of GWAS meta-analyses we performed.
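For a single variant, the fixed-effects inverse-variance-weighted (IVW) estimate, Cochran's Q statistic and the LMSSO analysis can be sketched in Python as below. This is a minimal illustration using only the standard library, not the META_ANALYSIS pipeline; the function names are hypothetical, and the most significant study is approximated here as the one with the largest absolute z-score.

```python
import math
from typing import Dict, Sequence

# Bonferroni-corrected threshold for four GWAS meta-analyses.
GENOME_WIDE_P = 5e-8 / 4


def ivw_fixed_effects(betas: Sequence[float], ses: Sequence[float]) -> Dict[str, float]:
    """Fixed-effects IVW meta-analysis of per-study effects for one variant."""
    if len(betas) != len(ses) or not betas:
        raise ValueError("need matching, non-empty effect and standard error lists")
    weights = [1.0 / se ** 2 for se in ses]
    w_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P value
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))  # Cochran's Q
    return {"beta": beta, "se": se, "p": p, "q": q, "q_df": float(len(betas) - 1)}


def leave_most_significant_study_out(betas: Sequence[float],
                                     ses: Sequence[float]) -> Dict[str, float]:
    """Re-run the IVW meta-analysis after dropping the study with the largest |z|."""
    if len(betas) < 2:
        raise ValueError("need at least two studies")
    z_abs = [abs(b / s) for b, s in zip(betas, ses)]
    drop = z_abs.index(max(z_abs))
    kept_b = [b for i, b in enumerate(betas) if i != drop]
    kept_s = [s for i, s in enumerate(ses) if i != drop]
    return ivw_fixed_effects(kept_b, kept_s)
```

Under the null hypothesis of no heterogeneity, Q is compared against a chi-squared distribution with `q_df` degrees of freedom; that P value is omitted here to keep the sketch dependency-free.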
Principal component projection
In a similar fashion to the COVID-19 HGI, we asked each study to project their cohort onto a multiethnic genetic principal component space (Supplementary Fig. 5), by providing studies with precomputed PC loadings and reference allele frequencies from unrelated samples from the 1000 Genomes Project20,21 and the Human Genome Diversity Project. The loadings and frequencies were generated for a set of 117,221 autosomal, common (minor allele frequency (MAF) ≥ 0.1%) and LD-pruned (r2 < 0.8; 500-kb window) SNPs that would be available in the imputed data of most studies. Access to the projecting and plotting scripts was made available to the studies at https://github.com/long-covid-hg/pca_projection.
eQTL, PheWAS and colocalization
For the single (Bonferroni-corrected) genome-wide significant lead variant, rs9367106, we used the GTEx portal (https://gtexportal.org/)22,23 to understand whether this variant had any tissue-specific effects on gene expression. As rs9367106 was not available in the GTEx database, we first identified a proxy variant, rs12660421 (r2 = 0.90) using all individuals from the 1000 Genomes Project20,21 and then performed a lookup in the portal’s GTEx v8 dataset23.
To identify other phenotypes associated with rs9367106, we used the Biobank Japan PheWeb portal (https://pheweb.jp/)32 to perform a phenome-wide association analysis, as the MAF of rs9367106 is highest in East Asia. Furthermore, we explored variant and locus-level associations in Estonian Biobank, FinnGen and Open Targets.
To assess whether the FOXP4 association is shared between long COVID and tissue-specific eQTLs, lung cancer and COVID-19 hospitalization, we extracted a 1-Mb region centered on rs9367106 (chr6: 41,015,652–42,015,652) from the lung cancer and COVID-19 hospitalization summary statistics and the GTEx v8 data and performed colocalization analyses using the R package coloc (v5.1.0.1)61,62 in R v4.2.2. Colocalization locus zoom plots were created using the LocusCompareR R package v1.0.0 (ref. 63), with LD r2 estimated using 1000 Genomes European-ancestry individuals20,21.
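The 1-Mb extraction window can be reproduced with simple arithmetic. The Python sketch below takes the midpoint of the quoted interval (41,515,652) as the centre; the function name is hypothetical.

```python
from typing import Tuple


def centered_window(center: int, width: int = 1_000_000) -> Tuple[int, int]:
    """Return (start, end) of a window of the given width centered on a position.

    The start is clipped at 1 so that the window never extends before the chromosome start.
    """
    half = width // 2
    return max(1, center - half), center + half


# Midpoint of the interval quoted in the text (chr6: 41,015,652 to 42,015,652).
start, end = centered_window(41_515_652)
```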
Genetic correlation and MR
We assessed the genetic overlap and causal associations between long COVID outcomes and the same set of risk factors, biomarkers and disease liabilities as in the COVID-19 HGI flagship paper16. Additionally, we tested the overlap and causal impact of COVID-19 susceptibility and hospitalization risk. Genetic correlations were assessed using Linkage Disequilibrium Score Regression v1.0.1 (ref. 64). Where there were sufficient genome-wide significant variants, the causal impact was tested in a two-sample MR framework using the TwoSampleMR (v0.5.6) R package65 with R v4.0.3. To avoid sample overlap between exposure GWASs (here COVID-19 hospitalization and SARS-CoV-2 reported infection) and outcome GWASs (here long COVID phenotypes), we performed meta-analyses of COVID-19 hospitalization and SARS-CoV-2 reported infection using data freeze 7 of the COVID-19 HGI by excluding studies that participated in the long COVID (data freeze 4) effort. Independent significant exposure variants with P ≤ 5 × 10⁻⁸ were identified by LD-clumping the full set of summary statistics using an LD r2 threshold of 0.001 (based on the 1000 Genomes European-ancestry reference samples20,21) and a 10-Mb clumping window. For each exposure–outcome pair, these variants were then harmonized to remove variants with mismatched alleles and ambiguous palindromic variants (MAF > 45%). Fixed-effects IVW meta-analysis was used as the primary MR method, with MR–Egger, weighted median estimator, weighted mode-based estimator and MR-PRESSO used in sensitivity analyses. Heterogeneity was assessed using the MR-PRESSO global test and pleiotropy using the MR–Egger intercept. The genetic correlation and MR analyses were implemented as a Snakemake Workflow made available at https://github.com/marcoralab/MRcovid. Leave-one-variant-out-MR and European-only long COVID analyses were run as sensitivity analyses to test the robustness of MR results with COVID hospitalization as exposure and long COVID as outcome.
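The primary MR estimator and the palindromic-variant filter can be sketched in Python. This is a simplified illustration, not the TwoSampleMR implementation: the function names are hypothetical, the fixed-effects IVW estimate uses first-order weights (outcome variance only), and the palindromic check simply applies the MAF > 45% rule quoted above.

```python
import math
from typing import Sequence, Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous_palindromic(allele1: str, allele2: str, effect_allele_freq: float,
                             maf_threshold: float = 0.45) -> bool:
    """True for A/T or C/G variants whose MAF is too close to 50% to infer the strand."""
    maf = min(effect_allele_freq, 1.0 - effect_allele_freq)
    palindromic = COMPLEMENT.get(allele1.upper()) == allele2.upper()
    return palindromic and maf > maf_threshold


def mr_ivw(beta_exposure: Sequence[float], beta_outcome: Sequence[float],
           se_outcome: Sequence[float]) -> Tuple[float, float]:
    """Fixed-effects IVW causal estimate and its standard error.

    Equivalent to an inverse-variance-weighted mean of the per-variant Wald
    ratios beta_outcome / beta_exposure, ignoring uncertainty in the exposure effects.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)) or not beta_exposure:
        raise ValueError("need matching, non-empty input lists")
    num = sum(bx * by / so ** 2 for bx, by, so in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / so ** 2 for bx, so in zip(beta_exposure, se_outcome))
    return num / den, math.sqrt(1.0 / den)
```

In a full analysis, harmonization (aligning effect alleles between exposure and outcome) would precede the call to `mr_ivw`, and the sensitivity estimators listed above would be run alongside it.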
Summaries of the exposure GWAS are provided in Supplementary Table 26, and the association statistics for all exposure variants are provided in Supplementary Data.
Bayesian clustering of effects based on linear relationships
We compared effect size estimates between long COVID and COVID severity, and similarly, between long COVID and SARS-CoV-2 infection. COVID-19 hospitalization was used as a proxy for severity. For this purpose, we selected those variants that had earlier association evidence at the genome-wide significant level for COVID-19 severity or SARS-CoV-2 infection and examined whether these variants had joint or higher effect than expected for long COVID. The linemodels R package was utilized for comparing linear relationships (https://github.com/mjpirinen/linemodels)66. This line model method performs probabilistic clustering of variables based on their observed effect sizes on two outcomes (Supplementary Note).
Statistics and reproducibility
To maximize the statistical power for detecting genetic variants associated with long COVID, we used data from as many cohorts as possible with information on long COVID and study participants without long COVID. Moreover, to ensure reproducibility, we examined the robustness and replication of the signal across nine independent cohorts that joined the Long COVID HGI after data freeze 4 where the association was initially discovered.
For additional methodological details, see Supplementary Note.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
We have made the results of these GWAS meta-analyses publicly available for variants passing post-meta-analysis filtering for MAF ≥ 1% and effective sample size >1/3 of the maximum effective sample size for each meta-analysis. The results from the four meta-analyses have been deposited to GWAS Catalog67 and LocusZoom68, where the associations can be visually explored and the summary statistics exported for further scientific discovery.
Strict case definition (long COVID after test-verified SARS-CoV-2 infection) versus broad control definition (population control):
https://www.ebi.ac.uk/gwas/studies/GCST90454540
https://my.locuszoom.org/gwas/192226/
Broad case definition (long COVID after any SARS-CoV-2 infection) versus broad control definition:
https://www.ebi.ac.uk/gwas/studies/GCST90454541
https://my.locuszoom.org/gwas/826733/
Strict case definition versus strict control definition (individuals that had SARS-CoV-2 but did not develop long COVID):
https://www.ebi.ac.uk/gwas/studies/GCST90454542
https://my.locuszoom.org/gwas/793752/
Broad case definition versus strict control definition:
Code availability
Instructions and example code for phenotyping, sample collection, genotyping, genotype and sample quality control, imputation and association analyses are shared in our central analysis plan (https://github.com/long-covid-hg/LongCovidTools/blob/main/COVID19HostGenetics_AnalysisPlan_LongCOVID_v1.docx, https://github.com/long-covid-hg/LongCovidTools/blob/main/PhenotypeDefinitions_LongCOVID_v1.docx). Furthermore, we have used GitHub public repositories for providing code for GWAS summary statistics lift-over and meta-analyses (https://github.com/long-covid-hg/META_ANALYSIS, modified from the previously published COVID-19 HGI pipeline15,16), for PCA projecting and plotting (https://github.com/long-covid-hg/pca_projection) and for MR and genetic correlation (https://github.com/marcoralab/MRcovid). Code used for fine mapping (https://github.com/mkanai/slalom)27 and Bayesian clustering of effects based on linear relationships (https://github.com/mjpirinen/linemodels)66 is also publicly available and has been previously published.
References
Soriano, J. B., Murthy, S., Marshall, J. C., Relan, P. & Diaz, J. V. A clinical case definition of post-COVID-19 condition by a Delphi consensus. Lancet Infect. Dis. 22, e102–e107 (2022).
Desai, A. D., Lavelle, M., Boursiquot, B. C. & Wan, E. Y. Long-term complications of COVID-19. Am. J. Physiol. Cell Physiol. 322, C1–C11 (2022).
Mehandru, S. & Merad, M. Pathological sequelae of long-haul COVID. Nat. Immunol. 23, 194–202 (2022).
Hugon, J., Msika, E.-F., Queneau, M., Farid, K. & Paquet, C. Long COVID: cognitive complaints (brain fog) and dysfunction of the cingulate cortex. J. Neurol. 269, 44–46 (2022).
Ceban, F. et al. Fatigue and cognitive impairment in post-COVID-19 syndrome: a systematic review and meta-analysis. Brain Behav. Immun. 101, 93–135 (2022).
Sykes, D. L. et al. Post-COVID-19 symptom burden: what is long-COVID and how should we manage it? Lung 199, 113–119 (2021).
Davis, H. E., McCorkell, L., Vogel, J. M. & Topol, E. J. Long COVID: major findings, mechanisms and recommendations. Nat. Rev. Microbiol. 21, 133–146 (2023).
Global Burden of Disease Long COVID Collaborators. et al. Estimated global proportions of individuals with persistent fatigue, cognitive, and respiratory symptom clusters following symptomatic COVID-19 in 2020 and 2021. JAMA 328, 1604–1615 (2022).
Mizrahi, B. et al. Long COVID outcomes at one year after mild SARS-CoV-2 infection: nationwide cohort study. BMJ 380, e072529 (2023).
Wong, A. C. et al. Serotonin reduction in post-acute sequelae of viral infection. Cell 186, 4851–4867 (2023).
Appelman, B. et al. Muscle abnormalities worsen after post-exertional malaise in long COVID. Nat. Commun. 15, 17 (2024).
Cervia-Hasler, C. et al. Persistent complement dysregulation with signs of thromboinflammation in active long COVID. Science 383, eadg7942 (2024).
The COVID-19 Host Genetics Initiative The COVID-19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARS-CoV-2 virus pandemic. Eur. J. Hum. Genet. 28, 715–718 (2020).
Nakanishi, T. et al. Age-dependent impact of the major common genetic risk factor for COVID-19 on severity and mortality. J. Clin. Invest. 131, e152386 (2021).
Kanai, M. et al. A second update on mapping the human genetic architecture of COVID-19. Nature 621, E7–E26 (2023).
COVID-19 Host Genetics Initiative Mapping the human genetic architecture of COVID-19. Nature 600, 472–477 (2021).
Ellinghaus, D. et al. Genomewide association study of severe COVID-19 with respiratory failure. N. Engl. J. Med. 383, 1522–1534 (2020).
Pairo-Castineira, E. et al. Genetic mechanisms of critical illness in COVID-19. Nature 591, 92–98 (2021).
Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
GTEx Consortium The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
GTEx Consortium The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
D’Antonio, M. et al. SARS-CoV-2 susceptibility and COVID-19 disease severity are associated with genetic variants affecting gene expression in a variety of tissues. Cell Rep. 37, 110020 (2021).
Tabula Sapiens Consortium The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science 376, eabl4896 (2022).
Mason, R. J. Biology of alveolar type II cells. Respirology 11, S12–S15 (2006).
Kanai, M. et al. Meta-analysis fine-mapping is often miscalibrated at single-variant resolution. Cell Genom. 2, 100210 (2022).
Boyle, A. P. et al. Annotation of functional variation in personal genomes using RegulomeDB. Genome Res. 22, 1790–1797 (2012).
Dong, S. et al. Annotating and prioritizing human non-coding variants with RegulomeDB v.2. Nat. Genet. 55, 724–726 (2023).
ENCODE Project Consortium An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Huang, D. et al. VannoPortal: multiscale functional annotation of human genetic variants for interrogating molecular mechanism of traits and diseases. Nucleic Acids Res. 50, D1408–D1416 (2022).
Nagai, A. et al. Overview of the BioBank Japan Project: study design and profile. J. Epidemiol. 27, S2–S8 (2017).
Dai, J. et al. Identification of risk loci and a polygenic risk score for lung cancer: a large-scale prospective cohort study in Chinese populations. Lancet Respir. Med. 7, 881–891 (2019).
Machiela, M. J. & Chanock, S. J. LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics 31, 3555–3557 (2015).
Wang, Z. et al. Meta-analysis of genome-wide association studies identifies multiple lung cancer susceptibility loci in never-smoking Asian women. Hum. Mol. Genet. 25, 620–629 (2016).
COVID-19 Host Genetics Initiative A first update on mapping the human genetic architecture of COVID-19. Nature 608, E1–E10 (2022).
Sudre, C. H. et al. Attributes and predictors of long COVID. Nat. Med. 27, 626–631 (2021).
Subramanian, A. et al. Symptoms and risk factors for long COVID in non-hospitalized adults. Nat. Med. 28, 1706–1714 (2022).
Resendez, S. et al. Defining the subtypes of long COVID and risk factors for prolonged disease: population-based case-crossover study. JMIR Public Health Surveill. 10, e49841 (2024).
Tsampasian, V. et al. Risk factors associated with post-COVID-19 condition: a systematic review and meta-analysis. JAMA Intern. Med. 183, 566–580 (2023).
Al-Aly, Z., Bowe, B. & Xie, Y. Long COVID after breakthrough SARS-CoV-2 infection. Nat. Med. 28, 1461–1467 (2022).
Antonelli, M. et al. Risk factors and disease profile of post-vaccination SARS-CoV-2 infection in UK users of the COVID Symptom Study app: a prospective, community-based, nested, case-control study. Lancet Infect. Dis. 22, 43–55 (2022).
Ayoubkhani, D. et al. Trajectory of long covid symptoms after covid-19 vaccination: community based cohort study. BMJ 377, e069676 (2022).
Du, M., Ma, Y., Deng, J., Liu, M. & Liu, J. Comparison of long COVID-19 caused by different SARS-CoV-2 strains: a systematic review and meta-analysis. Int. J. Environ. Res. Public Health 19, 16010 (2022).
Lu, M. M., Li, S., Yang, H. & Morrisey, E. E. Foxp4: a novel member of the Foxp subfamily of winged-helix genes co-expressed with Foxp1 and Foxp2 in pulmonary and gut tissues. Mech. Dev. 119, S197–S202 (2002).
Takahashi, K., Liu, F.-C., Hirokawa, K. & Takahashi, H. Expression of Foxp4 in the developing and adult rat forebrain. J. Neurosci. Res. 86, 3106–3116 (2008).
Uhlén, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419 (2015).
Acknowledgements
We are extremely grateful to all the participants, healthcare professionals, interviewers, computer and laboratory technicians, clerical workers, research scientists, volunteers, managers and everyone participating in making possible the collection and analysis of datasets contributing to this study. We acknowledge the funding and research infrastructure support in Supplementary Note (see also the full Long COVID HGI author information in Supplementary Table 2).
Funding
Open access funding provided by Max Planck Society.
Author information
Contributions
V.L., T.N., S.E.J., H.Z. and H.M.O. contributed to scientific leadership, project management, experimental design and conception, ethics and governance, and bioinformatics. V.L., T.N., S.E.J., H.Z., H.M.O. and the Long COVID HGI were members of the steering committee. V.L., S.E.J., T.N., H.Z., A.A.R., A.H.-C., A.M., A.N., A.R.D., A.S., A.S.F.K., B.C., B.G.-G., C.B., C.B.S., C.A.R.W., D.C.P., D.M.J., E.A., E.F., E.T.C., E.V., F.M., H.E.O., J.M.L., K.A.-V., K.B., L.C.-S., L.G., L.M., M.M., M.V., O.C.L., R.E., R.E.M., R.K.C., R.R., S.A., S.S.V., T.W.W., M.B., M.M.-H. and N.S.-A. performed primary cohort data analyses. V.L., T.N., S.E.J., M.B. and J.K. performed GWAS meta-analyses. S.E.J., T.N., V.L., H.Z., S.J.A., M. Kanai, A.O.-G., B.E.F.-H., H.H.H., M.P., A.K.M. and N.S.-A. performed follow-up analyses. A. Renieri, A. Rakitko, M. Kumari, A.C., A.N., C.E., C.J., E.C.S., E.L.D., F.G., G.D.S., H.M.O., I.M.H., J.B.R., J.J.G., J.L.-E., K.C., K.K.T., K.U.L., L.A., L.H.F., L.V.C.V., L.V.W., M.I., M.M.-H., N.D.B., N.J.T., O.B.V.P., P.J.S., P.M., R.A.V., R.d.C., R.K.M., R.W., S.A.L., S.L., S.S.V., T.T.-L., Y.O., A.O.-G., M.B., A.S. and H.Z. contributed to data/sample collection. Data for initial discovery GWASs (Long COVID HGI data freeze 4) was collected by DBDS, EstBB, FinnGen, GEN-COVID, GENCOV, MexGen-COVID (Supplementary Tables 1, 3–8 and 12), ALSPAC, BoSCO, BQC19, EXCEED, GCAT (COVICAT), Genotek, GOLD, Helix, Ioannina, IrCovid, JapanTaskForce, Lifelines, MoBa, MSCIC, PMBB, SweCovid, COMRI, TiKoCo, TwinsUK, UKB and Understanding Society (Supplementary Tables 11 and 12). Replication datasets were provided by PHOSP-COVID, MVP (Supplementary Tables 9, 10 and 12), LatviaGDB, C19-GenoNet, CHIRP, EstBB, FoGS and MGB (Supplementary Table 12). V.L., S.E.J., T.N., H.Z., A.G., A.K., A.N., E.L.D., E.M., H.F.A., M.J.D., M.M.-H., M.M.M., N.S.-A., P.S., U.A.Z., A. Renieri, A. Rakitko, M. Kumari, A.C., C.E., C.J., E.C.S., F.G., G.D.S., H.M.O., I.M.H., J.B.R., J.J.G., J.L.-E., K.C., K.K.T., K.U.L., L.A., L.H.F., L.V.C.V., L.V.W., M.I., N.D.B., N.J.T., O.B.V.P., P.J.S., P.M., R.A.V., R.d.C., R.K.M., R.W., S.A.L., S.L., S.S.V., T.T.-L., Y.O., A.A.R., A.H.-C., A.M., A.R.D., A.S., A.S.F.K., B.C., B.G.G., C.B., C.B.S., C.A.R.W., D.C.P., D.M.J., E.A., E.F., E.T.C., E.V., F.M., H.E.O., J.M.L., K.A.-V., K.B., L.C.-S., L.G., L.M., M.M., M.V., O.C.L., R.E., R.E.M., R.K.C., R.R., S.A. and T.W.W. wrote and reviewed the manuscript. All other authors were involved in the design, management, coordination or analysis of contributing studies. See Supplementary Tables 2–10 for more detailed information on author contributions and roles.
Ethics declarations
Competing interests
S.B. has ownerships in Intomics A/S, Hoba Therapeutics Aps, Novo Nordisk A/S, Lundbeck A/S, ALK abello A/S, Eli Lilly and Co and has managing board memberships in Proscion A/S and Intomics A/S. A.B., K.M.S.B., S.W., N.L.W., F.T., E.S. and E.T.C. are employees of Helix. A.D. received an honorarium from Gilead Sciences. A.L.G. and C.J. have funded research collaborations with Orion for collaborative research projects outside the submitted work. T.H. and H.E.O.B. have options in Sano Genetics. P.J.S. is a shareholder of Sano Genetics. T.H.K. has received consulting fees from Albireo, Boehringer Ingelheim, MSD and Falk Pharma. K.U.L. is cofounder and member of the scientific board of LAMPseq Diagnostics GmbH. T.N. has received a speaking fee from Boehringer Ingelheim for talks unrelated to this research. M.E.K.N. is a current employee of Novartis Pharma AG. J.B.R.’s institution has received investigator-initiated grant funding from Eli Lilly, GlaxoSmithKline and Biogen for projects unrelated to this research. He is the CEO of 5 Prime Sciences (www.5primesciences.com), which provides research services for biotech, pharma and venture capital companies for projects unrelated to this research. V.F. is an employee of 5 Prime Sciences. C.D.S. reports grants and personal fees from AstraZeneca, Janssen-Cilag and ViiV Healthcare, personal fees and nonfinancial support from BBraun Melsungen, grants, personal fees and nonfinancial support from Gilead Sciences, personal fees from BioNtech, Eli Lilly, Formycon, Pfizer, Roche, Apeiron, GSK, Molecular partners, SOBI, AbbVie, MSD and Synairgen and grants from Cepheid. L.V.W. reports research funding from GlaxoSmithKline, Genentech and Orion Pharma, and consultancy for Galapagos and GlaxoSmithKline, outside of the submitted work. J.W. is a consultant for Roboscreen GmbH, Biogen GmbH, Immungenetics AG, Noselab GmbH, Roche Diagnostics International, Roche Pharma AG, Janssen-Cilag GmbH, Eisai GmbH, Boehringer Ingelheim and Lilly Deutschland GmbH and has received honoraria from Eisai GmbH, Biogen GmbH, AGNP e. V., Veranex, Med Update GmbH, Guangzhou Gloryren Medical Technology (China), Pfizer Pharma GmbH, Fachverband Rheumatologische Fachassistenz e. V., AWO Psychiatrie Akademie gGmbH, Neuroakademie E. V., Beijing Yibai Science und Technology Ltd., Abbott Laboratories GmbH, Lilly Deutschland GmbH, Simon & Kucher and streamedup! GmbH. The other authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–13 and Supplementary Note (Supplementary Methods and Acknowledgements).
Supplementary Tables
Supplementary Tables 1–36.
Supplementary Data
Harmonized association statistics for MR exposures and outcomes.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lammi, V., Nakanishi, T., Jones, S.E. et al. Genome-wide association study of long COVID. Nat Genet 57, 1402–1417 (2025). https://doi.org/10.1038/s41588-025-02100-w