Introduction

Coronavirus disease 2019 (COVID-19) is a respiratory and systemic disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), responsible for one of the largest pandemics in history. As of April 2024, the disease had affected more than 704 million people, accounting for approximately 7 million deaths worldwide1. Despite worldwide advances in vaccination and the consequent decrease in fatal cases, new variants of concern with greater potential for vaccine escape and spread have emerged, prompting health authorities to remain vigilant for possible new waves of infection and hospitalizations2.

Most SARS-CoV-2 infections are mild, typically presenting with fever and cough, and recovery usually occurs within 2 to 3 weeks. Nonetheless, some patients may progress to severe complications, including acute respiratory distress syndrome (ARDS), septic shock, coagulation disorders, and multiple organ failure3. Several risk factors for severe COVID-19 have been described, including advanced age, male sex, smoking, and pre-existing conditions such as hypertension, diabetes, and cardiovascular, renal, or respiratory diseases4,5. However, these factors do not fully explain why some previously healthy young individuals with COVID-19 require hospitalization and ventilatory support6,7.

Differences in a host’s genetic profile, determined by common and rare genetic variants that influence susceptibility and clinical outcomes of COVID-19, may be decisive in explaining the variability in disease severity among patients8. The genome-wide association study (GWAS) meta-analysis conducted by the COVID-19 Host Genetics Initiative (HGI), including 219,692 cases and over 3 million controls, identified 51 distinct significant loci associated with critical illness, hospitalization, and susceptibility to SARS-CoV-2 infection9. However, this classical GWAS approach has not been applied to detect rare genetic variants that may influence the host response to SARS-CoV-2, particularly in patients exhibiting extreme COVID-19 phenotypes, such as fatal outcomes in previously healthy young and middle-aged individuals.

The COVID Human Genetic Effort (COVIDHGE) has played an important role in uncovering rare genetic factors associated with severe COVID-19. Initial studies identified inborn errors of immunity (IEIs) affecting type I interferon (IFN-I) pathways in 23 patients with severe disease10, involving mutations in genes such as TLR3 and IRF7. Furthermore, COVIDHGE demonstrated that some individuals without IEIs carry neutralizing autoantibodies against IFN-I, which act as a “phenocopy” of these genetic defects, impairing the antiviral response11. The consortium also reported that rare variants in the X-chromosomal TLR7 gene, which encodes a crucial viral sensor for the immune response, accounted for approximately 1% of critical pneumonia cases in men under 6012. Subsequent research revealed that rare autosomal inborn errors of type I IFN-dependent immunity to influenza viruses also underlie critical COVID-19 cases, particularly in individuals under 60 years of age13.

Genetic variants with the most significant influence on COVID-19 severity are likely rare. Nonetheless, not all rare variants necessarily exert a significant effect on disease severity14. Therefore, we hypothesized that an in-depth screening of loci previously associated with COVID-19 and gene- and variant-level prioritization could uncover rare variants with significant effects on disease severity and mortality.

Prior studies investigating rare variants associated with COVID-19 severity and mortality have identified genes involved in viral invasion mechanisms and molecules involved in inflammatory signaling15,16,17. Nevertheless, these studies have primarily focused on identifying genetic factors influencing the risk of death in COVID-19 patients from predominantly European populations. Host genetic factors for COVID-19 remain underexplored in Latin American populations. Due to the limited availability of genomic data from non-European and admixed individuals, studies targeting this gap are crucial. Genomic information can be integrated into clinical decision-making, influencing patient management and prognosis protocols. In this study, we conducted whole genome sequencing (WGS) on young and middle-aged Brazilian adults without pre-existing health conditions to identify rare genetic variants potentially implicated in life-threatening COVID-19.

Materials and methods

Subjects and clinical data

This retrospective cross-sectional study included 161 unrelated patients admitted to intensive care units (ICUs) in Brazil, recruited from August 2020 to September 2021. This cohort is referred to here as COVID-19-BR. In this study, severe COVID-19 was defined strictly based on ICU admission, which was the primary inclusion criterion. To select individuals with extreme phenotypes, participants had to be 18 to 60 years old, have no history of chronic health conditions (e.g., obesity, cancer, diabetes, hypertension, or HIV/AIDS), and have a confirmed SARS-CoV-2 infection. The absence of chronic conditions was determined based on medical history documented at the time of hospital admission, obtained either directly from the patient or, when this was not possible, from a close family member or caregiver. SARS-CoV-2 infection was primarily confirmed by molecular testing (RT-qPCR). In some cases, serological tests (IgG/IgM by immunochromatographic test or ELISA) were also considered, particularly in the early phases of the pandemic when RT-qPCR testing was not always readily available. However, serology was not used as the sole diagnostic criterion for acute infection. Patients were recruited from COVID-19 referral hospitals in the following Brazilian states, which represent all regions of the country: Pernambuco, Bahia, Pará, Mato Grosso, Rio de Janeiro, and Rio Grande do Sul (Fig. 1). Patients’ electronic medical records were accessed to collect information on sex, self-reported race/ethnicity, SARS-CoV-2 test results, and clinical and laboratory findings upon hospital admission. None of the individuals had been previously vaccinated against SARS-CoV-2.

Fig. 1 Distribution of patients in the COVID-19-BR cohort by Brazilian states.

The study protocol was approved by the Research Ethics Committees of the participating institutions: Aggeu Magalhães Institute/Fiocruz, Pernambuco (CAAE 36403820.2.0000.5190); Universidade Federal de Mato Grosso (CAAE 32361020.0.0000.5541); Oswaldo Cruz Institute, National Infectology Institute Evandro Chagas/Fiocruz and Universidade Federal Fluminense, Rio de Janeiro (CAAE 68118417.6.0000.5248; 32169120.1.0000.5262 and 0623520.5.0000.5243); Nossa Senhora da Conceição Hospital, Rio Grande do Sul (CAAE 68118417.6.3003.5530); Universidade Federal do Pará (CAAE 33470020.0.1001.0018). The study adhered to the tenets of the Declaration of Helsinki for research involving human subjects. All patients or their legal representatives signed an informed consent form. In cases where patients were intubated or unable to provide consent due to their medical condition, consent was obtained from the individual responsible for the patient’s hospitalization, in accordance with ethical guidelines and institutional protocols.

DNA extraction and whole genome sequencing

Patients’ genomic DNA was extracted from whole blood using a ReliaPrep Blood gDNA Miniprep System (Promega®) commercial kit. The concentration and purity of DNA samples were assessed using a Qubit® DNA assay kit, with the aid of a Qubit® 2.0 fluorometer (Life Technologies). Sequencing libraries were prepared according to the Illumina DNA PCR-Free Library Prep protocol and quantified using a ProNex NGS Library Quant kit. Genomic sequencing was performed on a NovaSeq 6000® system (Illumina).

For the initial genomic data processing, we employed an established workflow that implements the Genome Analysis Toolkit (GATK) with best practices for calling small germline variants (see https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling version 2.1.1). For each sample, paired-end reads were quality-checked and trimmed using Trimmomatic (parameters: LEADING:3, TRAILING:3, SLIDINGWINDOW:4:15, MINLEN:36). The remaining sequenced reads were aligned to the reference human genome (GRCh38, ENSEMBL release 98) with BWA, sorted using samtools and deduplicated using Picard. Reads were realigned in regions with identified indels to improve accuracy. Base recalibration (base quality score recalibration [BQSR]) was used to identify systematic errors in sequencing data and to recalibrate the quality scores to reflect the actual probability of error.

The next steps of the pipeline identified genetic variants (base substitutions and short indels) by joint calling across all samples. Subsequently, hard filters were applied to reduce false positives (for single nucleotide variants (SNVs): quality by depth [QD] < 2.0 || Fisher strand bias [FS] > 60.0 || mapping quality [MQ] < 40.0 || mapping quality rank sum test [MQRankSum] < −12.5 || read position rank sum test [ReadPosRankSum] < −8.0; and for indels: QD < 2.0 || FS > 200.0 || ReadPosRankSum < −20.0).
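The hard-filter expressions above can be read as a simple rule: a call is flagged when any one of the listed conditions holds. The following Python sketch illustrates that logic only; it is not the GATK implementation used in the pipeline. The function name `fails_hard_filter` and the choice to let a missing annotation pass are assumptions made for illustration.

```python
from typing import Mapping, Optional

# Thresholds copied from the text. A call is flagged if ANY rule is met.
SNV_RULES = [
    ("QD", "<", 2.0),
    ("FS", ">", 60.0),
    ("MQ", "<", 40.0),
    ("MQRankSum", "<", -12.5),
    ("ReadPosRankSum", "<", -8.0),
]
INDEL_RULES = [
    ("QD", "<", 2.0),
    ("FS", ">", 200.0),
    ("ReadPosRankSum", "<", -20.0),
]


def fails_hard_filter(annotations: Mapping[str, Optional[float]],
                      is_indel: bool) -> bool:
    """Return True when the call meets at least one hard-filter condition."""
    rules = INDEL_RULES if is_indel else SNV_RULES
    for name, op, threshold in rules:
        value = annotations.get(name)
        if value is None:
            # Assumption: a missing annotation does not trigger the filter.
            continue
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            return True
    return False


# Hypothetical annotation values, for illustration only.
good_snv = {"QD": 18.2, "FS": 1.3, "MQ": 60.0,
            "MQRankSum": 0.1, "ReadPosRankSum": 0.4}
strand_biased = dict(good_snv, FS=75.0)
print(fails_hard_filter(good_snv, is_indel=False))       # kept
print(fails_hard_filter(strand_biased, is_indel=False))  # filtered as an SNV
print(fails_hard_filter(strand_biased, is_indel=True))   # indel FS limit is 200
```

Keeping the SNV and indel thresholds in separate tables makes the one difference that matters in practice easy to see: indels tolerate a much higher FS value.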

Ancestry inference

To conduct principal component analysis (PCA), the full dataset of unrelated individuals of the 1000 Genomes Project (phase 3), which comprises subjects with European (EUR), African (AFR), East Asian (EAS), South Asian (SAS), and Native American (AMR) ancestries, was used as a reference. This 1000 Genomes panel was merged with the genetic data of the Brazilian cohort, using autosomal variants with minor allele frequency (MAF) > 0.1 that were common to both datasets. Subsequently, the merged data were pruned using PLINK 1.9 software with a window size of 50 markers, a step size of 5, and a variance inflation factor threshold of 1.5, leaving 95,844 markers to calculate principal components with PLINK18.

The ADMIXTURE software19 was used to estimate individual ancestry of the COVID-19-BR sample. This analysis was conducted under an unsupervised mode, using the EUR, AFR, and AMR samples of the 1000 Genomes Project and the same pruning approach described above. K = 3 was assumed based on the main continental parental groups (Europeans, Africans, and Native Americans) that contributed to the formation of the Brazilian population20.

Variant prioritization

To investigate rare variants in loci previously associated with COVID-19 severity, we utilized a list of genomic regions identified by the COVID-19 Host Genetics Initiative (HGI) using the filtered dataset “COVID19_HGI_A2_ALL_leave_23andme_20220403_1e-5.tsv”. Specifically, we collected data from the GWAS meta-analysis, round 7 (very severe respiratory confirmed COVID-19 versus population controls). The total population analyzed included 18,152 cases and 1,145,546 controls. We considered only significant variants (p < 5 × 10⁻⁸) identified by the HGI and grouped these variants into loci, including a 50 kb extension in each of their flanking regions (Supplementary Table S1).
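One plausible way to turn a list of significant variants into flank-extended loci is sketched below. The text does not state the exact grouping rule, so this sketch assumes that each significant variant is given a window of 50 kb on either side and that overlapping windows on the same chromosome are merged into one locus. The function name `build_loci` and the example coordinates are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

FLANK_BP = 50_000  # 50 kb extension stated in the text


def build_loci(variants: Iterable[Tuple[str, int]],
               flank: int = FLANK_BP) -> Dict[str, List[Tuple[int, int]]]:
    """Merge flank-extended windows around (chrom, pos) pairs into loci."""
    by_chrom = defaultdict(list)
    for chrom, pos in variants:
        by_chrom[chrom].append(pos)

    loci = {}
    for chrom, positions in by_chrom.items():
        merged: List[Tuple[int, int]] = []
        for pos in sorted(positions):
            start, end = max(1, pos - flank), pos + flank
            if merged and start <= merged[-1][1]:
                # Window overlaps the previous locus: extend that locus.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        loci[chrom] = merged
    return loci


# Hypothetical positions, for illustration only.
example = [("3", 45_800_000), ("3", 45_850_000), ("3", 50_000_000), ("19", 10_000)]
print(build_loci(example))
```

The resulting intervals could then be written to a BED-like file and used to subset a VCF to the regions of interest.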

VCFtools21 was used to select, in our samples, the variants located in loci previously associated with COVID-19 by the HGI consortium. The variant effects were predicted with the Ensembl Variant Effect Predictor and the Ensembl GRCh38.p14 reference database. Variant prioritization was conducted as follows: (a) functional impact predicted as moderate to high according to the Variant Effect Predictor (Ensembl) algorithm22, including non-synonymous or splice-site variants, whereas synonymous, intronic, and non-coding variants were excluded from the analysis; (b) MAF ≤ 0.01 in the entire dataset (ALL) from the 1000 Genomes Project and the Genome Aggregation Database (gnomAD); (c) a Combined Annotation-Dependent Depletion (CADD)23 score > 15; and (d) a Gene Damage Index (GDI)24 score < 13.84. For variants with a presumably disruptive impact on the protein, such as splice-site, stop-gained, frameshift, stop-lost, and start-lost variants, which could be analyzed by the GDI score but not the CADD score, additional annotations were performed considering the probability of being a loss-of-function (LoF) intolerant (pLI) gene25 and the LofTool26 metrics. It is important to note that neither the pLI threshold nor the LofTool score was used as a filtering criterion during the initial variant prioritization process, but rather as complementary annotations to further characterize the identified genes and assess their potential intolerance to loss-of-function variants.
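The prioritization criteria listed above (predicted impact, population frequency, CADD, and GDI) can be expressed as a single predicate over an annotated variant. The sketch below is illustrative rather than the authors' code. The `Variant` record and its field names are hypothetical, and two handling choices are assumptions: a variant absent from a frequency database is treated as rare, and a variant without a CADD score is evaluated on the remaining criteria only.

```python
from dataclasses import dataclass
from typing import Optional

MAF_MAX = 0.01    # MAF <= 0.01 in 1000 Genomes (ALL) and gnomAD
CADD_MIN = 15.0   # CADD score must be greater than 15
GDI_MAX = 13.84   # GDI score must be below 13.84


@dataclass
class Variant:
    """Hypothetical annotated-variant record used only for this sketch."""
    vep_impact: str              # VEP impact: HIGH, MODERATE, LOW or MODIFIER
    maf_1kgp: Optional[float]    # None = not reported in the database
    maf_gnomad: Optional[float]
    cadd: Optional[float]        # None = no CADD score available
    gdi: float


def is_prioritized(v: Variant) -> bool:
    """Apply the impact, frequency, CADD and GDI criteria in turn."""
    if v.vep_impact not in {"HIGH", "MODERATE"}:
        return False
    for maf in (v.maf_1kgp, v.maf_gnomad):
        # Assumption: a variant missing from a database counts as rare.
        if maf is not None and maf > MAF_MAX:
            return False
    # Assumption: variants lacking a CADD score are judged on GDI alone.
    if v.cadd is not None and v.cadd <= CADD_MIN:
        return False
    return v.gdi < GDI_MAX


# Hypothetical records, for illustration only.
rare_missense = Variant("MODERATE", 0.002, 0.003, 24.1, 5.2)
common_missense = Variant("MODERATE", 0.08, 0.07, 24.1, 5.2)
novel_frameshift = Variant("HIGH", None, None, None, 2.9)
print([is_prioritized(v) for v in (rare_missense, common_missense, novel_frameshift)])
```

Because pLI and LofTool were used as annotations rather than filters, they are deliberately left out of the predicate.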

ClinVar (ncbi.nlm.nih.gov/clinvar/), AlphaMissense27 and the American College of Medical Genetics (ACMG) guidelines were used to assess the pathogenic potential of the prioritized variants. ClinVar was utilized to identify variants previously associated with clinical phenotypes in humans, while AlphaMissense provided predictive classification for missense variants. The pathogenicity assessment followed the ACMG guidelines and was performed using the Franklin software by Genoox28. Variants classified as pathogenic or likely pathogenic by at least one of these sources were included in the analysis, whereas those described as likely benign or benign were excluded.

To verify whether the variants identified in the COVID-19-BR cohort were exclusive to severe cases, we applied an additional filtering step using our internal database, which contains 39 genomes from unvaccinated Brazilian patients with mild COVID-19 symptoms, collected during the same period as the severe cases. Supplementary Figure S1 provides the characterization of these individuals.
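The exclusivity step amounts to removing every candidate that is also observed in the mild-disease genomes. A minimal sketch follows, assuming each variant is keyed by a hypothetical (chromosome, position, reference allele, alternate allele) tuple; the function name `exclusive_to_severe` and the example keys are illustrative only.

```python
from typing import Iterable, List, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt), hypothetical key


def exclusive_to_severe(severe: Iterable[VariantKey],
                        mild: Iterable[VariantKey]) -> List[VariantKey]:
    """Keep candidates never observed in the mild-disease comparison set."""
    seen_in_mild = set(mild)
    return [v for v in severe if v not in seen_in_mild]


# Hypothetical variant keys, for illustration only.
severe_candidates = [("9", 21_206_000, "C", "T"), ("11", 1_180_000, "G", "A")]
mild_variants = [("9", 21_206_000, "C", "T")]
print(exclusive_to_severe(severe_candidates, mild_variants))
```

With only 39 comparison genomes, absence from the mild group is weak evidence on its own, which is why this step is applied after, and not instead of, the frequency and deleteriousness filters.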

Brazilian reference populations

The frequencies of candidate variants for severe COVID-19 identified in the COVID-19-BR cohort were investigated using two reference databases of the Brazilian population: SABE29, which includes 1,171 unrelated individuals from the city of São Paulo and 61,174,462 variants, and the Variant Browser of the “DNA do Brasil” Project (http://www.dnabr.science), which has a sample size of 2,723 individuals with WGS data.

Statistical analysis

Associations between the occurrence of candidate variants for severe COVID-19 and clinical/laboratory parameters were assessed using the Mann-Whitney U test for continuous variables or the chi-square test for categorical variables. The COVID-19-BR cohort was stratified into the following two subgroups: (1) patients carrying prioritized variants and (2) patients not carrying these variants. Additionally, to investigate a potential association between candidate genetic variants and individual global ancestry, we performed a stratified analysis based on global ancestry components (EUR, AFR, AMR), categorizing individuals as above or below the median for each ancestry group. P < 0.05 was considered statistically significant.
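For a two-by-two comparison such as carrier status against a binary clinical outcome, the chi-square test and the median split described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the software used in the study: the Pearson statistic is computed without continuity correction (the text does not say whether one was applied), individuals exactly at the median are placed in the lower group, and the function names are hypothetical.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def above_median(values: Sequence[float]) -> List[bool]:
    """Flag individuals whose ancestry proportion exceeds the cohort median."""
    m = median(values)
    # Assumption: values equal to the median fall in the lower group.
    return [v > m for v in values]


# Hypothetical counts and proportions, for illustration only.
chi2, p = chi_square_2x2(12, 8, 5, 15)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
print(above_median([0.10, 0.45, 0.60, 0.77, 0.90]))
```

The Mann-Whitney U test used for continuous variables is not reproduced here, since a careful implementation needs tie handling that is better left to a statistics library.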

Results

Characteristics of patients with life-threatening or fatal COVID-19

This study analyzed 161 unrelated patients, including those with life-threatening COVID-19 and individuals who did not survive the disease. The median age of the patients was 44 years (interquartile range: 37–53), with the majority being male (n = 112; 69.6%) (Supplementary Fig. S2A). The median time from symptom onset to admission was 10 days, and the median hospital stay was 13 days (Supplementary Fig. S2B; Supplementary Fig. S3A). The patients’ main symptoms upon hospitalization (Supplementary Fig. S3B) included dry cough (70.2%) and dyspnea (68.9%), followed by fever ≥ 38 °C (44.7%) and oxygen saturation below 95% (40.9%). Other frequently reported symptoms were asthenia and muscle pain, both present in 28.6% of cases. Headache was observed in 22.4% of patients. Alterations in smell and/or taste occurred in 16.8%, followed by general malaise (13.4%) and diarrhea (11.8%). As shown in Supplementary Fig. S3C, all patients required ICU admission. Of these, 93.8% required oxygen support, including nasal catheter, Venturi mask, or orotracheal intubation. Ventilatory support was required in 71.4% of cases, with orotracheal intubation alone performed in 45.3%. Additionally, 27.9% of patients required vasopressor therapy. ARDS was diagnosed in 32.9% of patients. Other complications included renal failure (8.0%), need for dialysis/hemodialysis (7.4%), shock (6.8%), and sepsis (4.9%). In total, 33 patients (20.5%) died.

Laboratory results (Supplementary Table S2) showed abnormalities that were consistent with the clinical severity observed in the cohort. The majority of patients had neutrophilia (median 8,228 cells/mm³; IQR: 5,967 to 10,557) and lymphopenia (median 869 cells/mm³; IQR: 565 to 1,188). In addition, elevated C-reactive protein levels were detected, with a median of 21.9 mg/L (IQR: 7.25 to 93.5), providing evidence of systemic inflammation.

Whole genome sequencing and ancestry analysis of COVID-19-BR patients

The median read count for the genomes of the 161 patients was 759.6 million, with a median read depth of 69.1× (Supplementary Fig. S4). Supplementary Table S3 presents a statistical summary of the SNVs and short indels identified through WGS in the COVID-19-BR cohort. The principal component analysis (PCA) of the genomic data revealed that the COVID-19-BR sample formed a heterogeneous group, distributed mainly among European (EUR), African (AFR), and Native American (AMR) reference populations (Fig. 2A). The median global ancestry results were 0.60 (IQR: 0.45 to 0.77) EUR, 0.23 (IQR: 0.11 to 0.35) AFR, and 0.10 (IQR: 0.06 to 0.16) AMR (Fig. 2B).

Fig. 2 Ancestry analyses of the COVID-19-BR cases. (A) Principal component analysis of 161 patients from the COVID-19-BR cohort and samples from the 1000 Genomes Project. (B) Individual ancestry bar plot of COVID-19-BR using unsupervised ADMIXTURE analysis. Abbreviations – AFR: African; AMR: Native American; EUR: European; SAS: South Asian; EAS: East Asian; IQR: interquartile range (first to third quartiles).

Variant prioritization

The entire genome of patients was sequenced to identify rare genetic variants potentially implicated in COVID-19 severity. We initially identified 242,855 variants in loci previously described as being associated with COVID-19 severity by the HGI consortium. Only non-synonymous and splice variants with predicted moderate to high functional impact were selected (n = 3,625). Considering that variants with a major effect on COVID-19 severity are rare, only variants with MAF ≤ 1% in the global datasets of the 1000 Genomes and gnomAD projects were selected (n = 1,498). Variants with a CADD score greater than 15 were prioritized due to their higher likelihood of being deleterious. Additionally, genes with a GDI score below 13.84—indicating lower mutation tolerance and an increased probability of harboring pathogenic variants—were also selected (n = 140).

Finally, we checked the occurrence of these variants in the group of 39 genomes from Brazilian patients with mild COVID-19 symptoms (see ‘Methods’). This analysis revealed that 104 variants, across 79 genes, were exclusively found in severe or fatal COVID-19 cases (Fig. 3; Supplementary Table S4).

Fig. 3 Strategy for prioritizing functional variants in COVID-19-BR cases. The numbers shown at each stage represent the remaining variants after applying the corresponding filters. Abbreviations – 1KGP: 1000 Genomes Project, Phase 3; CADD: Combined Annotation-Dependent Depletion algorithm; GDI: Gene Damage Index; gnomAD: Genome Aggregation Database version 4.1; MAF: minor allele frequency.

These variants were found in 89 patients (55.3%), with 35 of these patients carrying two or more variants (Fig. 4A). Most of the variants were found in a heterozygous state, with only two variants identified in a homozygous state. Missense variants were the most frequent class, representing 41.5% of the total (Fig. 4B). Twenty-six variants (24.5%) were classified as pathogenic or likely pathogenic based on data from ClinVar, AlphaMissense, or ACMG criteria (Table 1).

Fig. 4 Frequency of patients and functional classification of variants in COVID-19-BR cases. (A) Distribution of patients by the number of rare variants potentially implicated in COVID-19 severity. (B) Distribution of rare variants by functional consequences.

Table 1 Pathogenic and likely pathogenic rare variants in COVID-19-BR cases.

Six variants in the MUC5AC gene, including three frameshift variants and three in-frame deletions, were identified across nine patients (Supplementary Table S4). Among these alterations, the rs1590143470 variant had a MAF below 0.0005 in reference populations, while the remaining variants were novel and not reported in reference population databases.

Furthermore, in the IFNA10 gene, a LoF variant (rs145785282) that introduces a premature stop codon was identified in seven patients. This variant is rare, with a MAF of 0.008 in the 1KGP and 0.009 in gnomAD. The gene with the third-highest recurrence of variants was ZNF778, with three SNVs identified in six patients. Moreover, the missense variant rs563641001 (MAF = 0.009) in the PTOV1 gene was found in five patients. Variants in the ATG4D, HSD17B14, PRSS50, and RAB25 genes were observed in four patients each, while SNVs in the C4B, C6orf15, DNAJC28, DXO, and HRC genes were identified in three patients each.

The variants rs45534831 (located in the DXO gene) and rs147316998 (located in the PRSS50 gene) were the only ones identified in a homozygous state, each found in a single patient. Both were classified as likely pathogenic based on the AlphaMissense prediction, reinforcing their possible role in the predisposition to severe COVID-19.

We identified 17 novel variants, with no frequency recorded in the main public databases. Among these, the following three missense variants were classified as likely pathogenic: NM_004381.5:c.1787C>A in the ATF6B gene, NM_000258.3:c.250G>A in the MYL3 gene, and NM_020126.5:c.245A>T in the SPHK2 gene. Furthermore, other variants showed high functional relevance, such as NM_001136.5:c.798_802del in the AGER gene, NM_003024.3:c.2304+1G>A in the ITSN1 gene, and two variants in the SCAF1 gene (NM_021228.3:c.619_620insGC and NM_021228.3:c.1783_1794dup), all showing a pLI > 0.9, which indicates high gene intolerance to LoF mutations.

In total, 17 variants in genes with pLI > 0.9 were identified, including: DPP9, ILF3, AGER, ITSN1, SNRNP70, SCAF1, AP2A1, KAT7, MYH14, GON4L, RXFP4, and LMNA. Among the 15 patients carrying these variants, 7 died (46.7%). Given this frequency, we compared it to the overall cohort mortality rate of 20.5% and found a statistically significant association (p = 0.0084, OR = 4.04, 95% CI = 1.35–12.13), indicating that patients with variants in high-pLI genes had a higher risk of death (Supplementary Table S5).
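The reported odds ratio and confidence interval can be checked with a short calculation. The sketch below uses the standard log-odds (Woolf) interval, which is an assumption because the text does not name the method. It also assumes a two-by-two table in which the 15 carriers (7 deaths) are compared with the remaining 146 patients; the 26 deaths in that group are derived here from the cohort total of 33 and are not stated directly in the text. The function name `odds_ratio_ci` is hypothetical.

```python
import math
from typing import Tuple


def odds_ratio_ci(a: int, b: int, c: int, d: int,
                  z: float = 1.959964) -> Tuple[float, float, float]:
    """Odds ratio with a Woolf (log-odds) confidence interval.

    a, b: exposed with / without the event
    c, d: unexposed with / without the event
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Carriers of variants in high-pLI genes: 7 died, 8 survived (from the text).
# Remaining patients: 33 - 7 = 26 died, 146 - 26 = 120 survived (derived here).
or_, lo, hi = odds_ratio_ci(7, 8, 26, 120)
print(f"OR = {or_:.2f}, 95% CI = {lo:.2f}-{hi:.2f}")
```

Under these assumptions the calculation gives values that round to the reported OR of 4.04 and interval of 1.35 to 12.13, which suggests that the comparison group was the patients without such variants.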

Candidate genetic variants for severe COVID-19 and clinical/laboratory parameters

To investigate the association between candidate genetic variants and clinical/laboratory parameters of severe COVID-19, we stratified the COVID-19-BR cohort into two subgroups: patients carrying prioritized variants (n = 89) and those without these variants (n = 72). Our analysis did not involve testing each of the 104 prioritized variants individually for association with clinical outcomes. Instead, we created a single binary variable indicating whether each patient carried at least one of the 104 previously prioritized rare variants. This variable (“carrier of ≥ 1 prioritized variant: Yes/No”) was then used to assess potential associations with clinical and laboratory features. These groups were compared in terms of demographic, clinical, and laboratory characteristics to identify potential differences that might influence the outcomes.
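The single binary carrier variable described above can be derived from per-patient variant lists and then cross-tabulated against any binary outcome. The following sketch is illustrative only; the patient identifiers, variant labels, and function names (`carrier_flags`, `cross_tab`) are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Set, Tuple


def carrier_flags(patient_variants: Mapping[str, Iterable[str]],
                  prioritized: Set[str]) -> Dict[str, bool]:
    """True for patients carrying at least one prioritized variant."""
    return {pid: any(v in prioritized for v in variants)
            for pid, variants in patient_variants.items()}


def cross_tab(flags: Mapping[str, bool],
              outcome: Mapping[str, bool]) -> Counter:
    """Count patients in each (carrier, outcome) cell of the 2x2 table."""
    return Counter((flags[pid], outcome[pid]) for pid in flags)


# Hypothetical data, for illustration only.
prioritized = {"var_A", "var_B"}
patients = {"P1": ["var_A", "var_X"], "P2": ["var_X"], "P3": [], "P4": ["var_B"]}
ards = {"P1": True, "P2": False, "P3": True, "P4": True}

flags = carrier_flags(patients, prioritized)
table: Dict[Tuple[bool, bool], int] = cross_tab(flags, ards)
print(flags)
print(table[(True, True)], table[(True, False)],
      table[(False, True)], table[(False, False)])
```

Collapsing all prioritized variants into one flag trades resolution for power: it avoids testing 104 very rare variants one by one, at the cost of not attributing an association to any single variant.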

Importantly, there were no significant differences between the groups regarding sex distribution (p = 0.472) or age (p = 0.605), indicating that these factors, known to be potential confounders, were well balanced between the subgroups (Supplementary Table S6). Additionally, to investigate the potential association between candidate genetic variants and patients’ ancestry, we performed a stratified analysis based on global ancestry (EUR, AFR, AMR). Individuals were categorized based on the median of each ancestry group. This analysis did not reveal any significant associations between the identified variants and a specific ancestral group, suggesting that these variants are not strongly influenced by ancestry in our cohort (Supplementary Table S6).

Patients with variants potentially implicated in severe COVID-19 had a significantly higher incidence of acute respiratory distress syndrome (ARDS) compared to those without such variants (40.4% vs. 23.6%, p = 0.027, OR = 2.59, 95% CI: 1.11–6.05). Other clinical and laboratory parameters did not differ significantly between the groups (Table 2).

Table 2 Association between the occurrence of variants potentially implicated in severe COVID-19 and clinical/laboratory parameters.

Discussion

Investigations into host genetic factors involved in COVID-19 are essential for advancing our understanding of the disease’s clinical progression, improving healthcare outcomes, and reducing mortality rates. These studies are expected to play a critical role in genomic and precision medicine, particularly for populations that are underrepresented in global genomic studies and databases. In this study, WGS was performed in 161 young and middle-aged Brazilian adults with life-threatening or fatal COVID-19, aiming to identify rare genetic factors that may explain individual predisposition to disease severity. To our knowledge, this is the first genomic analysis of previously healthy young and middle-aged Latin American patients with severe COVID-19.

Genomic studies focusing on rare variants in Brazilian patients with COVID-19 remain limited. Secolin et al. (2021) reported three rare and four ultra-rare variants in four COVID-19-related genes—SLC6A20, LZTFL1, XCR1, and FURIN—that were exclusively identified in the Brazilian dataset and are predicted to affect protein function30. Other studies involving Brazilian patients and rare variants have predominantly focused on children with SARS-CoV-2–related Multisystem Inflammatory Syndrome (MIS-C)31,32,33.

A recent meta-analysis by the HGI (data release 7) identified several candidate genes that define the main biological pathways (virus entry, mucus defense, and role of interferons) involved in COVID-19 susceptibility and severity9. Their approach has provided useful information to identify specific host pathways and molecules that are important in COVID-19 pathogenesis; however, common variants have low effect sizes and explain only a very small fraction of clinical variability34.

Rare variants play unique roles in the genetics of complex diseases, as they could have a greater impact on gene function and expression as well as greater population specificity35. For this reason, we investigated rare variants in loci whose common variants have already been associated with COVID-19. Despite significant efforts to understand the biological mechanisms underlying COVID-19, the wide clinical variability between individuals remains a fundamental scientific challenge. This variability has direct implications on the identification of high-risk patients, clinical decision-making, and the development of personalized treatments.

A recent study17 investigated the presence of rare genetic variants in 44 patients under the age of 65 from a Spanish cohort with very severe or fatal COVID-19. The authors found variants in genes related to immune response, carbohydrate metabolism, and DNA repair processes; however, most of their patients were of European descent (86%), and the inclusion criteria were not restricted to previously healthy individuals. In contrast, our cohort included, for the first time, previously healthy patients with severe COVID-19 from all regions of Brazil. Our ancestry analyses show high genetic diversity20, which highlights the complexity of the genetic composition of the Brazilian population. It is crucial to study diverse populations in order to identify specific genetic variations and better understand the genetic architecture of severe COVID-19.

The MUC5AC gene stood out as the most recurrent in our cohort, with nine patients presenting ultra-rare variants, including frameshift variants and in-frame deletions. A previous study conducted in the Bulgarian population36 also reported a high prevalence of individuals carrying rare variants in this gene (rs36195734, rs200292517, and rs74811639), particularly among patients with critical COVID-19. In contrast, our study identified other rare variants, including rs1590143470 and five additional novel variants. MUC5AC is one of the primary mucins produced in the airways, playing a crucial role in pathogen defense and being upregulated during respiratory infections37. The variants identified here may compromise the mucosal defense of the airways, thus increasing vulnerability to severe infections.

As previously suggested by other studies, the dysregulation of the type I interferon response plays an important role in COVID-19 severity, particularly in young (< 60 years) patients with severe or critical COVID-19 without comorbidities10,13,36,38. We identified the rs145785282 variant in the IFNA10 gene in seven patients, five of whom were severely affected with ARDS. This SNV was recently reported31 in a Brazilian child with multisystem inflammatory syndrome associated with SARS-CoV-2. Although this variant has not yet been directly implicated in COVID-19 susceptibility or severity, another variant in the same gene, rs28368148, has already been described as a risk factor for critical cases of COVID-19 (OR = 1.56; 95% CI = 1.38 to 1.77; p = 3.7 × 10⁻¹²)39. In our cohort, the rs28368148 variant was identified in five patients. However, as our goal was to identify variants exclusive to the severe COVID-19 group, this variant was excluded from the final list after applying the filtering step, since it was also present in one patient with mild COVID-19.

Seventeen new variants were identified in genes involved in different biological processes in COVID-19, such as mucosal immune response (MUC5AC, MUC21), extracellular matrix disassembly (ATF6B), regulation of molecular functions (TBC1D17, ARHGAP27, SPHK2, ITSN1, MYL3), and mitochondrial efficiency (SCAF1). We also identified a new variant in the C6orf15 gene, a functionally uncharacterized MHC gene, and a new variant in the AGER gene, linked to the requirement for mechanical ventilation and increased mortality40.

We identified candidate variants in genes that showed a high likelihood of intolerance to heterozygous LoF variation (pLI ≥ 0.9), as defined by Lek et al. (2016). This means that a single LoF variant may cause a severe clinical phenotype due to haploinsufficiency in genes such as DPP9, ILF3, AGER, ITSN1, SNRNP70, SCAF1, AP2A1, KAT7, MYH14, GON4L, RXFP4, LMNA, and GFY. Interestingly, a statistical comparison between the mortality rate of patients carrying variants in high-pLI genes and the overall cohort mortality rate showed a statistically significant association (p = 0.0084, OR = 4.04, 95% CI = 1.35–12.13), indicating that patients with variants in highly LoF-intolerant genes had approximately fourfold higher odds of death. In addition to allelic dosage, it is likely that the genetic effects of many rare variants contribute to the overall polygenic effect41 in severe COVID-19. In this context, 52% of patients carried more than one candidate variant.

Our findings revealed that patients carrying variants potentially implicated in the severity of COVID-19 had a significantly higher incidence of ARDS compared to patients not carrying these variants. These results point to a link between rare high-impact genetic factors and the progression to more critical stages of the disease, which is in line with previous studies that associated specific genetic variants with exacerbated immune and inflammatory dysfunction in severe cases17,36,41.

Among the prioritized variants, 20 were classified as likely pathogenic by AlphaMissense, with a score higher than 0.6, indicating a high probability of causing functional disorders. While experimental validation remains the gold standard for confirming pathogenicity, these predictive analyses represent a valuable preliminary step in identifying candidates for further investigation27.

Some limitations should be considered. Although 89 patients (55.3%) had prioritized variants in this study, 72 did not present these variants. This may be attributed to the fact that we evaluated only significant regions identified by the HGI analyses, mainly focused on genes related to viral entry, airway mucosal defense, and response to type I interferon. These results suggest that, in patients not carrying these candidate variants, the severity of COVID-19 may be explained by loci that were not analyzed in this study or by a set of rare genetic variants with small effects that nonetheless contribute to more severe clinical presentations. Additionally, while functional validation is essential for determining the true impact of genetic variants, experimentally assessing every identified variant is not feasible, particularly in studies involving rare variants across multiple genes. Our study follows a widely accepted framework for variant prioritization based on predictive annotations, facilitating the identification of potentially impactful genetic factors associated with severe COVID-19.

Our findings suggest the potential involvement of rare genetic variants in the severity of COVID-19, especially through the host immune response, as indicated by the higher incidence of ARDS, a hallmark of the most severe forms of the disease that involves an exacerbated inflammatory state and increased demand for oxygen support. The novel variants identified in our cohort, in conjunction with the heterogeneity observed among patients, highlight the complexity of the genetic architecture of COVID-19, especially in diverse populations. Rare variants may contribute to increased susceptibility to life-threatening COVID-19; however, further studies are needed to confirm the definitive role of the variants described by this study in disease severity.