Introduction

Alport syndrome (AS) is an inherited basement membrane disease characterised by progressive kidney failure, sensorineural hearing loss and ocular abnormalities1. Estimates of its disease frequency range from a prevalence of one in 5000 people in Utah2 to one in 53,000 live births in Finland3, but the number of people with a predicted genetic risk of disease is even higher4.

AS results from pathogenic variants in COL4A55, COL4A3 or COL4A46. These genes encode the collagen IV α5, α3 and α4 chains respectively, that trimerise to form a triple helical structure and chickenwire network typical of basement membranes7,8. X-Linked AS is the commonest form that causes kidney failure9. Males are more severely affected than females and 70% have kidney failure by the age of 3010. Females are affected twice as often as males but normally have a milder and more variable phenotype11 due in part to non-random X- chromosome inactivation12.

Autosomal recessive AS is less common, and results from two pathogenic variants affecting the COL4A3 or COL4A4 genes6. These may be homozygous6, or compound heterozygous variants in trans13. Digenic variants result most often from one pathogenic variant in COL4A3 and one in COL4A414. Pathogenic heterozygous variants in COL4A3 or COL4A4 result in autosomal dominant AS (also known as ‘thin basement membrane nephropathy’), with haematuria15, and late-onset kidney failure in up to 25% of affected individuals in hospital-based series16,17,18.

Each collagen IV α chain comprises an intermediate collagenous domain of Gly Xaa Yaa triplet repeats19,20,21. However, the collagen IV α chains differ from most other collagen types in that the collagenous domain includes 21–26 short non-collagenous interruptions21. These provide flexibility to an otherwise rigid molecule, and may include important ligand binding sites22. Collagen IV chains also differ in that the mature chains retain their non-collagenous amino and carboxyl termini, which interact with the termini of neighbouring trimers to create the network8. The carboxyl terminus is also the site of chain recognition where trimerisation begins, proceeding in a zipper-like manner towards the amino terminus23.

Previous genotype–phenotype correlations in males with pathogenic COL4A5 variants have demonstrated that truncating variants and large deletions lead to the most severe phenotype, with the youngest age at kidney failure10,24. The corresponding mRNA is degraded by the podocytes, the and there is no collagen IV α5 chain incorporated into the trimer or staining in a kidney biopsy25. Missense variants usually result in milder disease10,24 and the abnormal chain may be incorporated into mature trimers that are secreted into the basement membrane25 in reduced amounts26. The abnormal chains are also often retained within the podocytes, activating the unfolded protein response and increasing endoplasmic reticulum stress27,28. Similar studies in females with heterozygous COL4A5 variants have not found such a clear genotype–phenotype correlation with age at kidney failure11,12, but a recent study suggested that females with missense variants were less likely to develop proteinuria and had better kidney function than those with other variant types29.

In autosomal recessive AS, individuals with at least one truncating variant in COL4A3 or COL4A4 are more likely to progress to kidney failure before the age of 30 years than those with non-truncating variants30. Disease progression also correlates with the number of missense variants, where individuals with at least one missense variant have a delayed onset of kidney failure and hearing loss compared with those with none31. In individuals with autosomal dominant AS, heterozygous truncating variants in COL4A3 or COL4A4 are associated with an earlier age at kidney failure than those with missense variants16.

Missense variants affecting Gly residues in the collagenous Gly Xaa Yaa repeats are the commonest pathogenic type32. These residues are critical to the structure since Gly is the only amino acid small enough to fit within the core of triple helix and allow close packing of the chains32,33. Substitution with any other amino acid may destabilise the trimer, interfere with triple helix propagation, and cause disease34.

The clinical phenotype associated with pathogenic Gly missense variants is highly variable. Gly substitution with a bulky or charged amino acid usually leads to more severe clinical features35, but contrary examples also exist36. In other collagen types, Gly substitutions in the amino exons result in a milder phenotype37, but evidence for this in collagen IV is conflicting24,38. Location adjacent to a non-collagenous interruption has also been associated with a milder phenotype39, but again this is variable35.

The aim of this study was to better understand the molecular features of Gly substitutions in the COL4A3–COL4A5 genes that affect disease severity.

Methods

Variant databases

Three variant databases were examined. The Leiden Open Variation Database (LOVD) is an open source database of genomic variants with associated phenotypes (https://www.lovd.nl)40. It includes variants published in the literature in addition to those submitted directly by laboratories, and has recently been updated to include a total of 3869 (including 2988 pathogenic) COL4A5 variants. Pathogenicity was assessed by the submitting laboratory, or where none was provided, using the VarSome scores (https://varsome.com) based on the American College of Medical Genetics and Genomics and Association for Molecular Pathology (ACMG/AMP) criteria. Varsome scores automatically include previous ClinVar or other published assessments (PP5). Many variants also included clinical data such as gender, age at kidney failure, hearing loss and ocular abnormalities. This database was used to determine whether molecular features of pathogenic COL4A5 variants that resulted in Gly substitutions affected age at kidney failure or hearing loss diagnosis using survival analysis.

The Genomics England 100,000 Genomes Project (100kGP) is a database comprising genomic and clinical data from individuals and families with various diseases, including familial haematuria and other inherited kidney disease (https://www.genomicsengland.co.uk; version 10 data release)41. This was used to determine whether molecular features of heterozygous pathogenic COL4A3 and COL4A4 variants that resulted in Gly substitutions were associated with haematuria.

The Genome Aggregation Database (gnomAD) comprises exomes and genomes from individuals recruited as part of various disease-specific and population genetic studies (gnomAD version 2.1.1; https://gnomad.broadinstitute.org; accessed 11 September 2021)42. These primarily include participants and controls from studies of cardiovascular disease, diabetes or psychiatric disorders, who have not been selected for kidney disease but rather to represent a cross-section of the population. This database was used to determine whether milder molecular characteristics of COL4A5 Gly substitutions were increased the general population.

The individuals whose variants and other deidentified information were included in these databases had provided informed consent at the time of recruitment under the supervision of the corresponding institutional review boards.

Reference sequences

COL4A5 variants were described using the collagen IV α5 chain isoform 2 reference sequence comprising 53 exons (NM_033380.3). COL4A3 and COL4A4 variants were described using the reference sequences for the collagen IV α3 (NM_000091.5) and α4 (NM_000092.5) chains respectively.

Predicted splicing changes

Previous studies have demonstrated that exonic nucleotide substitutions affecting Gly codons in COL4A5 sometimes result in abnormal splicing43,44, and recent evidence suggests that this is common when the affected base is the final nucleotide of an exon45. To ensure that unknown splicing variants were not unintentionally included in this study, all variants occurring within 3 bases of a splice site were analysed using MaxEntScan to determine whether they were likely to affect normal splicing46. MaxEntScan was chosen since it is freely available and was able to correctly identify 6 known exonic splice changes previously reported for COL4A5 (data not shown). Both mutant and wild type sequences were scored using the maximum entropy model. Variants were considered to affect splicing where the mutant score was more than 15% lower than the wild type score47. All variants predicted to affect normal splicing were excluded to ensure that any phenotypic effect attributed to a variant was due solely to a missense change, rather than an unreported splicing change (Supplemental Table 1).

Molecular characteristics

Three molecular features were examined for each Gly substitution. These were the molecular location of the variant (exons 1 to 20, or exons 21 to carboxyl terminus), whether the variant was adjacent to a non-collagenous interruption or terminus (also called ‘non-collagenous boundary’ variants) (Supplemental Table 2), and the degree of instability caused by the residue replacing Gly (Ala, Ser, Cys were considered mildly destabilising; Arg, Val, Glu, Asp, Trp were considered highly destabilising)34.

In addition a subgroup of variants was examined separately to determine the effect of a variant’s relative location with respect to its local collagenous region (a single uninterrupted stretch of Gly XY repeats flanked by two non-collagenous interruptions/termini). This subgroup excluded all non-collagenous boundary variants to ensure that the significantly different phenotypes usually observed for these variants did not obscure any small effect size. The remaining variants were considered in three groups: variants affecting the 2 Gly residues at the amino end of a local collagenous region (not counting the boundary Gly), variants affecting the 2 Gly residues at the carboxyl end of a local collagenous regions, and all other variants falling between these two ends (Fig. 1).

Figure 1
figure 1

Terminology used in this study to describe the position of Gly residues with respect to their local collagenous region. One single ‘local’ collagenous region is shown, bounded by non-collagenous interruptions at either end. The boxes could also represent either of the two non-collagenous termini. The non-collagenous boundary Gly residues are a special case, so were not included in this subgroup. The larger arrow indicates the direction of trimerisation, which is initiated at the carboxyl terminus and then proceeds in the amino direction. NC, non-collagenous.

COL4A5

Kidney failure

COL4A5 variants reported in LOVD were filtered to include those with entries including any of the following keywords: ‘renal’, ‘failure’, ‘ESRD’, ‘ESRF’, ‘ESKD’, ‘ESKF’, ‘transplant’, ‘dialysis’ (Supplemental Fig. 1a). Only variants affecting a Gly residue in a collagenous sequence (GlyXaaYaa) and reported as ‘Pathogenic’ or ‘Likely Pathogenic’ were included. Predicted splicing variants were excluded as described. Each entry was then examined manually to determine the age and kidney failure status of male participants. Age at kidney failure was defined as the age at diagnosis of kidney failure, or where this was not reported, the age at commencement of dialysis, or at first kidney transplant. Unclear or ambiguous entries in LOVD were resolved by referring to the original manuscripts.

Each family was included once only. Where multiple affected males were reported in the same family, the mean age at kidney failure was used (or median age at kidney failure, if this was the only value available). Where only a range of ages for kidney failure was reported, the midpoint of this range was used. Where a male had not yet progressed to kidney failure, the age of the male at the most recent report was used as a censored data point. Families with only affected females, or participants who had multiple variants in the COL4A3-COL4A5 genes were excluded.

Hearing loss

COL4A5 variants reported in LOVD were filtered to include those which had entries with the following keywords: ‘hearing’, ‘hypoacusia’, ‘deaf’, ‘sensorineural’, or ‘audio’ (Supplemental Fig. 1b). Onset of hearing loss is often poorly recognised and probably occurs much earlier than reported, so here the age at diagnosis of hearing loss was used as a measure of hearing loss severity. Ages were extracted as for kidney failure, using similar inclusion and exclusion criteria.

COL4A3/COL4A4

Haematuria

Individuals with and without haematuria were identified from the 100kGP database. The haematuria cohort included unrelated individuals with any haematuria-related terms (HP:0000790, hematuria; HP:0002907, microscopic hematuria; HP:0012587, macroscopic hematuria), excluding those with a diagnosis of kidney or bladder cancer, or other less common causes (n = 2221). The ancestry-matched control cohort included all individuals who did not have any documented haematuria in their medical records (n = 37,200). Individuals were then filtered to include only those reported to have a heterozygous Gly missense variant in COL4A3 or COL4A4. Variants near splice sites were assessed for predicted splice changes as described.

All variants included were assessed for the same molecular characteristics as described above, and a logistic regression model used to identify which features were associated with a difference in risk for haematuria. All variants not adjacent to a non-collagenous region were also examined separately to determine any associations with local collagenous location.

Prevalence of COL4A5 non-collagenous boundary variants in normals

To determine whether non-collagenous boundary variants were increased in the general population, the collagenous location and prevalence of all Gly substitutions in gnomAD were investigated. Predicted splicing variants were excluded.

The expected proportion of variants affecting non-collagenous boundary Gly variants was calculated using a neighbour-dependent substitution rate model48. This model was chosen since it takes into account the higher rates of substitution seen for transitions than transversions, as well as the effect of the neighbouring nucleotides. This model has also been used previously in the context of Gly substitutions in collagen molecules34. The expected proportion of variants affecting non-collagenous boundary Gly residues was compared with the proportion observed in gnomAD. The population prevalence of individuals with these variants was then calculated based on the allele frequencies reported in gnomAD, correcting for numbers reported in homozygous females.

Statistical analysis

All statistical analyses were performed using R (version 3.6.2). Survival analysis was performed using the survival package49,50. Separate survival curves were produced for each molecular feature using the Kaplan–Meier method, and compared using the log-rank test. The individual contribution of each covariate was then analysed using a Cox proportional hazards model. The overall significance of this model was assessed using the likelihood ratio test.

Logistic regression models were used to examine the associations between the molecular features and haematuria. The proportion of variance in haematuria explained by the model was assessed using McFadden’s pseudo-R2 and calculated using the blorr package51.

Expected and observed proportions of variants in gnomAD were compared with the exact binomial test. For all analyses, a p-value less than 0.05 was considered significant. Figures were produced using the survminer52 and forestplot53 packages.

Results

COL4A5

Kidney failure

One hundred and fifty-seven families were studied, including 129 with at least one male with kidney failure (Table 1a, Fig. 2a–c). Overall, the median age at kidney failure was 26 years (95% CI 25–28). Age at kidney failure did not differ for variants located within the first 20 exons compared with those in exons 21 to 53 (p = 0.41). Substitution of a non-collagenous boundary Gly residue delayed median time to kidney failure by 19 years compared with those not adjacent to a non-collagenous region (p = 0.014), while substitution with a highly destabilising residue shortened median time to kidney failure by 7 years compared with mildly destabilising residues (p = 0.002). Substitution of a non-collagenous boundary Gly residue was independently associated with a decreased risk of kidney failure (p = 0.025), while substitution with a highly destabilising residue was independently associated with an increased risk (p = 0.003) (Fig. 3a).

Table 1 Median age at kidney failure of COL4A5 Gly missense variants reported in LOVD for each molecular characteristic.
Figure 2
figure 2

Proportion of cases without kidney failure for COL4A5 Gly missense variants reported in LOVD. Variants were stratified by (a) molecular location (p = 0.41), (b) collagenous location (p = 0.014) and (c) substituting residue (p = 0.002). (d) Excluding variants affecting NC boundary residues, variants were further stratified by relative location within their local collagenous region (p = 0.14). Censored data points are not shown. NC, non-collagenous.

Figure 3
figure 3

Cox proportional hazards model of kidney failure risk for COL4A5 Gly missense variants reported in LOVD. Analysis was performed for (a) all Gly missense variants (overall significance of model, p < 0.001) and (b) excluding NC boundary Gly variants (overall significance of model, p = 0.014). HR (95% CI), Hazard ratio (95% confidence interval); NC, non-collagenous; ref, Reference group. 

Considering only the subgroup excluding the non-collagenous boundary variants, location within a local collagenous region did not affect the median time to kidney failure (p = 0.14) (Table 1b, Fig. 2d). However, substitution of a Gly residue at the carboxyl end of a local collagenous region was independently associated with an increased risk of kidney failure compared with substitutions in the central region (p = 0.031) (Fig. 3b). Substitution with a highly destabilising residue remained significantly associated with an increased risk of kidney failure for this subgroup (p = 0.004).

Hearing loss

Eighty families were studied, including 42 with at least one report of hearing loss in a male (Table 2a, Fig. 4a–c). Median age at hearing loss diagnosis did not differ for variants located within the first 20 exons compared with those located in exons 21 to 53 (p = 0.85). Unlike with kidney failure, median age at hearing loss diagnosis did not differ between variants affecting non-collagenous boundary Gly residues and variants not adjacent to a non-collagenous region (p = 0.38). However, the sample size for the non-collagenous boundary variants was small (n = 8), and only 2 of these eight families had a report of hearing loss at the time of the study. Substitution with a highly destabilising residue shortened median time to diagnosis of hearing loss by 21 years compared with mildly destabilising residues (p = 0.004), and this was the only molecular feature independently associated with an increased risk of hearing loss (p < 0.001) (Fig. 5a).

Table 2 Median age at hearing loss diagnosis of COL4A5 Gly missense variants reported in LOVD for each molecular characteristic.
Figure 4
figure 4

Proportion of cases without a hearing loss diagnosis for COL4A5 Gly missense variants reported in LOVD. Variants were stratified by (a) molecular location (p = 0.85), (b) collagenous location (p = 0.38) and (c) substituting residue (p = 0.004). (d) Excluding variants affecting NC boundary residues, variants were further stratified by relative location within their local collagenous region (p = 0.77). Censored data points are not shown. NC, non-collagenous.

Figure 5
figure 5

Cox proportional hazards model of hearing loss diagnosis risk for COL4A5 Gly missense variants reported in LOVD. Analysis was performed for (a) all Gly missense variants (overall significance of model, p = 0.002) and (b) excluding NC boundary Gly variants (overall significance of model, p = 0.004). HR (95% CI), Hazard ratio (95% confidence interval); NC, non-collagenous; ref, reference group.

Excluding all non-collagenous boundary variants, location within a local collagenous region did not affect the median time to hearing loss diagnosis (p = 0.77) (Table 2b, Fig. 4d). Risk of hearing loss also did not differ for substitutions at the amino (p = 0.18) or carboxyl ends (p = 0.96) (Fig. 5b). Substitution with a highly destabilising residue remained significantly associated with an increased risk of hearing loss for this subgroup (p < 0.001).

COL4A3/COL4A4

Haematuria

This cohort comprised 304 individuals from the 100kGP, including 48 with documented haematuria (Table 3). In total, 153 unique heterozygous COL4A3 and COL4A4 Gly missense variants were studied, and most (n = 105) were found once only.

Table 3 Haematuria distribution and COL4A3/COL4A4 Gly missense variant features of individuals reported in the 100kGP database.

Location in exons 1–20 was not associated with a difference in risk for haematuria compared with exons 21–53 (p = 0.51). Substitution of a non-collagenous boundary Gly residue was associated with a lower risk for haematuria (p = 0.046), while substitution with a highly destabilising residue was associated with a higher risk (p = 0.018) (Table 4a). However, these features only explained a small proportion of the total variance in haematuria risk (pseudo-R2McFadden = 0.040). Excluding all variants affecting a non-collagenous boundary residue, those affecting the amino or carboxyl ends of a local collagenous region were not associated with a difference in risk for haematuria compared with centrally located variants (p = 0.23, p = 0.20 respectively) (Table 4b). Substitution with a highly destabilising residue did not remain significantly associated with a higher risk for haematuria in this subgroup (p = 0.09).

Table 4 Logistic regression model of molecular characteristics of COL4A3 and COL4A4 Gly missense variants associated with haematuria in the 100kGP database.

Two COL4A3 variants were reported significantly more often in this cohort than any other variant in either gene (Supplemental Fig. 2). These were Gly695Arg (n = 21) and Gly1277Ser (n = 30). Together these accounted for 51/304 (16.8%) of all variants. To ensure that these two variants did not overly influence the results obtained, the logistic regression model was re-evaluated excluding both variants (Supplemental Table 3). Substitution of a non-collagenous boundary residue remained significantly associated with a lower risk for haematuria (p = 0.031), but substitution with a highly destabilising residue now fell just outside the nominal significance level (p = 0.064). The total variance in haematuria explained by the predictor variables remained low (pseudo-R2McFadden = 0.043). Interestingly, in the subgroup excluding the non-collagenous boundary variants, location at the carboxyl end of a local collagenous region was now associated with a higher risk for haematuria (p = 0.041).

Prevalence of COL4A5 non-collagenous boundary variants in normals

Forty-five unique COL4A5 Gly missense variants were reported in gnomAD. The proportion of these variants affecting a non-collagenous boundary residue (15/45 = 33.3%) was higher than the expected frequency of 10.1% (p < 0.001). Of interest, the 5 most commonly reported Gly missense variants all affected a non-collagenous boundary residue (Table 5). These 5 variants predominated in single ancestral groups.

Table 5 The five most frequently found COL4A5 Gly missense variants reported in gnomAD.

Assuming equal numbers of males and females in gnomAD, missense variants affecting non-collagenous boundary residues were present in 0.6% of the general population (1 in 179 individuals). The majority of these cases were due to a single variant, Gly953Val, which is highly prevalent in East Asian and South Asian populations and currently considered benign54. Excluding this variant, missense variants affecting non-collagenous boundary Gly residues were present in 0.05% of the population (1 in 2078 individuals). In contrast, missense variants affecting all other collagenous Gly residues were only present in 0.03% of the general population (1 in 3000 individuals) despite there being nine times as many of these Gly residues in the α5 chain.

Discussion

Previous studies have demonstrated that the clinical severity of X-linked AS in males is closely associated with truncating, large deletion, splice site, and missense variant types in COL4A510,24. This study found that missense variants affecting collagenous Gly residues can be further classified by molecular features that also correlate with severity.

Gly substitutions with highly destabilising residues were associated with an earlier median age at kidney failure. This is consistent with previous studies in other collagen chain genes such as COL1A1 and COL1A2, where Gly substitutions with Arg, Val, Glu or Asp were more likely to result in a lethal phenotype of osteogenesis imperfecta37. Similar observations have been seen for COL4A5 where substitutions with bulkier amino acids lead to an earlier age at kidney failure35. Our results are also consistent with the underrepresentation of substitutions with Ala and Ser in pathogenic databases34,55, which suggests that they are associated with milder and possibly undiagnosed disease.

Substitutions affecting non-collagenous boundary Gly residues resulted in a delayed age at kidney failure. In general, the non-collagenous interruptions within the collagen type IV chains contribute flexibility and non-pathogenic variants are common in these regions56. Gly residues at the boundary of these interruptions may inherit some of this flexibility, which could account for variants’ milder phenotypes. Interestingly, almost all (13/15 = 87%) non-collagenous boundary variants occurred on the amino side of an interruption. Trimerisation occurs in the carboxyl to amino direction, so the side of an interruption that a variant occurs on may affect severity.

Conclusions of the effect of a variant’s molecular location on phenotype severity have been conflicting. Some studies have found that Gly missense variants at the amino end of the collagen IV α5 chain give rise to a milder disease phenotype38, while others have found no such relationship24. Our study has demonstrated that molecular location was not associated with a difference in age at kidney failure. These results also contradict studies in other collagen genes such as COL1A1, that have demonstrated that Gly missense variants occurring at the amino end are more likely to be non-lethal37. However, these other collagen types differ from collagen type IV in a number of structural features such as the presence of interruptions and retention of non-collagenous termini, so that a direct comparison may not be appropriate.

In order to deal with the uncertainty associated with the non-collagenous interruptions and the effect of molecular location, each of the 23 local collagenous Gly XY regions was considered as an individual domain with its own amino and carboxyl ends. Surprisingly, analogous to observations in COL1A1, variants at the carboxyl end of their local collagenous region were associated with an increased risk of kidney failure. Considering that variants affecting non-collagenous boundary residues have a milder effect than most other Gly variants, the variants affecting the next Gly along were also expected to be milder. This difference may be due to the trimer assembly of the three collagen IV α-chains beginning at the carboxyl end of each chain. Boundary variants on the amino side of an interruption may not be as destructive since they only expand the flexible interruption by one or two residues. However, since the trimers are assembled in the carboxyl to amino direction, a substitution of one or two Gly residues further along may affect the next nucleation-zippering event for trimerization, and result in a more severe phenotype. To our knowledge, this is the first report of such an observation. A previous study of the effect of the distance of a variant to its nearest interruption on age at kidney failure did not demonstrate any relationship35.

Substitution with a highly destabilising residue was the only molecular feature associated with an earlier age at diagnosis of hearing loss. The median ages reported here do not predict the age at hearing loss onset, but rather the age at diagnosis. In severe disease, hearing loss onset generally occurs in the first decade, but is often unrecognised and underreported. Affected boys often only undergo audiometry after kidney disease is detected, and usually only when the hearing loss is obvious rather than as a screening test. Nonetheless, these results provide a proof of concept that substitution with a highly destabilising residue has a negative effect on hearing loss phenotype. Variants affecting non-collagenous boundary residues would be expected to also result in a milder hearing loss, but the sample size of this study precluded the demonstration of any differences.

In COL4A3 and COL4A4, substitution of a non-collagenous boundary Gly residue was associated with a lower risk of haematuria, while substitution with a more destabilising residue was associated with a higher risk. This is consistent with our findings in COL4A5, where these features were associated with later and earlier ages at kidney failure respectively. However, the proportion of variance in haematuria explained by these features alone was low, and it is likely that other genetic and environmental factors also contributed to haematuria risk. In addition, the control group used here were individuals where haematuria was not formally noted in their medical records. They were not necessarily individuals with a negative urinalysis, and undiagnosed haematuria in this group was probably higher than reported.

A higher proportion of variants affecting non-collagenous boundary Gly residues was observed in gnomAD than expected. The higher frequency of these variants in the general population supports our conclusion that missense variants involving boundary residues are likely to be much milder. Some may even be benign54. Additionally, all of the 5 most frequently reported variants affected a boundary residue. One of these (Gly624Asp) has been demonstrated to have originated in Central/East Europe due to a founder effect 750–900 years ago57, and other variants have probably also arisen in similar circumstances. We have demonstrated previously that the number of people with a genetic risk for AS in the general population is likely to be higher than currently recognised4, and this study’s results suggest that variants affecting non-collagenous boundary residues are major contributors to this largely undiagnosed population.

In this study we have stratified pathogenic COL4A5 Gly missense variants causing AS into clinically relevant subgroups. We identified molecular features of these variants which are likely to contribute to the clinical severity and disease progression in affected individuals, and provide estimates for the age at kidney failure for each subgroup. However, even within these groups much phenotypic variation still exists, and other genetic factors such as whether a variant affects a ligand binding site58 or is located near a proline-rich sequence59 may also be important. Environmental factors such as blood pressure control and obesity may also affect the clinical course60. Accurate predictions of disease severity and rate of progression are helpful for patients, their clinicians and genetic counsellors, and genetic and clinical data should be considered together when managing patients with AS.