Introduction

Precision medicine, through the use of genomics, offers a golden opportunity to improve the overall health of the population.1 One key element of precision medicine is the practice of preventative health, which can be facilitated by identifying individuals who are at risk of developing genetically linked diseases. An extension of this theme is carrier screening, the practice of identifying individuals at risk of bearing children with genetic diseases. Since the initiation of screening for cystic fibrosis in Caucasian and Ashkenazi Jewish populations,2 carrier screening, especially for ethnicity-specific variants, has grown to include screening for haemoglobinopathies in individuals of South East Asian, Mediterranean and African ancestry, Tay-Sachs disease in Ashkenazi Jewish populations and spinal muscular atrophy in all ethnic groups.3 The availability of genomic datasets from a large, relatively healthy population, such as the Exome Aggregation Consortium (ExAC),4 offers an opportunity to expand carrier screening beyond these few diseases by estimating the carrier frequencies of a larger number of disorders. For example, an analysis of 23,453 individuals for 108 genetic disorders revealed that 24% were carriers for at least one disorder and 5% were carriers for multiple disorders.5 Similarly, all 100 participants in the MedSeq project were carriers for a median of 2 variants (range 1 to 7) associated with a recessive condition.6

However, the lack of genomic data from non-European populations, which account for less than 20% of available genomic data, is a major hurdle to advancing precision medicine worldwide.4,7 Genomic data form the backbone upon which precision medicine is practiced, and the paucity of data from under-represented populations can lead to misdiagnoses and mismanagement.8 To contribute to the expansion of genomic diversity, a collaboration between Singapore Health Services (SingHealth) and its affiliated academic institute, Duke-NUS, established the SingHealth Duke-NUS Institute of Precision Medicine (PRISM) and aggregated genomic data from “healthy” individuals from South East Asia into a dataset known as the Singapore Exome Consortium (SEC).

Inherited disorders, some of which are treatable, affect 2–3% of all live births worldwide and result in a social and financial burden on the family and society.9,10 Amongst these, 81 have been associated with intellectual disability (ID) of variable severity and onset,10 and in ~20% of the patients, early diagnosis and treatment lead to significant improvement of IQ and related developmental scores. Appropriate medical intervention has also been shown to improve neurological and systemic clinical manifestations as well as behavioural and psychiatric disturbances. As such, we refer to this group of 81 disorders as treatable inherited disorders (TIDs). Examples of these TIDs (and their associated treatments) include phenylketonuria (dietary phenylalanine restriction +/− tetrahydrobiopterin (BH4) supplement), Wilson disease (penicillamine) and biotinidase deficiency (biotin supplements). Individually these TIDs are rare. Their collective burden is estimated to be significant from a health care perspective, but its extent has not been studied previously. Moreover, although 60% of TIDs can be picked up on screening blood and urine tests, either in the newborn period or at the time of presentation, some of these conditions, such as citrin deficiency or late onset ornithine transcarbamoylase deficiency, may present with non-specific symptoms and remain undetected at the time of birth. Symptom-specific conditions, such as arginine:glycine amidinotransferase (AGAT) deficiency or Niemann-Pick disease type C, require a ‘single test per single disease’ approach and therefore may not be diagnosed in a timely fashion, leading to delays in treatment.11

Given the genetic differences between populations of different ancestral backgrounds, the prevalence of certain disorders may vary by region or geography. We therefore aimed to review our SEC cohort for the carrier frequency of TIDs and to identify specific diseases that may be at uniquely higher prevalence in our population, with the purpose of guiding public health policies for carrier and newborn screening using genomic testing.

Materials and methods

Participant recruitment

Participants for this research were drawn from the cohorts of pre-existing studies. These studies were genomics projects approved by institutional ethics review boards. Informed consent was obtained from each eligible individual (or from a parent/legal guardian for minors). As part of the research protocol, the genomic data of these individuals were de-identified and analysed in aggregate.

Genomic sequencing

Genomic sequencing, either exome or genome sequencing, was performed on leucocyte-derived DNA from consented individuals. Sequencing was performed as per the manufacturer’s protocols on Illumina sequencers (HiSeq 2000, HiSeq 2500 or HiSeq X).

Gene panel

A pre-determined list of genes associated with 81 TIDs as defined by van Karnebeek et al.10 was analysed. This gene list contains 71 autosomal recessive conditions, six X-linked conditions, one autosomal dominant disorder, one mitochondrial disorder, one condition described with both X-linked and autosomal recessive inheritance and one condition described with both autosomal recessive and dominant inheritance. As our bioinformatic pipeline only targeted nuclear coded genes, we excluded genes associated with mitochondrial inheritance and analysed variants in 104 genes associated with 80 disorders as listed in Table S2. All of the listed disorders have medical interventions, which for the majority of the disorders, are affordable, non-invasive and safe.10 Furthermore, although the majority of interventions are based on single case reports or expert opinion without critical appraisal, they are initiated as standard of care in routine clinical practice once a diagnosis is made.10

Bioinformatic analysis

Genomic data, in FASTQ format, were processed through an established bioinformatic pipeline to generate genomic variant call format (gVCF) files. The gVCF files were then combined to create the Singapore Exome Consortium (SEC). Variants were quality filtered to exclude false positives according to standard thresholds (quality scores > 30, coverage > 10 ×, and absence of clustered variants within a window size of 10 variants). From variants that passed these thresholds, we extracted variants in each of the genes in our gene list. We then annotated the variants using ANNOVAR12 to include information regarding the gene, genetic change, protein change, type of pathogenic variant (PV) (frameshift, nonsense, nonsynonymous, splicing, and synonymous), prediction of the variant effect from multiple algorithms (Polyphen-2, SIFT, likelihood ratio test and MutationTaster2), allele frequencies in different databases (Exome Sequencing Project, dbSNP, 1000 Genomes, Complete Genomics, ExAC, and our in-house database of common variants (present in >5% of the population)), and annotation of variants in the clinical mutation databases ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/) and Human Gene Mutation Database (HGMD, http://www.hgmd.cf.ac.uk).
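The two numeric thresholds above can be sketched as a simple per-variant filter. This is a minimal illustration only, not the pipeline used in the study: the `Variant` record and its field names are hypothetical, and the clustered-variant criterion is omitted because the text does not specify it in enough detail to reproduce.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; both comparisons are strict ("greater than").
MIN_QUAL = 30
MIN_DEPTH = 10


@dataclass
class Variant:
    """Hypothetical minimal record; real gVCF records carry many more fields."""
    chrom: str
    pos: int
    qual: float  # variant quality score
    depth: int   # read depth (coverage) at the site


def passes_quality(v: Variant) -> bool:
    """Keep a call only if quality > 30 and coverage > 10x."""
    return v.qual > MIN_QUAL and v.depth > MIN_DEPTH


# Made-up example calls, for illustration only.
calls = [
    Variant("7", 100, 45.0, 22),   # passes both thresholds
    Variant("7", 200, 30.0, 50),   # quality is not strictly above 30
    Variant("13", 300, 99.0, 10),  # coverage is not strictly above 10x
]
kept = [v for v in calls if passes_quality(v)]
```

Only the first of the three example calls survives, since the other two sit exactly on a threshold rather than above it.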

Filtering and classification of variants

Variants occurring in one of the genes from our gene list and identified as either (a) “pathogenic”, “probably pathogenic”, or “untested” in ClinVar, or (b) novel (absent or rare in public databases) protein-truncating PV (insertions/deletions (indels), stopgain, stoploss or disruption of an essential splice site) were selected for further analysis. As the evidence for pathogenicity of some of the variants listed in the clinical mutation databases as pathogenic may be lacking,13,14 we reviewed the primary literature regarding each of the filtered variants and reclassified them as per the ACMG guidelines on variant classification.13 A literature search for each variant was manually conducted using PubMed, disease-specific mutation databases (Table S3) and an in-house mutation database, when available. Only variants that were classified as pathogenic were considered for the final analysis.
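The two selection criteria can be expressed as a short predicate. This is a sketch under stated assumptions rather than the authors' implementation: the dictionary keys, the effect labels and the 1% cut-off used here for "rare" are all hypothetical choices made for illustration. Manual ACMG reclassification, which follows this step, is not modelled.

```python
# ClinVar labels named in the text as qualifying a variant for review.
CLINVAR_LABELS = {"pathogenic", "probably pathogenic", "untested"}

# Hypothetical effect labels for the protein-truncating classes in the text.
TRUNCATING_EFFECTS = {"indel", "stopgain", "stoploss", "essential_splice"}

# Assumption: "rare" is taken as a public allele frequency below 1%.
RARE_AF = 0.01


def selected_for_curation(variant: dict, gene_list: set) -> bool:
    """Return True if a variant meets criterion (a) or (b) described above."""
    if variant["gene"] not in gene_list:
        return False
    # (a) flagged in ClinVar with one of the listed labels
    if variant.get("clinvar") in CLINVAR_LABELS:
        return True
    # (b) novel (absent or rare in public databases) and protein truncating
    public_af = variant.get("public_af")  # None means absent from databases
    novel = public_af is None or public_af < RARE_AF
    return novel and variant.get("effect") in TRUNCATING_EFFECTS
```

Variants passing this predicate would still need literature review before being counted, as the text describes.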

Quality control

The aligned sequence read of each exome sample was reviewed to ensure that there was adequate coverage for each of the 104 genes (Table S2). Insufficiently covered samples were excluded when calculating the allele frequency of any particular variant.
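Excluding insufficiently covered samples changes the denominator of the allele frequency, which can be sketched as follows. This is an illustration under assumptions: the function and argument names are hypothetical, the per-sample coverage decision is taken as given, and two alleles per sample are assumed (autosomal loci).

```python
from typing import Dict, Optional


def allele_frequency(alt_counts: Dict[str, int],
                     covered: Dict[str, bool]) -> Optional[float]:
    """Allele frequency over adequately covered samples only.

    alt_counts maps sample ID to the number of alternate alleles (0, 1 or 2).
    covered maps sample ID to whether the gene was adequately covered.
    Assumes a diploid autosomal locus, so each sample contributes 2 alleles.
    """
    usable = [s for s in alt_counts if covered.get(s, False)]
    if not usable:
        return None  # no sample had adequate coverage
    alt_alleles = sum(alt_counts[s] for s in usable)
    return alt_alleles / (2 * len(usable))
```

For example, with four samples of which one heterozygote is poorly covered, `allele_frequency({"a": 1, "b": 0, "c": 0, "d": 1}, {"a": True, "b": True, "c": True, "d": False})` counts one alternate allele among six, not two among eight.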

Results

Characteristics of the participants

The study participants comprised 831 individuals with no known pre-existing health conditions or intellectual disability. The participants were aggregated from three existing cohorts: KK Women’s and Children’s Hospital (KKH), Singapore Eye Research Institute (SERI) and National Heart Centre Singapore (NHCS) (Table 1). The average age of the cohort was 50.8 years (S.D. 13.67 years, range 7 to 84 years). The majority of the SEC cohort was of Chinese ethnicity (91%).

Table 1 Summary of participants (n = 831)

Characteristics of the genomic data and variants

Focusing on protein coding regions in the 104 genes associated with TIDs, a mean of 3676 single-nucleotide variants (SNVs) and 379 indel variants were detected per sample. Collectively, 4343 unique variants occurring at an allele frequency of less than 1% were detected amongst the 831 individuals. Upon filtering, we identified 90 variants for further curation - 35 were reported as “pathogenic” or “probably pathogenic” or “untested” in ClinVar (as reported on 19 May 2017) and 55 variants were novel protein-truncating variants (29 indels, 15 splicing alteration variants, 10 stop gain and 1 stop loss) (Fig. 1). We then reviewed the primary literature and re-classified these 90 variants as per ACMG variant classification guidelines - 20 as pathogenic (Table 2) and 25 as likely pathogenic (Table S4). The remaining 45 variants were classified as either variants of unknown significance (VUS, n = 39) or likely benign (n = 6) (Table S4). Some of the variants were found in more than one individual, and collectively, the 45 pathogenic and likely pathogenic variants were observed in 71 individuals.

Fig. 1 Filtering of variants

Table 2 Summary of pathogenic variants detected

Carrier frequency detection

While we detected 45 pathogenic/likely pathogenic variants, to determine the carrier frequency we took a conservative approach and included only pathogenic variants (excluding likely pathogenic variants) as disease-causing alleles. The 20 pathogenic variants were detected in 46 unrelated individuals (6% or 1 in 18) in the SEC cohort. These variants were associated with 13 of the 80 treatable forms of intellectual disability, as listed in Table 2. All of these conditions are known to follow an autosomal recessive pattern of inheritance. No individual in the SEC cohort was found to be homozygous or compound heterozygous for the pathogenic variants in any of the 80 disorders, nor did any individual carry more than one TID-causing allele.
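The headline carrier frequency follows directly from the two counts given above. A minimal worked check, using only the numbers reported in the text (the variable names are illustrative):

```python
carriers = 46      # unrelated individuals carrying a pathogenic variant
cohort_size = 831  # individuals in the SEC cohort

carrier_fraction = carriers / cohort_size
one_in_n = round(cohort_size / carriers)

print(f"{carrier_fraction:.1%} of the cohort, about 1 in {one_in_n}")
# prints "5.5% of the cohort, about 1 in 18"
```

The unrounded figure of about 5.5% corresponds to the 6% quoted in the text after rounding to the nearest whole percent.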

Allele frequency comparison between SEC and ExAC

Overall, the disease-causing alleles associated with the 13 disorders were more frequent in our SEC cohort than their corresponding overall allele frequencies in the ExAC data set. This was particularly evident in some conditions such as citrin deficiency, where comparison with the total allele frequency in ExAC showed that the SLC25A13 variants p.Arg285fs and c.615+5G>A were 24 times more common in our local population (Table 2). Although the occurrence of our local alleles more closely resembled that of East Asian individuals in ExAC, certain alleles were more frequent in our local population than in East Asian individuals in ExAC, including ATP7B (associated with Wilson disease) p.T784M (12 times more common), MMACHC (associated with cobalamin C deficiency) p.R132X (6 times more common) and p.R161Q (6 times more common), MUT (associated with methylmalonic acidemia) c.1677-1G>A (6 times more common) and TH (associated with tyrosine hydroxylase deficiency) p.R202H (6 times more common) (Table 2). In addition, a number of reportedly pathogenic variants, such as OXCT1 (associated with SCOT deficiency) p.T58M, were seen in homozygous states in healthy individuals, likely representing misclassification of these variants in ClinVar and/or HGMD (Table S4 and Fig. S1).
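The "times more common" figures are ratios of the local allele frequency to the reference allele frequency. A minimal sketch of that comparison follows; the function name is hypothetical, and the frequencies in the usage note are invented for illustration and are not values from Table 2.

```python
from typing import Optional


def fold_enrichment(local_af: float, reference_af: float) -> Optional[float]:
    """Ratio of local to reference allele frequency.

    Returns None when the allele is absent from the reference data set,
    since the ratio is then undefined.
    """
    if reference_af <= 0:
        return None
    return local_af / reference_af
```

With made-up frequencies, `fold_enrichment(0.012, 0.001)` gives a ratio of roughly 12, which would be read as "12 times more common". An allele absent from the reference returns `None` rather than an infinite ratio.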

Discussion

The SinGapore Incidental Finding (SGIF) study group was set up to develop a formalized framework to understand the prevalence of genetic conditions in our community. In our previous analysis of 377 individuals, we estimated the prevalence of asymptomatic individuals with incidental findings related to adult onset dominant monogenic disorders at 2%.15 We have since developed an exomic data reference bank which, to date, contains sequences of 831 individuals. In this study, we chose TIDs as a model as, despite the rarity of these disorders, their association with ID and availability of treatment make them clinically significant.

Our study found that 1 in 18 individuals, or 6% of the population, carried a PV associated with the risk of having an offspring with these disorders. Five of the 81 TIDs had more than one carrier in our local population (the cumulative prevalence of these 5 disorders was 1 in 23 individuals). The newborn screening program for inborn errors of metabolism in Singapore is based on tandem mass spectrometry and currently screens for 42 disorders, 24 of which are TIDs.16 However, only three of the five most prevalent conditions identified in our analysis are included, which suggests a potential role for genetics-based screening as an adjunct tool to identify additional at-risk individuals within the local population.

Among the disorders identified, citrin deficiency (1 in 41) and Wilson disease (1 in 103) were the most common. Both of these conditions are recessive and present with hepatic, neurologic and/or psychiatric symptoms that range in severity and age of onset. Notably, because of the non-specific phenotype of these conditions, and because newborn screening commenced in Singapore only in 2006, it is possible that many affected individuals are under-diagnosed.17,18,19,20 Citrin deficiency is caused by a PV in the SLC25A13 gene, which encodes the mitochondrial transporter citrin.21 Two major phenotypes have been described: neonatal intrahepatic cholestasis caused by citrin deficiency (NICCD) and childhood to adult onset citrullinemia type II (CTLN2), which can present with neuropsychiatric symptoms.22 Some individuals can present with fulminant liver failure requiring liver transplantation.23 Diagnosis can be confirmed by quantitative amino acid analysis, and symptoms in both early and late onset disease can be treated with a protein-rich, lipid-rich and lactose-free diet.22

Wilson disease has a global carrier frequency of 1 in 90.24 Patients present with hepatic failure and neuropsychiatric symptoms and are successfully treated with copper chelation therapy such as penicillamine. Wilson disease is caused by PV in the ATP7B gene, which results in accumulation of copper in the liver and brain. Diagnosis can be established by biochemical findings of low ceruloplasmin concentrations and the presence of ocular Kayser-Fleischer rings. However, these methods do not accurately detect all individuals with Wilson disease,25 and diagnosis may be delayed in individuals who present with non-specific symptoms, preventing initiation of appropriate treatment.26 Similarly, each of the remaining 78 conditions can be readily managed by interventions such as vitamin supplementation, dietary restrictions or medications such as chelators, emphasising the compelling need for, and medical actionability of, screening for these treatable disorders.10 Timely diagnosis and recognition of the underlying metabolic defect enable clinical management before symptoms manifest.

This study demonstrates the value of an ethnicity-specific genomic data set for studying disease prevalence relevant to the local population. Lazarin et al., 2014, performed a large scale analysis of carrier frequencies targeting a pre-set list of variants in 23,453 individuals of diverse ethnicities.5 In comparison to previously published frequencies, they found discrepancies in several conditions. For example, they found the carrier frequency of cystic fibrosis in South Asians to be 1 in 40, where it had previously been reported as 1 in 118, and the carrier frequency of carnitine palmitoyltransferase II deficiency in East Asians to be 1 in 378, where the condition had previously been reported as rare. Likewise, newborn screening in Singapore has detected a higher incidence of fatty acid oxidation disorders (1 in 6565) in comparison to other Asian population studies in Taiwan, China and Japan (1 in 9300 to 1 in 54,000).16 Without such studies, which are currently underrepresented in Asia, the prioritisation of genetic screening programs to inform public health initiatives is challenging. Such genetic epidemiological studies provide a framework and evidence of genetic risk to prioritise public health initiatives for diagnosis and management of these disorders for a very local, yet internationally applicable, context.

One limitation of the study is the lack of genomic data from the non-Chinese populace of Singapore. Singapore has a multiethnic population comprising Chinese (76%), Malays (15%), Indians (7%) and other minority ethnicities including Eurasians. Our cohort was predominantly Chinese (91%) and hence underrepresented the other ethnicities of Singapore. To address this issue, we have embarked on generating data targeting individuals of Malay and Indian ancestry. Despite this current limitation, this is the first study to report on the prevalence of TIDs in the local population, demonstrating a difference in the burden of genetic disorders compared to the Western population. More importantly, provided there is adequate genomic data, the framework presented here can be applied in any country to guide local public health policies, not only in relation to TIDs but also to other genetic disorders, including monogenic adult onset disorders such as hereditary breast and ovarian cancer syndromes.15

In conclusion, while the promise of precision medicine is alluring, there is an urgent need for genomic data from populations of underserved countries. With more population-relevant data, healthcare practices can be tailored towards improving preventative care and treatment, taking the fullest advantage of precision medicine technologies to fulfil this promise.