Abstract
We propose that a principle of proportionality be applied to genomic data that weighs the depth of data (what is shared) against the breadth of sharing (with whom).
Main
Broad sharing of research data is well recognized as being important to maximize the generalizable knowledge that can be derived from the combined contributions of researchers, funders and research participants. In parallel, responsible sharing of clinical data is increasingly viewed as an integral component of good clinical care, maximizing benefits for patients. These two distinct benefits have led to growing advocacy for data sharing, especially within the area of genetics and genomics, where the openness of the Human Genome Project has catalysed a culture of sharing that is now spreading into the broader life sciences.
The debate over sharing consented health data has become increasingly polarized. At one end of the spectrum, researchers and technology giants argue passionately that we should 'free' the data, citing enormous benefits for the future of science and medicine. At the other end of the spectrum, privacy campaigners implore us to consider the appalling consequences of inadvertent loss of confidentiality, unfair discrimination and personal exploitation.
The current landscape of clinical data sharing initiatives is considerably more diverse and nuanced than these two extreme positions convey (Refs 1,2). Making progress in discussions about the benefits and harms of data sharing is impossible without achieving clarity about what is meant by both 'data' and 'sharing'. In reality, both the 'depth of data' (what is shared) and the 'breadth of sharing' (with whom) can vary hugely, offering the opportunity to find a proportionate approach that balances doing good (beneficence) and not doing harm (non-maleficence).
Genomic data exemplify many of the issues associated with data sharing. The data are generally large, digital and identifying; given a dataset and a DNA sample, re-identification is possible from just a tiny proportion of an individual's genome sequence (Ref. 3). A genome sequence contains data with hugely variable levels of scientific uncertainty, predictive utility and personal sensitivity, so the hazards of linking it to private or personally compromising information (such as medical records) are potentially substantial and often unpredictable. Nonetheless, because understanding an individual genome requires comparison with data from many thousands of others, sharing the data widely is an absolute necessity for clinical interpretation.
The principle of proportionate sharing of genetic data, which balances the depth of data with the breadth of sharing, has already been applied somewhat independently in different areas of human genetics. As the size of genetics research datasets has increased, new models of sharing have been developed that supplement the more traditional mode of sharing research data through publication. The depth and scale of these data have led to increasing concerns about the potential identifiability of anonymized research participants and the harms (such as unfair discrimination) that might result. These concerns have been amplified by broader societal anxieties about the erosion of privacy through the greater collection of individual-level data by third parties, both commercial and governmental. For example, in the area of complex traits, where personal results are of questionable utility (Ref. 4), individual-level genomic data are typically tightly managed and shared only with bona fide researchers who sign data access agreements, whereas population-level summary data (for example, P values of trait associations or allele frequencies for individual variants) are often made more broadly available. In the area of rare diseases, where finding a molecular diagnosis is key, an alternative approach has been developed that enables broad sharing of individual-level data but limits the depth of the data, perhaps to just one or a handful of genetic variants per individual. Here, the likelihood of re-identification is small, and the associated hazards are contained and predictable, so such data can be shared completely openly with relatively low risk.
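The distinction between individual-level and population-level summary data can be made concrete with a minimal sketch. It assumes a toy cohort of diploid individuals whose genotypes are stored as counts of the alternate allele (0, 1 or 2); the function name, variant identifiers and data layout are hypothetical and serve only as illustration. The sketch is not a privacy guarantee; it simply shows that the summary contains one number per variant rather than one row per person.

```python
from typing import Dict, List


def allele_frequencies(genotypes: Dict[str, List[int]]) -> Dict[str, float]:
    """Reduce individual-level genotypes to per-variant allele frequencies.

    `genotypes` maps a variant identifier to a list of alternate-allele
    counts (0, 1 or 2), one entry per diploid individual. This layout is
    an assumption made for illustration only.
    """
    freqs: Dict[str, float] = {}
    for variant_id, counts in genotypes.items():
        if not counts:
            # No individuals genotyped at this variant: nothing to report.
            continue
        # Each diploid individual carries two alleles at the site.
        freqs[variant_id] = sum(counts) / (2 * len(counts))
    return freqs


# Hypothetical individual-level data for four participants (managed access).
cohort = {
    "var_A": [0, 1, 2, 0],
    "var_B": [0, 0, 1, 0],
}

# Population-level summary that could be shared more broadly.
summary = allele_frequencies(cohort)
```

In this toy example the individual-level table would stay behind a data access agreement, and only `summary` would be released.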
Historically, the clinical and research worlds have trodden very different paths with respect to data sharing; however, they are now converging on the same message. In the UK National Health Service (NHS), an Information Governance Review in 2013 resulted in an additional Caldicott Principle in the NHS confidentiality code of practice, stating that “the duty to share information can be as important as the duty to protect patient confidentiality” (Ref. 5). This principle accords with the framework for responsible sharing of genomic and health-related data recently released by the Global Alliance for Genomics and Health, which is grounded in a human rights framework that recognizes a patient's right to benefit from science and to share their data (Ref. 6). Both advocate a thoughtful and balanced approach to data sharing, in which the likely risks and benefits are taken into consideration when deciding what data to share and with whom.
We have tried to exemplify a proportionate approach with the Deciphering Developmental Disorders (DDD) Study, a UK-wide translational study that aims to elucidate the genetic causes of severe, undiagnosed developmental disorders. The study received Research Ethics Committee approval on the basis that diagnostic findings would be returned to participants, but incidental findings would not, with the expectation that this would yield most of the individual benefits while mitigating potential harms (Ref. 7). In addition, the patient information sheet emphasizes that sharing data with other research teams across the world will allow more scientists to investigate the causes of developmental disorders, increasing the chance of important discoveries. As a result, we developed a two-tier approach to data sharing.
First, anonymized individual-level microarray and exome sequencing data (BAM and Variant Call Format (VCF) files containing ~80,000 variants in every participant) associated with detailed phenotypic descriptions are shared securely with authorized researchers under a managed data access agreement, to enable further research into developmental disorders. Second, a small number of individual variants are shared openly with phenotypic descriptions via the DECIPHER database (Ref. 8). Open data include likely diagnostic variants in genes already robustly implicated in developmental disorders (~1 variant in every 3 participants), which are noted in individual-linked anonymous patient records and shared initially with the referring clinical teams for validation and communication to families before being made public. In addition, possible pathogenic variants in genes not currently implicated in developmental disorders (~1 variant in every participant remaining without a diagnosis) are also shared openly via a dedicated 'research track' in DECIPHER but are linked only to a high-level phenotype description. This prevents identification (including self-identification) of individuals in whom these results may be unwanted, but enables match-making with data deposited by clinicians around the world who may have similar undiagnosed patients with comparable variants. To the best of our knowledge, no harm has befallen any individual patient or family as a result of sharing their genetic variants via DECIPHER. By contrast, we have a plethora of examples in which this approach has led to novel gene discovery and resulted in new diagnoses being made in previously undiagnosed patients.
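The two-tier logic described above can be summarized in a short, hypothetical sketch. It is not the DDD or DECIPHER implementation: the class names, the category labels ("diagnostic", "candidate", "other") and the record layout are assumptions introduced for illustration, and the clinical validation step that precedes public release of diagnostic variants is deliberately omitted.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Variant:
    variant_id: str
    # Hypothetical labels: "diagnostic" (known gene), "candidate"
    # (gene not yet implicated) or "other" (everything else).
    category: str


@dataclass
class Participant:
    variants: List[Variant]
    detailed_phenotype: str
    high_level_phenotype: str


def managed_tier(p: Participant) -> Dict[str, object]:
    """Deep data, narrow sharing: all variants plus detailed phenotype,
    intended only for researchers under a data access agreement."""
    return {
        "variants": [v.variant_id for v in p.variants],
        "phenotype": p.detailed_phenotype,
    }


def open_tier(p: Participant) -> List[Dict[str, str]]:
    """Shallow data, broad sharing: a handful of selected variants."""
    shared: List[Dict[str, str]] = []
    for v in p.variants:
        if v.category == "diagnostic":
            # Likely diagnostic variants are shared with the phenotype.
            shared.append({"variant": v.variant_id,
                           "phenotype": p.detailed_phenotype})
        elif v.category == "candidate":
            # Research-track variants carry only a high-level phenotype.
            shared.append({"variant": v.variant_id,
                           "phenotype": p.high_level_phenotype})
        # All other variants stay in the managed tier only.
    return shared


# Hypothetical participant with one variant of each kind.
example = Participant(
    variants=[Variant("v1", "diagnostic"),
              Variant("v2", "candidate"),
              Variant("v3", "other")],
    detailed_phenotype="detailed clinical description",
    high_level_phenotype="broad category",
)
open_records = open_tier(example)
```

The design choice illustrated here is that depth and breadth move in opposite directions: the managed tier holds everything for a restricted audience, whereas the open tier exposes only the few variants most likely to help with diagnosis or match-making.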
The ongoing debate around the sharing of genetic data from both medical research and clinical testing reflects the collision of privacy concerns with the desire to enable demonstrable benefits to individuals and to wider society. The uncertain clinical relevance of many genomic variants further complicates this dilemma. It remains debatable whether public sharing of whole-genome sequences, as advocated by the Personal Genome Project (Ref. 9), or proportionate sharing of selected subsets of genomic data will accrue the greatest benefit to individuals. Although the research community gains hugely from widespread sharing of whole-genome data, the benefit to an individual of sharing their whole-genome sequence, above and beyond sharing a handful of the most 'interesting' variants, is currently unproven. Moreover, there are very limited empirical data on what research participants or patients think about this subject. Perhaps the crucial point is that alternative sharing options exist, and may be preferable, for different data types. We suggest that a principle of proportionality should be applied to genomic data, which considers the purpose and risks of data sharing and flexes the depth of data and breadth of sharing to optimize this balance. Rather than pitting research gains against privacy concerns, this approach serves both research and clinical communities, and aims to maximize the proven benefits while minimizing the potential harms.
References
1. Kaye, J. The tension between data sharing and the protection of privacy in genomics research. Annu. Rev. Genomics Hum. Genet. 13, 415–431 (2012).
2. van Schaik, T. A., Kovalevskaya, N. V., Protopapas, E., Wahid, H. & Nielsen, F. G. G. The need to redefine genomic data sharing: a focus on data accessibility. Appl. Transl. Genomics 3, 100–104 (2014).
3. Gymrek, M., McGuire, A. L., Golan, D., Halperin, E. & Erlich, Y. Identifying personal genomes by surname inference. Science 339, 321–324 (2013).
4. Janssens, A. C. & van Duijn, C. M. Genome-based prediction of common diseases: advances and prospects. Hum. Mol. Genet. 17, R166–R173 (2008).
5. Dyer, C. Review recommends duty to share data when in patient's best interests. BMJ 346, f2642 (2013).
6. Knoppers, B. M. Framework for responsible sharing of genomic and health-related data. HUGO J. 8, 3 (2014).
7. Wright, C. F. et al. Genetic diagnosis of developmental disorders in the DDD study: a scalable analysis of genome-wide research data. Lancet 385, 1305–1314 (2015).
8. Bragin, E. et al. DECIPHER: database for the interpretation of phenotype-linked plausibly pathogenic sequence and copy-number variation. Nucleic Acids Res. 42, D993–D1000 (2014).
9. Lunshof, J. E., Chadwick, R., Vorhaus, D. B. & Church, G. M. From genetic privacy to open consent. Nat. Rev. Genet. 9, 406–411 (2008).
Ethics declarations
Competing interests
M.E.H. is a consultant for and a shareholder in Congenica Ltd, which provides genetic diagnostic services. The other authors declare no competing interests.
Cite this article
Wright, C., Hurles, M. & Firth, H. Principle of proportionality in genomic data sharing. Nat Rev Genet 17, 1–2 (2016). https://doi.org/10.1038/nrg.2015.5
DOI: https://doi.org/10.1038/nrg.2015.5