Introduction

A rare disease is defined in Europe as a disorder that affects fewer than one in 2000 people [1]. The Orphan Drug Act defines a rare disease as a condition affecting fewer than 200,000 individuals living in the United States [2]. There are an estimated 300 million people worldwide living with a rare disease [1], corresponding to 3.5 million individuals in the UK and 30 million in Europe. Most conditions have a genetic aetiology. Rare diseases can manifest with physical and mental health challenges and may impact family dynamics, relationships, education, employment and broader society [3]. Disease-modifying treatments are limited. Studies of rare diseases typically involve small sample sizes, which can lead to statistically underpowered analyses and make conventional study designs unfeasible [4]. Research participant recruitment is often restricted to specialist centres, leading to selected, unrepresentative study populations with limited generalisability [4]. Rare disease registries, although informative, are resource-intensive and difficult to scale and maintain [5].

The UK’s National Health Service (NHS) is one of the largest publicly funded healthcare systems in the world. More than 98% of the UK population is registered with a General Practitioner (GP). A GP is a primary care physician with responsibilities for delivering, coordinating and gatekeeping healthcare. Most NHS consultations occur within primary care. There are three main primary care clinical information systems in the UK: EMIS Web (Optum, formerly EMIS Health), SystmOne (The Phoenix Partnership, TPP) and Vision (Cegedim Healthcare Solutions) [6]. These differ in user interfaces, but are all designed for clinical record keeping and can incorporate correspondence from health and social care providers. Clinical terminologies are based on Read codes, Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) and vendor-specific codes, reflecting legacy and current coding systems. Prescriptions are aligned with the NHS Dictionary of Medicines and Devices (dm+d).

There are several large, regularly updated national research databases derived from electronic health records (EHR) of patients registered at UK primary care practices. These databases provide access to routinely collected pseudonymised data. Structured data captures patient demographics, registration details, consultations, diagnoses, presentations, investigations, prescriptions and referrals (Supplementary Fig. 1). Examples of UK primary care EHR databases include Clinical Practice Research Datalink (CPRD) [7, 8], Optimum Patient Care Research Database (OPCRD) [9], QResearch [10], The Health Improvement Network (THIN) [11] and Welsh Longitudinal General Practice (WLGP) dataset accessed through Secure Anonymised Information Linkage (SAIL) Databank [12] (Supplementary Table 1). Strengths include their size, population representativeness and breadth of available data [6]. Data linkage can be established at the individual level with other national datasets capturing hospital admissions, outpatient appointments, indices of deprivation measured at the small-area level, and death registrations.

Published evaluations of UK primary care EHR research databases consistently demonstrate comparability with the national population, in terms of age [7, 9, 11], sex [7, 9, 11], ethnicity [13] and socioeconomic position [7, 9]. They have supported research in epidemiology, health economics, risk prediction modelling and randomised controlled trials to generate real-world evidence informing national guidelines [14, 15]. Study eligibility is primarily determined by diagnoses recorded in routinely collected health data, thereby widening participation to individuals who might otherwise be unable or unwilling to take part in traditional research studies [4]. Individual consent is not required; instead, access to pseudonymised data is subject to study-level approval by an independent scientific advisory board and supported by national legal frameworks, including Section 251 of the NHS Act 2006, UK General Data Protection Regulation (GDPR) and the Data Protection Act 2018. Approved research is conducted within a trusted research environment or secure data platform. Stringent governance stipulations are in place to protect privacy and confidentiality, with strict disclosure controls (e.g., not reporting cell counts below five).

UK primary care EHR databases provide a secure and representative platform for conducting health research at scale and appear well suited to rare genetic diseases (Supplementary Fig. 2). A multistakeholder task force convened by the International Rare Diseases Research Consortium (IRDiRC) highlighted longitudinal primary care records as a potential resource for understanding natural history, supporting diagnosis and improving management [16]. This was also recognised in the UK Department of Health and Social Care Task and Finish Group 2021 report, The Diagnostic Odyssey in Rare Diseases, which suggested UK primary care EHR databases could be used to measure temporal changes in the diagnostic odyssey [17]. Their broader application, however, remains poorly characterised. This review aimed to systematically identify studies of rare genetic diseases conducted using UK primary care EHR databases, describe their characteristics and synthesise the mapped literature to assess the opportunities and challenges of using these databases for conducting rare genetic disease research.

Materials and methods

Data sources, study eligibility criteria, study screening, selection and data collection

The online bibliographies of five major UK primary care EHR research databases (CPRD, OPCRD, QResearch, SAIL Databank and THIN) were accessed on 21 January 2025 via their publicly available URLs. Titles were independently assessed by two clinician investigators using a two-stage procedure. In the first stage, all titles indexed in the bibliographies were screened to identify all potentially relevant studies, including those where eligibility could not be inferred from the title. In the second stage, full-text assessments were undertaken for all articles identified in stage 1 against predefined eligibility criteria. Inclusion criteria were: (i) peer-reviewed publications; (ii) use of one or more of the five specified UK primary care EHR databases (CPRD, OPCRD, QResearch, SAIL Databank and THIN); (iii) study of a germline rare genetic disease catalogued in Orphanet (with a prevalence of <1 in 2000 in Europe); and (iv) findings reported separately for the rare genetic disease. Exclusion criteria were: (i) conditions with non-germline genetic or non-genetic aetiology (e.g., oligogenic, polygenic, multifactorial, somatic, or unknown); and (ii) findings not disaggregated by rare genetic disease (i.e., included only as part of aggregated groups, composite phenotypes, or descriptive counts without disease-specific outcome reporting). No restrictions were imposed on the publication date or study design. Discrepancies were resolved by consensus, with input from the wider study team where required. Summary data were extracted from all eligible publications.

Data extraction procedure and narrative synthesis

Data from eligible studies were extracted using a standardised form. Study characteristics were recorded, including rare genetic diseases, methodologies, key findings, impact and implications. Each condition was verified and summarised using information from Orphanet (www.orpha.net), a rare disease information portal and ontology. Each Orphanet condition is assigned a unique ORPHAcode mapped to other classification systems, including Online Mendelian Inheritance in Man (OMIM) and the World Health Organization (WHO) International Classification of Diseases, 10th revision (ICD-10). Condition name, ORPHAcode, OMIM codes, ICD-10 codes, inheritance and prevalence estimates were extracted from Orphanet. The WHO online browser tool (https://icd.who.int/browse10/2019/en) was used to document the official ICD-10 terms and index terms. This enabled assessment of whether the rare disease had a specific ICD-10 code or was classified under a broader diagnostic category. ICD-10 remains the diagnostic classification system used in national administrative datasets linked with UK primary care databases, including Hospital Episode Statistics (HES) and Office for National Statistics (ONS) death registrations. Study methodology from all eligible studies was summarised by study design, case ascertainment, case definition, comparator details, sample size, study period, datasets, outcome domains, statistical analyses, key findings, implications and impact. The use of ethnicity and area-level deprivation data was recorded. Patient and public involvement and engagement (PPIE) activities were documented. Funding declarations for each publication were collated. Together, this informed a narrative synthesis of current and potential applications of UK primary care EHR databases for rare genetic disease research.

Rare genetic disease studies are underrepresented in research outputs from UK primary care EHR databases

First, we set out to identify all peer-reviewed publications reporting rare genetic diseases from five major longstanding UK primary care EHR databases: CPRD, OPCRD, QResearch, SAIL Databank and THIN. In total, these database bibliographies indexed 5754 studies published between 1987 and 2025 (Supplementary Fig. 3). Title screening yielded 198 articles for full-text review. Of these, 152 were excluded. Reasons included failure to report rare genetic diseases separately. One additional publication was identified from the reference list of an eligible paper. In total, 47 articles reported rare genetic diseases, corresponding to 0.82% of all indexed research outputs from the five databases reviewed (Fig. 1A).
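The screening arithmetic above can be checked in a few lines. This is a minimal sketch that uses only the counts reported in this section; the variable names are illustrative and do not come from the review.

```python
# Counts reported in the text (variable names are illustrative).
indexed = 5754          # studies indexed across the five bibliographies
full_text = 198         # articles taken forward after title screening
excluded = 152          # articles excluded at full-text review
from_references = 1     # extra publication found via a reference list

eligible = full_text - excluded + from_references
share = 100 * eligible / indexed

print(eligible)         # 47
print(f"{share:.2f}%")  # 0.82%
```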

Fig. 1: Identification and characteristics of publications reporting rare genetic diseases using UK primary care electronic health record research databases.

A Bibliographic identification and screening outcomes. Stacked squares represent indexed research outputs from five UK primary care databases as of 21 January 2025 (one square is 300 records), coloured by database. The adjacent traffic light summarises title screening (n = 5754), full text assessment (n = 198), and eligible peer-reviewed publications (n = 47). B Temporal distribution of eligible publications (one square is one publication), coloured by database and plotted by publication year. Solid outlines indicate studies with a primary focus on the rare genetic disease; dashed outlines indicate studies in which the condition formed part of a broader research remit. C Condition-level mapping of eligible publications. Each square represents a single publication and is repeated for every condition reported; multi-condition studies therefore contribute multiple squares. Conditions are ordered from most to least frequently reported: DM1, myotonic dystrophy type 1; CF cystic fibrosis; HD Huntington’s disease; TSC tuberous sclerosis complex; SCD sickle cell disease; DMD Duchenne muscular dystrophy; CAH congenital adrenal hyperplasia; HHT hereditary haemorrhagic telangiectasia; vWD hereditary von Willebrand disease; XLH X-linked hypophosphataemia; BMD Becker muscular dystrophy; FSHD facioscapulohumeral muscular dystrophy; ADPKD autosomal dominant polycystic kidney disease; β-T beta-thalassaemia; DS Dravet syndrome; LQTS congenital long QT syndrome; AATD alpha-1 antitrypsin deficiency; ACH achondroplasia; FXS Fragile X syndrome; MFS Marfan syndrome; NS Noonan syndrome; 45X, Turner syndrome; 22q11.2DS 22q11.2 deletion syndrome.

The most frequently used database was CPRD (n = 35; 0.93% of CPRD publications), followed by SAIL Databank (n = 6; 1.07%), THIN (n = 4; 0.40%), OPCRD (n = 1; 0.88%) and QResearch (n = 1; 0.32%) (Table 1). All eligible studies were published from 2011 onwards (Fig. 1B). Most studies used a single UK primary care EHR database (42 of 47). Five studies combined CPRD Gold and CPRD Aurum [18,19,20,21,22], which differ by EHR system and population coverage [7, 8] (Supplementary Table 1). Three studies also used international data [23,24,25], including analysis of the MarketScan USA claims database in parallel with CPRD [23], a SAIL Databank study forming part of the Establishing a linked European Cohort of Children with Congenital Anomalies (EUROlinkCAT) consortium [24] and a study conducted in Wales and Denmark [25]. Public funders based in the UK, USA, Canada and the EU supported more than half of the studies (25 of 47), either alone (n = 19) or in combination with charitable or industry funding (n = 6) (Supplementary Table 2). Of the remaining studies, fourteen were funded exclusively by industry, seven exclusively by charitable organisations and one reported no dedicated funding.

Table 1 Summary of publications from UK primary care electronic health record databases reporting rare genetic diseases, by primary care database, condition and outcome domains.

Next, we examined the use of linked datasets. Eleven studies relied solely on primary care data [23, 26,27,28,29,30,31,32,33,34,35]. A further five studies used primary care records linked only to area-level deprivation data [18, 36,37,38,39]. Overall, 77% of studies (36 of 47) linked primary care records to one or more external datasets, including hospital records, death registrations and area-level deprivation data (Supplementary Table 3). The most frequent linkage was to hospital admissions data, used in 26 studies. These were Hospital Episode Statistics Admitted Patient Care (HES APC), which captures all NHS-funded inpatient and day-case admissions in England, and the Patient Episode Database for Wales (PEDW). HES APC was used in 20 CPRD studies [19,20,21,22, 40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55] and one QResearch study [56]. PEDW was used in five of six SAIL Databank publications [25, 57,58,59,60]. Area-level deprivation measures were used in 19 studies [18, 25, 36,37,38,39, 44, 46, 48, 52,53,54,55,56, 58,59,60,61,62]. The Index of Multiple Deprivation (IMD) featured in 15 studies: 11 from CPRD [18, 39, 44, 46, 48, 52,53,54,55, 61, 62] and four from SAIL Databank [25, 58,59,60]. The Townsend Deprivation Index was used in three THIN studies [36,37,38] and one QResearch study [56]. Other core linkages were ONS death registrations in 15 studies [20,21,22, 40, 45, 46, 48, 50, 51, 53, 56, 61,62,63,64] and hospital outpatient data in 10 studies, comprising Hospital Episode Statistics Outpatients (HES OP) in England for nine CPRD studies [40,41,42,43, 45, 50, 51, 54, 55] and the Outpatient Database for Wales in one SAIL Databank study [60].

We then examined the function of linked datasets within included studies. Approximately one third of publications used linked datasets to define the study population using routinely recorded diagnostic records, usually alongside primary care records (Supplementary Table 3). Linked datasets were also frequently used to derive covariates (e.g., IMD) and a broad range of outcome domains, including comorbidities and complications (Table 1). Mortality outcomes were reported in 17 studies, including 11 that linked with the gold-standard ONS death registration dataset [20, 21, 40, 45, 50, 51, 53, 56, 61, 62, 64] and six that did not [26, 31, 37, 54, 55, 59]. A further four studies used ONS death registrations to ascertain the date of death for censoring (i.e., end of follow-up) or cause of death information to support outcome ascertainment [22, 46, 48, 63]. Other common outcomes included rare disease prevalence and incidence (n = 14), prescribed medications (n = 11) and healthcare utilisation or costs (n = 8). Mental health outcomes were the focus of three studies [20, 32, 45]. One SAIL Databank study [58] investigated educational attainment and special educational needs designation using the National Pupil Database (Table 1).

In summary, rare genetic disease research was markedly underrepresented in outputs from five major UK primary care EHR databases, accounting for fewer than 1% of publications. All eligible studies were published from 2011 onwards, with CPRD as the predominant platform and smaller contributions from SAIL Databank, THIN, OPCRD and QResearch. Linked dataset usage broadened the scope of research and outcome domains spanned health [51], education [58] and mortality [53].

Rare genetic disease studies from UK primary care EHR databases are concentrated on a small number of disorders

Next, we assessed how rare genetic diseases were represented across the literature. Thirty-six of 47 studies had a primary focus on a rare genetic disease (Fig. 1B). The remaining 11 studies had a broader research remit that included a rare genetic disease, but not as the primary focus. For example, a life course study examined age-specific incidence and period prevalence of 308 phenotypes, including cystic fibrosis and sickle cell disease [47]. All 36 articles with a primary focus investigated a single condition. The largest number of rare genetic diseases reported in a single publication was four [18, 34], in both cases as minor components of studies with a broader research remit. In total, 23 rare genetic diseases were studied (Table 2). Twelve conditions featured in multiple publications, and 11 conditions were reported in a single publication (Fig. 1C). Myotonic dystrophy type 1 was the most studied condition with nine associated publications, followed by cystic fibrosis (n = 8) and Huntington’s disease (n = 7). Together, these three conditions accounted for more than 50% of studies.

Table 2 Rare genetic diseases investigated using UK primary care electronic health record research databases, with ORPHAcode, Orphanet prevalence estimates, inheritance patterns, age of onset, and associated publications from this review.

We then mapped conditions to established rare disease identifiers. Aligned with our eligibility criteria, all 23 conditions had an ORPHAcode (Table 2). Three conditions were classified at the group-level (congenital adrenal hyperplasia, congenital long QT syndrome and sickle cell disease) and 20 at the disorder-level. Group-level classification refers to a collection of clinical entities sharing common features, whereas disorder-level classification denotes individual clinical entities for which a definitive clinical diagnosis can be made. The 23 conditions mapped to 82 OMIM codes, reflecting substantial genetic and phenotypic heterogeneity (Supplementary Table 4).

To indicate where each condition lay on the rare spectrum, we extracted Orphanet prevalence estimates. All conditions in the review fell within the two most frequent prevalence bands (1-5 in 10,000 to 1-9 in 100,000). Four conditions had ‘unknown’ in the structured Orphanet prevalence field, but the accompanying summary text provided estimates at birth (Table 2). The lowest prevalence reported in the review was for juvenile Huntington’s disease (onset before 21 years), with a minimum prevalence of 6.77 per million patient-years and period prevalence in 2010 of 1 in 385,000 [27]. Cystic fibrosis was the most prevalent condition, estimated in one study at 5.76 in 10,000 live births, based on ascertainment from multiple datasets [25]. This estimate marginally exceeded the rare disease prevalence threshold; however, as cystic fibrosis remains classified by Orphanet as a rare genetic disease (ORPHA:586), for completeness, we opted to include it in the review. Rare disease sample sizes ranged from 21 to 5059 cases, with a median of 392 (Supplementary Table 3). A recent CPRD study of beta-thalassaemia had a starting population of 11,359, with analysis restricted to 237 individuals with transfusion-dependent disease [55]. To indicate overall scale, we summed the largest sample size reported for each condition, which resulted in a minimum combined sample size of 21,340 (Supplementary Table 5).
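The combined sample size calculation described above (summing the largest sample reported for each condition) can be sketched as follows. The function name and the example records are hypothetical and are not values from the review. Taking only one study per condition avoids counting the same individuals more than once, which is one plausible reading of why the total is described as a minimum.

```python
from collections import defaultdict

def combined_minimum_sample(records):
    """Sum the largest sample size reported for each condition."""
    largest = defaultdict(int)
    for condition, n_cases in records:
        largest[condition] = max(largest[condition], n_cases)
    return sum(largest.values())

# Hypothetical (condition, sample size) pairs, not values from the review.
records = [
    ("condition A", 120),
    ("condition A", 300),
    ("condition B", 45),
    ("condition C", 21),
    ("condition C", 18),
]
print(combined_minimum_sample(records))  # 366
```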

To explore factors that may have influenced which conditions were studied, we reviewed the genetic basis, inheritance and age of onset (Supplementary Table 4). Twenty-one of the 23 conditions were single-gene disorders, and two were chromosomal. Thirteen conditions were autosomal dominant, five were autosomal recessive and four were X-linked. Most were early-onset conditions, spanning prenatal (achondroplasia), neonatal (congenital adrenal hyperplasia), infancy (cystic fibrosis) and childhood (Duchenne muscular dystrophy). Five were adult-onset conditions: alpha-1 antitrypsin deficiency, autosomal dominant polycystic kidney disease, facioscapulohumeral muscular dystrophy, hereditary haemorrhagic telangiectasia and myotonic dystrophy type 1. Congenital and childhood-onset forms of myotonic dystrophy type 1 also occur [46], reflecting variable expressivity and trinucleotide repeat expansion size.

Drawing on OMIM, Orphanet, and our own clinical experience, we characterised conditions further by primary phenotypes and affected systems (Supplementary Table 4). Most conditions were multisystem, with diverse primary phenotypes; the most common were developmental (n = 6), neuromuscular (n = 4) and haematological (n = 3). We then assigned the clinical specialty typically responsible for leading care for each condition, recognising variation in service organisation and that care often involves multidisciplinary teams, specialist services and clinical genetics input. Neurology was the lead specialty for seven conditions, followed by haematology for three (Supplementary Table 4). Other clinical specialties included paediatrics, cardiology, endocrinology, respiratory medicine and nephrology.

Finally, noting that targeted therapies may have contributed to research activity, we reviewed established treatments. Of the 23 conditions, 11 had therapies beyond supportive or symptomatic management (Supplementary Table 4). These included replacement-based treatments such as corticosteroid and mineralocorticoid replacement for congenital adrenal hyperplasia, growth hormone therapy for Turner syndrome, von Willebrand factor concentrates for hereditary von Willebrand disease, and plasma-derived augmentation therapy for alpha-1 antitrypsin deficiency, as well as targeted or disease-modifying therapies, such as burosumab for X-linked hypophosphataemia, CFTR modulators for cystic fibrosis, mTOR inhibitors for tuberous sclerosis complex, tolvaptan for autosomal dominant polycystic kidney disease and vosoritide for achondroplasia. Casgevy (exagamglogene autotemcel) has recently become clinically available for transfusion-dependent beta-thalassaemia and sickle cell disease, coinciding with two recent CPRD publications [54, 55]. Exon-skipping therapies for Duchenne muscular dystrophy are approved in the USA and Japan, and preliminary findings from the Huntington’s disease AMT-130 gene therapy trial at 36 months showed promising results.

In summary, outputs from UK primary care EHR databases span a broad range of rare genetic diseases, but research activity is skewed towards multisystem, neurological, autosomal dominant, single-gene disorders with relatively higher population frequencies and established or emerging treatments.

Cohort designs predominate in rare disease studies using UK primary care records

To demonstrate the range of methodological approaches used, we summarised study designs (Supplementary Table 3). Most studies used cohort designs (n = 38; 81%). The remainder were case-control studies (n = 3), cross-sectional studies (n = 2) and methodological evaluations (n = 4). Among the 38 cohort studies, 26 included a comparator group: 19 used individual-level matched comparators, six used non-matched internal comparators and one used an external dataset for comparison. Next, we examined individual-level matching strategies, implemented in cohort, case-control and cross-sectional designs. All matched studies (n = 23) matched on age and sex. Nineteen studies also matched on primary care practice. Less common matching variables were geographical region, ethnicity and deprivation. A case-control study used propensity score matching based on age, sex, BMI, smoking status, ethnicity and primary care practice [35]. Where reported, index dates for matching were typically aligned to the earliest diagnostic record, calendar year, or GP registration. Some studies used data completeness eligibility thresholds, such as a minimum primary care registration period [23] or evidence of healthcare activity [41, 42]. The ratio of rare disease cases to matched comparators ranged from 1:2 to 1:40; 1:5 was the most common (six studies), followed by 1:20 (five studies). The highest ratio of 1:40 was from a case-control study using risk-set sampling [20].
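To make the individual-level matching described above concrete, the following is a minimal sketch of exact 1:k matching on sex, year of birth and primary care practice. It is not the procedure of any included study. The field names, the use of year of birth as a proxy for age, and sampling without replacement are all assumptions made for illustration. Index dates and risk-set sampling are not modelled.

```python
import random

def match_comparators(cases, pool, ratio=5, seed=0):
    """Pick up to `ratio` unused comparators per case with the same
    sex, year of birth and practice (exact matching, no replacement)."""
    rng = random.Random(seed)
    used = set()
    matches = {}
    for case in cases:
        key = (case["sex"], case["birth_year"], case["practice"])
        eligible = [
            p for p in pool
            if p["id"] not in used
            and (p["sex"], p["birth_year"], p["practice"]) == key
        ]
        chosen = rng.sample(eligible, min(ratio, len(eligible)))
        used.update(p["id"] for p in chosen)
        matches[case["id"]] = [p["id"] for p in chosen]
    return matches

# Hypothetical records, for illustration only.
cases = [{"id": "c1", "sex": "F", "birth_year": 1980, "practice": "P1"}]
pool = [
    {"id": f"p{i}", "sex": "F", "birth_year": 1980, "practice": "P1"}
    for i in range(8)
] + [{"id": "p99", "sex": "M", "birth_year": 1980, "practice": "P1"}]

result = match_comparators(cases, pool, ratio=5)
print(len(result["c1"]))  # 5
```

When fewer than `ratio` eligible comparators exist, this sketch returns as many as are available rather than dropping the case; a real study would need to state how such cases are handled.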

We then appraised routes of ascertainment (i.e., data sources used to define the rare disease study population). In 46 of 47 studies, eligibility was defined by primary care diagnostic codes, with 28 exclusively using primary care data and 18 studies also permitting diagnostic records from linked datasets (Supplementary Table 3). One study exclusively used the Congenital Anomaly Register and Information Service (CARIS) in Wales [24]. Where secondary care administrative records contributed to case ascertainment, this was based on ICD-10-coded hospital admissions from HES APC in England and PEDW in Wales. We evaluated whether the 23 conditions from the review mapped to specific ICD-10 descriptors explicitly named for each disorder. Around half of the conditions (11 of 23) had a dedicated ICD-10 code and the remaining 12 were classified into broader diagnostic groups (Supplementary Table 4).

We then examined how rare genetic diseases were defined. Most studies used simple case definitions based on the presence of at least one routinely recorded diagnostic code (Supplementary Table 3). More complex case definitions incorporated additional criteria such as prescriptions (e.g., corticosteroids for congenital adrenal hyperplasia [45]), demographic restrictions (e.g., restricting the Duchenne muscular dystrophy cohort to males aged under 50 years [21]) and exclusion rules (e.g., hereditary von Willebrand disease presumed in the absence of diagnostic records for conditions associated with acquired disease [32, 49]). Other studies used multiple criteria or algorithm-based definitions [39, 64]. Event-based criteria were used to define recurrent vaso-occlusive crises in sickle cell disease [54] and transfusion-dependent beta-thalassaemia [55]. Some studies used sensitivity analyses or validation exercises to assess the robustness of case definitions. One study assessed the impact of applying a broader case definition for hereditary haemorrhagic telangiectasia [36]. Another tested an alternative definition for Duchenne muscular dystrophy that required at least two diagnostic records or an ICD-10 code in HES APC [21]. A further study validated Huntington’s disease diagnoses by reviewing free-text entries to exclude misclassification of unaffected individuals with a family history [61]. This approach is no longer feasible under current UK GDPR restrictions [6, 7]. Only one study undertook external validation, against the UK Cystic Fibrosis Registry, and found that combining diagnostic records from primary care and linked datasets improved sensitivity with minimal loss of specificity [57].

Where diagnostic codes were unavailable, broad, or infrequently used, adapted case definition strategies were applied. For example, X-linked hypophosphataemia was defined by combining broad skeletal phenotypic descriptors, biochemical results and prescriptions [39, 64]. Likelihood grading was independently undertaken by two national clinical experts in familial hypophosphataemia, with high inter-grader agreement. Similarly, probable Dravet syndrome was defined by a diagnostic record of epilepsy together with a prescription of stiripentol or potassium bromide [50]. A third example distinguished achondroplasia from hypochondroplasia, which shares ICD-10 code Q77.4, by classifying cases as achondroplasia when diagnosis occurred before two years of age or height was within the achondroplasia reference range [51].
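A rule-based definition of this kind can be expressed compactly. The sketch below encodes the probable Dravet syndrome rule as stated above (an epilepsy record together with a prescription of stiripentol or potassium bromide). It is a simplification under stated assumptions: plain text labels stand in for the diagnosis and medication code lists a real study would use, and the function name and example patients are hypothetical.

```python
# Simplified labels stand in for real diagnosis and medication code lists.
QUALIFYING_DRUGS = {"stiripentol", "potassium bromide"}

def is_probable_dravet(diagnoses, prescriptions):
    """Return True when an epilepsy record co-occurs with a qualifying drug."""
    has_epilepsy = "epilepsy" in diagnoses
    has_drug = any(drug in QUALIFYING_DRUGS for drug in prescriptions)
    return has_epilepsy and has_drug

# Hypothetical patients, for illustration only.
patients = {
    "A": ({"epilepsy"}, {"stiripentol"}),
    "B": ({"epilepsy"}, {"sodium valproate"}),
    "C": ({"asthma"}, {"potassium bromide"}),
}
flags = {pid: is_probable_dravet(dx, rx) for pid, (dx, rx) in patients.items()}
print(flags)  # {'A': True, 'B': False, 'C': False}
```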

Two studies suggested probable misclassification of carriers of autosomal recessive and X-linked conditions. Cystic fibrosis was found to have a bimodal distribution of age at diagnosis, with peaks in early childhood and at 30 years [47]. The second peak was attributed to carriers, likely misclassified following parental genetic testing or during family planning. Another study found that 324 females had a diagnostic record of Duchenne muscular dystrophy, far more than would be expected to have a classical phenotype [21]. This study also reported 12 males over the age of 50 with a record of Duchenne muscular dystrophy. These findings were considered clinically implausible, but comprised only 1.1% of the cohort [21]. Sensitivity analyses showed no material effect on the findings.

Overall, most studies used cohort designs and primary care diagnostic codes to delineate rare genetic disease study populations. Case definitions were adapted when coding was limited.

UK primary care EHR databases provide insights into the epidemiology, natural history and clinical management of rare genetic diseases

To illustrate the capabilities of UK primary care EHR databases, we synthesised key findings, implications and impact of studies (Supplementary Table 3). We provide three exemplars to demonstrate the breadth of insights achievable.

The first exemplar showcases the capacity to investigate population-level diagnostic patterns and phenotypic variation. Using THIN, one study estimated hereditary haemorrhagic telangiectasia prevalence annually by age, sex, geographical region and socioeconomic position [36]. The study reported higher UK prevalence than previously recognised and identified diagnostic disparities. There was marked female predominance despite the condition’s autosomal dominant inheritance [36]. These findings suggest sex-modified phenotypic expression and are consistent with international liver transplant registry reports, in which most hereditary haemorrhagic telangiectasia-related liver transplant recipients are female [65]. Registry studies also report that females have more severe hepatic and pulmonary arteriovenous malformations and undergo more invasive procedures [65].

The second exemplar illustrates the potential for multisystem phenotyping in ultra-rare diseases. A CPRD study reported premature mortality in X-linked hypophosphataemia compared with matched comparators [64]. A follow-up study examined 273 resource-intensive comorbidities across 15 disease categories and found a higher prevalence of endocrine and neurological disorders [39]. Four individual comorbidities occurred at least twice as often, including depression, which remained significant after multiple testing correction. These findings extend the recognised phenotype of X-linked hypophosphataemia beyond its classical skeletal manifestations.

The third exemplar illustrates how nationally representative rare disease cohorts can quantify risk and support evaluation of risk-modifying treatments. Using CPRD, a matched cohort study of 1061 individuals with myotonic dystrophy type 1 reported a five-fold increased risk of basal cell carcinoma versus 15,119 matched comparators [44]. Non-melanoma skin cancer is not typically recorded in cancer registries; therefore, this analysis would be challenging to replicate using alternative datasets. Additional studies investigating myotonic dystrophy type 1 identified increased risks of benign [63] and malignant tumours [46], with evidence that age at diagnosis of myotonic dystrophy type 1 appears to modify cancer susceptibility. A further study suggested that metformin may attenuate cancer risk in individuals with myotonic dystrophy type 1 who also had type 2 diabetes mellitus [48].

Finally, we examined the use of equity-related variables and PPIE in studies of rare genetic diseases using UK primary care EHR databases. Area-level deprivation appeared in 19 studies [18, 25, 36, 37, 38, 39, 44, 46, 48, 52, 53, 54, 55, 56, 58, 59, 60, 61, 62]. By contrast, ethnicity data were reported in only seven studies [35, 47, 52, 54, 55, 56, 60]. No studies reported PPIE activities in their publications.

Discussion

To our knowledge, this is the first detailed examination of how UK primary care EHR databases have been used to study rare genetic diseases. Nonetheless, some limitations should be acknowledged. Eligibility was restricted to peer-reviewed publications explicitly reporting the use of five named UK primary care EHR databases. Consequently, research using other data sources, including newer national primary care EHR data initiatives (e.g., OpenSAFELY), regional datasets and non-UK data resources, fell outside the scope of this review. Study identification relied on database bibliographies and indexing practices, which may have resulted in some relevant studies being missed. Restricting included outputs to peer-reviewed articles published in academic journals may have overlooked reports produced by pharmaceutical companies around drug development and regulatory approval. Finally, because the review was designed specifically to map published studies of rare genetic diseases, the findings may not be generalisable to rare diseases with non-genetic aetiologies.

Notwithstanding these limitations, our review identified several important findings. We show that despite their demonstrated capacity, versatility, scale and population representativeness, UK primary care EHR databases are markedly underutilised for rare genetic diseases (Fig. 1). The low volume of research is particularly striking when contrasted with the broader research activity of these databases [15]. A scientometric analysis of CPRD, THIN and QResearch from 1995 to 2015 found that each of the top 30 conditions accounted for at least 3% of total research outputs [15]. In contrast, only seven rare genetic disease studies were published over the same 20-year period [26, 27, 28, 33, 34, 36, 37], representing less than 0.4% of outputs and a 32-fold disparity relative to diabetes mellitus publications [15]. This imbalance is notable given that the annual economic cost to society of 373 rare diseases, estimated in a USA-based study, was US$2.2 trillion, compared with US$3.4 trillion for common diseases such as diabetes mellitus, cardiovascular disease and cancer [3].

Cystic fibrosis, myotonic dystrophy type 1 and Huntington’s disease together accounted for more than half of the studies in our review. Their prominence likely reflects a combination of factors, including diagnostic visibility, clinical familiarity, longstanding recognition in medical practice and availability of codes in routinely collected health data. Reuse of existing codelists and repeated outputs from the same research groups [44, 46, 48, 63] also contributed to condition recurrence in the literature. Our findings align with the NIHR Rare Diseases Research Landscape Report [66], which described a skewed distribution of rare disease research activity during 2016 to 2021, in which a small number of rare conditions accounted for a large share of research and most had no visible research [66]. UK primary care databases were largely overlooked in this report. One illustrative example was Mendelian’s MendelScan, an artificial intelligence case-finding platform for rare diseases using data from approximately 50 NHS primary care practices in England [66]. At the time of the report, this valuable initiative was substantially smaller in scale than the databases described in our review (Supplementary Table 1) but illustrates an additional application not otherwise captured in our review.

Findings from the IRDiRC State of Play Report 2019-2021 were also consistent [67], with 35% of conditions in our review (8 of 23) also appearing among the top 20 most researched non-neoplastic rare disorders worldwide. Rare neurological disorders accounted for the largest share of research globally (37%), mirroring the distribution observed in our review (Table 2). In the NIHR report, cystic fibrosis was given as an exemplar condition with high research activity attributed to its relatively high prevalence and biologically tractable therapeutic targets [66]. Consistent with this, around half of the conditions in our review had established treatments (Supplementary Table 4). Among the 36 studies with a primary focus on a rare genetic disease, pharmaceutical companies were the sole funders of 13 studies spanning nine conditions (Supplementary Table 2). This supports the hypothesis that therapeutic tractability is a key factor shaping current research activity. Extending the use of UK primary care EHR databases to more rare diseases is likely to require prioritisation and coordinated investment.

A major strength of UK primary care EHR databases is that they allow rare disease cohorts to be defined within a representative population-based setting and followed up longitudinally. Delineating comparator cohorts drawn from the same source population is another key strength. This was reflected in our review by the predominance of matched cohort designs (Supplementary Table 3). For rarer outcomes, alternative study designs may be more suitable; for example, death by suicide in Huntington’s disease was examined using a case-control design with risk-set sampling [20], as a matched cohort design would have been underpowered. Databases are most informative when conditions and outcome phenotypes can be identified with confidence in primary care and linked datasets. All conditions in our review had Orphanet prevalence estimates ranging from 1 in 2000 to 1 in 100,000 (Supplementary Table 5), suggesting conditions in this prevalence range may be feasible to study. Coding specificity varies considerably between conditions and case definitions may be strengthened using corroborative evidence for condition-specific features [51, 64], diagnostic confirmation in linked hospital data [21] and, where possible, external validation [57]. Sensitivity analyses can assess the robustness of findings to alternative case definitions [21, 33, 36]. Databases are currently less informative for questions requiring deep phenotyping or outcomes poorly captured in routinely collected health data. By contrast, they support research questions pertaining to rare disease prevalence and incidence, comorbidities, prescribed medications, health economics and mortality (Table 1). Their effective use requires expertise in epidemiology, statistical analyses, coding frameworks, clinical interpretation and familiarity with NHS healthcare delivery. Approval processes, access requirements and timescales vary across databases [6] and may present practical barriers to their wider use in rare disease research.
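The risk-set sampling mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of the cited study [20]: the record fields (`id`, `entry`, `exit`, `event`) and the function name `risk_set_sample` are hypothetical, follow-up times are on a single shared time scale, and ties are handled naively. For each case, controls are drawn from individuals still under follow-up at the case's event time, so a person who later becomes a case may serve as a control for an earlier one.

```python
import random


def risk_set_sample(cohort, n_controls, seed=0):
    """Sample controls for each case from its risk set.

    cohort: list of dicts with hypothetical keys
        id    -- unique identifier
        entry -- start of follow-up
        exit  -- end of follow-up (event time for cases)
        event -- True if follow-up ended with the outcome
    Returns one dict per case, ordered by event time.
    """
    rng = random.Random(seed)
    cases = sorted((p for p in cohort if p["event"]), key=lambda p: p["exit"])
    matched_sets = []
    for case in cases:
        t = case["exit"]
        # Risk set: everyone else under follow-up at time t,
        # including people who become cases later.
        risk_set = [
            p for p in cohort
            if p["id"] != case["id"] and p["entry"] <= t <= p["exit"]
        ]
        k = min(n_controls, len(risk_set))
        controls = rng.sample(risk_set, k)
        matched_sets.append(
            {"case": case["id"], "time": t, "controls": [c["id"] for c in controls]}
        )
    return matched_sets
```

In practice, sampling would usually be restricted further by matching factors such as age, sex and general practice, and the resulting matched sets analysed with conditional logistic regression.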

All studies in the review with a primary focus on a rare genetic disease examined a single condition (Supplementary Table 3). In contrast, other studies concurrently investigated up to 308 conditions [47], demonstrating the technical feasibility of investigating multiple rare diseases. The first attempt to enumerate all rare diseases identifiable in UK population health datasets was published in February 2025 [68]. This study used the General Practice Extraction Service Extract for Pandemic Planning and Research (GDPPR, data from 98% of NHS primary care practices in England) and NHS hospital data to estimate prevalence and COVID-19-related mortality risk for all identifiable rare diseases [68]. Rare diseases were defined as Orphanet entities that mapped to ICD-10 or SNOMED CT codes with high specificity. Using this approach, 331 rare diseases were identified [68]. Eight non-mutually exclusive categories derived from Orphanet’s classification tags informed subgroup analysis. The genetic disease group was heterogeneous, encompassing high-penetrance Mendelian genetic disorders (e.g., Smith-Magenis syndrome), clinical syndromes with variable genetic contribution (e.g., Lennox-Gastaut syndrome) and several predominantly non-genetic entities and descriptive diagnoses (e.g., interatrial communication (i.e., atrial septal defect), congenital laryngomalacia and isolated plagiocephaly) [68]. The infrastructure to scale rare genetic disease research using routinely collected UK population health data exists, but case definitions, disease groupings and interpretation require careful clinical consideration.

Only seven of the 23 conditions from our review were identified among the 331 rare diseases in the GDPPR-linked study [68]. Achondroplasia was designated as having a highly specific ICD-10 code (Q77.4); however, this code is shared with hypochondroplasia [51]. This limitation was addressed in a CPRD study by refining the eligibility criteria using age at diagnosis and height data [51]. Other studies in our review adapted case definitions where diagnostic codes were unavailable, inconsistently applied, or ambiguous [39, 51, 64]. The flexibility to refine case definitions using clinical logic is a major strength of UK primary care research. Case definitions require bespoke curation using established codelist development methodologies, including systematic searching of data dictionaries with clinician oversight. Scaling this approach to multiple rare diseases is challenging. Using Orphanet mappings to clinical terminologies is pragmatic and, as Orphadata is updated annually, these mappings will expand over time. However, their application and interpretation should be informed by clinicians working in genomic medicine.
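The first step of codelist development described above, systematic searching of a data dictionary, can be sketched as a keyword search that produces candidate codes for clinician review. This is a minimal illustration under stated assumptions: the function name `search_dictionary`, the dictionary layout (code mapped to description) and the codes in the usage example are hypothetical and do not correspond to any real terminology release.

```python
import re


def search_dictionary(dictionary, include_terms, exclude_terms=()):
    """Return candidate (code, description) pairs for clinician review.

    dictionary: mapping of code -> free-text description (hypothetical layout).
    A description is kept if it contains any include term and no exclude term.
    Matching is case-insensitive and on substrings.
    """
    include = [re.compile(re.escape(t), re.IGNORECASE) for t in include_terms]
    exclude = [re.compile(re.escape(t), re.IGNORECASE) for t in exclude_terms]
    candidates = []
    for code, description in dictionary.items():
        if not any(p.search(description) for p in include):
            continue
        if any(p.search(description) for p in exclude):
            continue
        candidates.append((code, description))
    return sorted(candidates)
```

For example, with a toy dictionary containing the hypothetical entries `HYP001` "Achondroplasia", `HYP002` "Hypochondroplasia", `HYP003` "Pseudoachondroplasia" and `HYP004` "Family history of achondroplasia", searching for "achondroplasia" alone also returns the pseudoachondroplasia and family history entries. Adding exclusion terms removes them, but the final list still requires clinical review, since a keyword search cannot detect codes shared between conditions.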

Improving rare genetic disease research using primary care data depends largely on the suitability of clinical vocabularies. SNOMED CT is the core terminology used in UK general practice, but representation of rare diseases is limited [68]. NHS hospital administrative datasets use ICD-10 codes, which also lack precision for most rare diseases. The planned transition to ICD-11 introduces closer alignment with Orphanet. Embedding ORPHAcodes in existing clinical information systems and national administrative datasets could improve rare disease specificity and interoperability. Many rare genetic diseases are diagnosed in hospital outpatient settings; however, diagnoses are recorded for only a minority of outpatient appointments in routine datasets. Mandating the recording of outpatient diagnoses would strengthen case ascertainment. A critical next step is the integration of genomic data. None of the studies in our review linked primary care records to genomic data, reflecting the absence of genomic data from the standard linkage schemes available in UK primary care databases. Feasibility has nevertheless been shown in SAIL Databank, where primary care records were linked to whole exome sequencing data to investigate epilepsy outcomes [69]. Establishing a secure linkage between primary care records and data from NHS genomic medicine services would enable cohort validation and genotype-phenotype studies.

Historically, UK primary care databases relied on networks of contributing practices. SAIL Databank was the only database in our review designed to capture a national population, with primary care records available for around 90% of Wales (Supplementary Table 1). The COVID-19 pandemic prompted the development of whole population analytic platforms in England, including GDPPR, CVD-COVID-UK/COVID-IMPACT and OpenSAFELY. Although initially focused on COVID-19, their remit is expanding, creating opportunities for rare disease research [68]. In April 2025, the UK Government announced a £600 million partnership to establish the Health Data Research Service as a single secure access point to linked NHS data for approved research. On a European level, primary care data from CPRD contributes to the Data Analysis and Real World Interrogation Network (DARWIN EU®), coordinated by the European Medicines Agency. THIN also includes UK primary care data alongside health data from other European countries [6]. The European Health Data Space (EHDS) is a newly adopted regulation intended to establish a common framework for health data reuse in research and policy.

Federated approaches, as illustrated by the EUROlinkCAT study [24], enable secure local analyses with aggregation of results across partners. In Wales, the EUROlinkCAT cohort was derived from CARIS and linked primary care prescribing data through SAIL Databank [24]. In England, comparable registry infrastructure is provided by the National Disease Registration Service (NDRS), which manages the National Congenital Anomaly and Rare Disease Registration Service (NCARDRS). Rare disease representation within NCARDRS is limited and primary care linkage is not currently available. NDRS strategic priorities include expanding rare disease registration and developing algorithms to identify rare diseases in routinely collected datasets. Consistent with this, the Registration of Complex Rare Diseases—Exemplars in Rheumatology (RECORDER) project validated HES ICD-10 coded rare autoimmune diseases against clinical records [70].
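The general pattern of a federated analysis, in which each partner analyses its own data and shares only aggregate results, can be sketched as follows. This is a minimal illustration under stated assumptions and not a description of the EUROlinkCAT pipeline: the function names, the `has_condition` field and the choice of prevalence as the pooled measure are all hypothetical.

```python
def local_summary(records):
    """Run at each site: reduce patient-level records to aggregate counts.

    records: list of dicts with a hypothetical boolean key 'has_condition'.
    Only the returned counts leave the site.
    """
    return {
        "cases": sum(1 for r in records if r["has_condition"]),
        "population": len(records),
    }


def pooled_prevalence(summaries, per=100_000):
    """Run centrally: combine site-level counts into a pooled prevalence."""
    cases = sum(s["cases"] for s in summaries)
    population = sum(s["population"] for s in summaries)
    if population == 0:
        raise ValueError("no population contributed by any site")
    return cases * per / population
```

A real deployment would add safeguards that this sketch omits, such as small-number suppression before counts are released and a common data model so that case definitions are applied consistently across sites.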

Progress in this field must align with principles of equity and patient and public partnership. The absence of ethnicity reporting in most studies likely reflects historically incomplete recording in primary care. Ethnicity recording has improved and consolidating information from multiple linked data sources increases completeness [13]. Socioeconomic measures available within UK primary care databases are informative, but area-level measurements have recognised limitations. Primary care EHR databases can support the identification of disparities in rare disease diagnoses, treatment and outcomes, contributing to the UK Rare Diseases Framework’s vision to improve the quality and availability of care and address health inequalities. No studies in the review reported patient or public involvement, highlighting an important gap in current research practice. Embedding PPIE throughout study design, implementation and dissemination will ensure research priorities reflect the experiences of individuals and families living with rare diseases.

In conclusion, UK primary care EHR databases provide routinely collected, population-based, longitudinal data that are linkable to national healthcare and mortality datasets. Their scale, scope and population representativeness support natural history studies for rare genetic diseases. Despite this, they are markedly underutilised. For many conditions, the limited availability of diagnostic codes in routinely collected health data is a major constraint. Strengthening clinical coding vocabularies, expanding data initiatives to achieve whole population coverage, establishing standard data linkage with NHS genomic medicine services, enabling federated analysis and embedding patient partnerships will be key to unlocking their full potential.