Introduction

Cancer is a significant health challenge globally, ranking as the second leading cause of death, years of life lost, and disability-adjusted life years (DALYs) according to the 2019 Global Burden of Disease study [1]. In Ireland, the burden of cancer is particularly high, with the European Commission identifying it as having the highest cancer incidence in the European Union in 2020. The National Cancer Registry Ireland (NCRI) estimated an annual average of 44,000 invasive cancer cases diagnosed between 2020 and 2022 [2,3,4]. This high incidence underlines the urgent need for effective cancer prevention and management strategies.

General Practitioners (GPs) are integral to the cancer care continuum, providing services that span prevention, early detection, treatment, survivorship, and end-of-life care. As the first point of contact for many patients, GPs influence patient outcomes through timely diagnosis and coordinated care [5]. The analysis of routinely collected healthcare data in primary care can identify risk factors for various cancer types, leading to more effective prevention strategies and reduced diagnostic intervals [6]. However, a significant gap exists in the integration and use of primary care data in cancer research. This study aims to help address this gap by cataloguing and evaluating Irish health data resources relevant to primary care cancer research.

Health data in cancer research

Healthcare data is fundamental for evaluating and improving cancer care. Reliable and detailed data allow researchers to understand disease patterns, identify risk factors, and evaluate the effectiveness of interventions [7].

In Ireland, cancer research uses various health data sources, including registry data from the NCRI, biobanks, health surveys, audits, and screening datasets from HSE national programmes. While registry data offer patient-level insights into cancer trends, the lack of primary care data limits a comprehensive view of the patient journey [8]. Biobanks support personalised medicine [9, 10], and non-individual-level data from audits such as the National Audit of Hospital Mortality and the Irish Paediatric Critical Care Audit help assess healthcare quality [11, 12]. Additionally, socio-economic and environmental factors influencing cancer are explored through datasets from the Environmental Protection Agency (EPA) and large social studies such as The Irish Longitudinal Study on Ageing (TILDA) and the Survey of Lifestyle, Attitudes and Nutrition (SLÁN) [13,14,15,16].

Despite the wealth of data, challenges remain in collecting cancer-relevant healthcare data in Ireland, as 85% of hospital-based healthcare records are paper-based. This underlines the need for improved data infrastructure [17, 18].

International context and best practices

Internationally, there are multiple examples of how large analytical datasets enhance cancer research. In the UK, resources like the UK Biobank, Clinical Practice Research Datalink (CPRD), and OpenSAFELY contain linked healthcare records for millions of individuals, facilitating extensive research into cancer patterns and outcomes [19]. The USA’s Veterans Affairs Corporate Data Warehouse has been used to study cancer screening patterns and treatment efficacy, providing valuable insights into effective cancer care strategies [20,21,22,23]. The European Cancer Observatory combines national registry information from across Europe on cancer incidence, mortality, survival, and prevalence, offering a comprehensive resource for comparative research [24].

Large developing biobank initiatives, such as the UK’s Our Future Health project, aim to advance early diagnostic technology and preventive interventions by inviting citizens to provide biological samples and health information [25]. In Australia, the MedicineInsight database, a large-scale primary care database of longitudinal de-identified electronic health records (EHRs), has proven useful in tracking diagnosis rates for conditions such as melanoma skin cancer during the COVID-19 pandemic [26, 27]. In the Netherlands, the Julius General Practitioners’ Network (JGPN) database supports a wide range of research, including aetiological, diagnostic, and intervention studies [28, 29].

In Ireland, there is a need to define the existing cancer-relevant datasets so work can begin to optimise the use of this data in primary care cancer research.

Aims and objectives

This study aims to systematically catalogue and evaluate health data resources relevant to primary care cancer research in Ireland. The specific objectives are to:

  1. Identify and characterise cancer-relevant health data sources in Ireland, focusing on primary care.

  2. Appraise the accessibility of these data sources and identify gaps in data availability.

  3. Recommend strategies for optimal utilisation of these data sources in future primary care cancer research.

Methods

This study employs a systematic approach to map Irish health data resources relevant to primary care cancer research. Our methods include a literature review and an expert roundtable discussion.

Literature review

We identified relevant literature focusing on English language articles from journals and book chapters, sourced primarily from the PubMed database. The selection was based on a structured template tailored to our study, with no constraints on the publication date of the articles. This process involved all team members, who contributed to refining the literature list for manuscript development. The formal literature review was supplemented by a grey literature review (Fig. 1).

Fig. 1 Search strategy.

Eligibility criteria

Research was included if it pertained to databases created in the Republic of Ireland that are relevant to primary care cancer research and broadly applicable to multiple research questions.

Databases not available in English were excluded. Datasets including data from Northern Ireland were omitted due to contextual and legislative differences. A list of identified Northern Irish datasets is included in Supplementary Appendix Item 4.

Our search strategy, designed in collaboration with a medical librarian, is detailed in Supplementary Appendix Item 1.

Data extraction

We extracted information using a predetermined template (Supplementary Appendix Item 3) documenting characteristics like dataset types, collection methods, geographical coverage, accessibility and size. The extraction process was conducted through shared tables in Google Sheets.

Sampling and screening

Initial screening of references was performed using EndNote software, with a systematic dual Cochrane rapid review screening process applied [30] (Supplementary Appendix Item 1). Our strategy for full-text article retrieval and the independent review of these articles helped maintain a level of consistency in our selection process. All discrepancies in inclusion decisions were resolved through consensus agreement among the reviewers.

Expert roundtable discussion and feedback

After completing a preliminary analysis of potential data sources through our search strategy and literature review, we organised a roundtable discussion with key stakeholders and experts. Experts were initially contacted based on their involvement and expertise in health data cancer research. We then used a snowball sampling strategy: these experts recommended additional informants, thereby expanding the resource pool for our study.

This group included two researchers and policymakers in Irish cancer research, one clinician, and one patient and public involvement (PPI) representative. Emphasis was placed on the importance of data protection and patient confidentiality, ensuring ethical use of the data sources.

The roundtable was conducted via videoconference. Key points of discussion included the relevance of identified data sources for primary care cancer research, identifying challenges or limitations in their usage, and considering potential additions to our list.

In addition to the roundtable, the identified databases were circulated via email to a broader group of experts in primary care cancer research. Of the ten experts contacted, five responded. These experts were invited to provide feedback on the completeness and relevance of the listed data sources. Responses were collected, consolidated, and integrated into the final dataset catalogue.

Data synthesis

The identified datasets from the literature review and expert consultations were compiled into a catalogue. This process included categorising datasets based on data type, thematic relevance, and accessibility. The synthesis also involved cross-referencing findings from the expert roundtable to validate the completeness and relevance of the identified data sources, ensuring a comprehensive representation of Irish health data relevant to primary care cancer research.

Datasets were assigned codes that denote the type of data within each dataset. The codes are (i) Indiv/Cancer; (ii) Indiv/Health; (iii) Indiv/Social; (iv) Indiv/Cancer/Bio; (v) Nonin/Audit; (vi) Nonin/Census; (vii) Nonin/Environmental; (viii) Indiv/Bio.

The datasets were first divided into individual-level and non-individual-level patient data. Individual-level datasets were further subdivided into cancer-specific data, biobank data, health data that are not solely cancer-related, and social data. Social data are primarily demographic or social science data that may also contain relevant health data. Non-individual-level data are aggregate data, further subdivided into data from audits or health reporting, national census data, and environmental data (which contain no patient data at all).

Furthermore, all datasets have been assigned themes describing the nature of their content in a more general way: (i) Disease registry data; (ii) Screening programme data; (iii) Small-scale/institutional data, which are hosted at one or two institutions with defined geographical or population boundaries relating to the jurisdiction of the institution, rather than being national or multi-institutional; (iv) Mixed data, which are datasets that contain both individual cancer-specific patient data and biobank data; (v) Historic data, which are datasets that are no longer active and have been integrated into other, live datasets; (vi) Hospital data, which are derived from hospital administrative records on hospital discharges; (vii) Cancer-specific biological data, which are biobank datasets specific to cancer; (viii) Biobank data, which are biological datasets that are not specific to cancer; (ix) Census data from the national census; (x) Non-clinical social science data; (xi) Environmental data; (xii) Health system audits and reporting; (xiii) Mortality data, which are datasets registering national mortality; (xiv) Miscellaneous datasets, which do not fit easily into a given category.
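The two-part classification described in this section (a data-type code plus a theme) could be represented as a simple record structure. The following is a minimal sketch only: the class name `CatalogueEntry`, its fields, and the code and theme assigned to each example dataset are illustrative assumptions, not the study's actual extraction tool or catalogue entries.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueEntry:
    """One catalogued dataset (hypothetical structure for illustration)."""
    name: str
    code: str   # data-type code, e.g. "Indiv/Cancer"
    theme: str  # thematic label, e.g. "Disease registry data"

    @property
    def individual_level(self) -> bool:
        # Codes beginning "Indiv" mark individual-level data;
        # codes beginning "Nonin" mark non-individual (aggregate) data.
        prefix = self.code.split("/")[0]
        if prefix not in ("Indiv", "Nonin"):
            raise ValueError(f"unrecognised code prefix: {self.code!r}")
        return prefix == "Indiv"


# Illustrative entries only; the assigned codes and themes are assumptions.
entries = [
    CatalogueEntry("NCRI", "Indiv/Cancer", "Disease registry data"),
    CatalogueEntry("BowelScreen", "Indiv/Cancer", "Screening programme data"),
    CatalogueEntry("NAHM", "Nonin/Audit", "Health system audits and reporting"),
]

# Tally datasets by level, as one might when summarising the catalogue.
level_counts = Counter(
    "individual" if e.individual_level else "non-individual" for e in entries
)
```

Deriving the individual/non-individual split from the code prefix, rather than storing it separately, keeps the two attributes from drifting out of step.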

Results

Literature review and data source identification

A systematic search identified 6789 unique citations. Following screening, 274 full-text articles were reviewed, of which 32 met the inclusion and exclusion criteria. The grey literature review supplemented this with five relevant datasets. The expert roundtable discussions facilitated the identification of two additional datasets not captured in the literature review. This process resulted in a total of thirty-nine datasets.

In addition to the thirty-nine datasets described, three previously existing dataset repositories emerged as highly relevant to the primary care setting. A dataset is a single collection of related data, while a dataset repository is a collection that stores and organizes multiple datasets. The three identified repositories are: 1) Health Atlas Ireland contains over 1.7 million records covering areas such as demography, hospital activity, and mortality. Although this breadth of data is comprehensive, its restricted access limits usability for external researchers. 2) The Joint Irish Nutrigenomics Organisation (JINGO) datasets offer lifestyle and nutritional information on 7000 participants through the TUDA study, NANS, and MECHE, supporting research into the role of lifestyle factors and genomics in cancer outcomes. 3) The Irish Social Science Data Archive (ISSDA) hosts various datasets, including the All Ireland Traveller Health Survey (AITHS) and The Irish Longitudinal Study on Ageing (TILDA), which contribute demographic and social data applicable to primary care research (Table 1).

Table 1 ISSDA primary care cancer research relevant data sources

Data source themes and classification

The datasets were grouped based on their data type (individual-level vs non-individual-level) and thematic relevance (cancer-specific, biobank, health-related, social) (Fig. 2). Individual-level datasets are particularly notable, as they contain detailed clinical data at the patient level. The National Cancer Registry Ireland (NCRI), with over 600,000 records, is one of the most comprehensive sources, documenting cancer incidence, prevalence, and survival since 1991. Other individual-level datasets include national screening programmes such as BowelScreen, CervicalCheck, and BreastCheck, which provide large-scale data on cancer screening and early detection.

Fig. 2 Dataset hierarchy in direct relevance to primary care cancer research.

Thirty-three of the thirty-nine datasets identified were non-individual datasets. Non-individual-level datasets, such as the National Audit of Hospital Mortality (NAHM) and the Irish Paediatric Critical Care Audit (IPCCA), provide system-wide insights into healthcare delivery. These datasets focus on healthcare quality and outcomes, contributing to a broader understanding of the performance of the healthcare system in cancer care.

Table 2 provides a full summary of the identified datasets, detailing their data type, thematic focus, record numbers, controlling organisations, temporal coverage, and accessibility status. The median size of all datasets listed in the table is 9832 records. The “Number of records” column gives the number of records according to the most recent information available at the time the research was conducted. N/A denotes information that was not openly accessible during the research process. Not all dataset characteristics originally included in the extraction tool feature in the table because, for some characteristics, the available data were insufficient to support meaningful analysis or accurate reporting.
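As an illustration of how a median dataset size can be computed when some record counts are marked N/A, the following minimal sketch skips unknown values before taking the median. The source does not state how N/A entries were handled, so excluding them is an assumption, and the record counts below are hypothetical rather than taken from Table 2.

```python
from statistics import median


def median_records(record_counts):
    """Median of known record counts, ignoring entries marked N/A (None)."""
    known = [n for n in record_counts if n is not None]
    if not known:
        raise ValueError("no known record counts")
    return median(known)


# Hypothetical record counts; None stands for "N/A".
example_counts = [1200, None, 600000, 9832, None, 45000]
example_median = median_records(example_counts)
```

Because dataset sizes in a catalogue like this can range from hundreds to hundreds of thousands of records, the median is less sensitive to a few very large registries than the mean would be.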

Table 2 Full list of identified datasets

Dataset accessibility

Of the thirty-nine datasets catalogued, six are publicly accessible. These datasets are valuable resources for researchers, as they offer open access to national and environmental data. Access to the remaining datasets must be specifically granted and is restricted to approved researchers.

Cancer type-specific data sources

The categorisation of datasets by cancer type reveals a range of data availability for specific cancers (Table 3). Breast cancer is particularly well-represented across various datasets, including the BreastCheck national screening programme and multiple biobanks, such as the Breast Cancer Ireland Biobank and BREAST PREDICT. These datasets cover various aspects of breast cancer care, from early detection to treatment outcomes. Similarly, IPCOR and BowelScreen provide data on prostate and bowel cancers, respectively.

Table 3 Datasets based on cancer body system

However, data availability for other cancers, such as pancreatic and neurological cancers, is more limited. Relevant to these cancer types are The National Pancreas Transplant Programme and the Paediatric Neuroblastoma Biobank.

Dataset utility across the cancer care continuum

The datasets have varying utility across different stages of the cancer care continuum. Individual-level cancer data provides the most direct applicability to primary care research, supporting analyses of patient pathways from screening and diagnosis to treatment and survivorship. Thematic mapping of datasets (Fig. 3) shows how they align with different stages of the cancer continuum.

Fig. 3 Dataset utility along the cancer continuum of care.

Discussion

This study catalogued thirty-nine Irish health datasets relevant to primary care cancer research, providing a resource for researchers to understand the scope, accessibility, and thematic focus of available data.

Of note, only one dataset was identified that exclusively utilises primary care data. This was the PCRS (Primary Care Reimbursement Service), a non-individual-level dataset that tracks the financial reimbursements made to primary care healthcare providers. This highlights a significant gap in the health data landscape, which limits the potential of future cancer research in the primary care space. Without primary care-specific datasets, it becomes challenging to evaluate diagnostic pathways, referral patterns, and the role of general practitioners in early cancer detection, and therefore to understand how primary care contributes to timely cancer diagnosis and management.

Key findings highlight the availability of large-scale cancer registries such as the National Cancer Registry Ireland (NCRI), which provides comprehensive data on cancer incidence, prevalence, and survival. Screening datasets like CervicalCheck and BowelScreen contribute valuable insights into population-level screening uptake and diagnostic pathways. The thematic analysis shows a strong presence of data sources for breast, prostate, and bowel cancer research, while less common cancers, such as pancreatic or neurological cancers, are less comprehensively represented.

Context of existing research

The comprehensive mapping of Irish health datasets reveals a potential for advancing primary care cancer research, in line with international trends. Dataset linkage enables the combination of patient data from diverse sources, such as primary care records, hospital data, and socioeconomic information, providing a holistic view of patient care pathways and facilitating better-informed research and policy [31,32,33]. Dataset linkage refers to ‘the bringing together from two or more different sources, data that relate to the same individual, family, place or event’ [34]. Examples of this exist in Canada, where linked health and socioeconomic datasets have been utilised to develop a range of microsimulation models with a view to analysing policy and health management system resilience in the face of unexpected events [35].
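The definition of dataset linkage quoted above can be made concrete with a minimal sketch of deterministic linkage, in which records from two sources are joined when they share the same identifier. The function name `link_records`, the field names, and the example records are all hypothetical; real linkage projects often lack a shared identifier and rely on probabilistic matching and formal governance, which this sketch does not cover.

```python
def link_records(primary_care, hospital, key="patient_id"):
    """Deterministic linkage: join records from two sources on a shared key."""
    # Index the hospital records by identifier for fast lookup.
    hospital_by_key = {}
    for rec in hospital:
        hospital_by_key.setdefault(rec[key], []).append(rec)

    linked = []
    for rec in primary_care:
        # Each primary care record is merged with every matching hospital record.
        for match in hospital_by_key.get(rec[key], []):
            linked.append({**rec, **match})
    return linked


# Hypothetical, pseudonymised example records.
primary_care = [
    {"patient_id": "A1", "gp_visit": "2021-03-01"},
    {"patient_id": "B2", "gp_visit": "2021-04-10"},
]
hospital = [
    {"patient_id": "A1", "diagnosis_date": "2021-03-20"},
]

linked = link_records(primary_care, hospital)
```

In this example only patient A1 appears in both sources, so the linked output holds a single record combining the GP visit and the hospital diagnosis date, the kind of pairing needed to study diagnostic intervals.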

In Australia, the utilisation of data linkage platforms has informed a wide range of cancer research and policy development. Victorian Comprehensive Cancer Centre Data Connect is a single platform for facilitating access to a range of Victorian health data sources [36]. Data Connect consists of linked hospital and primary care data. Uses for Data Connect to date include the analysis of diagnostic intervals for lung and colorectal cancers [10, 37] and improving primary care diagnostic tools in the early detection of cancer [36].

Large datasets such as the CPRD and Data Connect have paved the way for evidence-based practices and targeted interventions by facilitating research into risk profiling, new models of care delivery, and innovative interventions. Research into ethnic differences in hypertension management utilising these datasets shows the potential for their integration to provide more tailored patient-centred care [38,39,40]. Diagnostic tools such as electronic clinical decision support tools (eCDSTs), developed and validated through these large datasets, have shown potential to improve decision making related to cancer diagnosis and to reduce time to diagnosis [41, 42].

The Irish healthcare context brings unique considerations for the application of these findings. Primary care is often the first point of contact for cancer patients, making it critical to have access to real-time, comprehensive data. Data from screening programmes such as CervicalCheck and BowelScreen are highly relevant for understanding population-level screening uptake and outcomes. However, the limited open access to cancer-specific datasets, along with fragmented data governance policies, creates barriers to research and limits the ability to address research gaps effectively. There is a clear opportunity to enhance data-sharing frameworks, which would align Irish practices with existing international data-sharing frameworks and maximise the utility of existing datasets.

Limitations

This study faces several methodological limitations. One significant limitation relates to dataset accessibility. While we aimed to include a wide range of data sources, the accessibility and availability of these datasets varied, potentially leading to an incomplete representation of all relevant health data resources. There is the potential for selection bias in the identification of data sources, as our search strategies and inclusion criteria may have inadvertently favoured certain types of datasets. Additionally, while we sought to include a wide range of experts and stakeholders in our consultations, the scope of our expert input may have been limited by logistical and time constraints, potentially impacting the breadth of insights gathered.

Implications for policy, research, and practice

The findings have implications for policy, research, and clinical practice in Ireland. From a policy perspective, the development of an open-access data catalogue as a resource for primary care cancer research aligns with broader goals to improve health data governance and accessibility. Policymakers could leverage this catalogue to prioritise efforts in data integration, establish clearer access protocols, and enhance the utility of data sources for informing national cancer care strategies.

For researchers, the catalogue serves as a centralised reference to identify existing datasets, reducing time spent on data search and allowing greater focus on analysis and application. It also highlights the importance of multidisciplinary collaborations, as linking diverse data types—such as clinical records, biobanks, and socioeconomic data—can produce comprehensive research insights. The study also suggests areas where data collection could be improved.

In clinical practice, the insights from this study could support the development of targeted interventions for specific patient groups based on their primary care pathways. The potential development of electronic clinical decision support tools (eCDSTs) based on integrated data, as evidenced in UK practices, could significantly improve diagnostic accuracy and timely referral for suspected cancers [40].

Recommendations for future research, practice, and policy

These recommendations are based directly on the findings from the expert consultation and data synthesis outlined in the Results section, and address the gaps in cancer-relevant health datasets identified during the dataset mapping process. Future research should focus on developing a user-friendly online platform to centralise access to these datasets, incorporating detailed metadata and access protocols to facilitate researchers’ work. There is a need to evaluate data quality indicators to understand the reliability and completeness of datasets for primary care cancer research applications. Monitoring the data using standardised metrics could enhance the reliability of research that draws on the repository. Efforts to streamline ethical approvals, data-sharing agreements, and compliance with GDPR will be crucial to enhance data accessibility while protecting patient privacy. Additionally, exploring data integration possibilities across primary and secondary care settings is essential to ensure that linked datasets can provide a full view of the patient care pathway. Expanding data collection efforts to improve the representation of less prevalent cancers, as well as incorporating social determinants of health, would enrich the catalogue’s relevance for primary care cancer research.
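One standardised metric of the kind recommended above is field completeness: the share of records with a non-missing value for each field. The following is a minimal sketch under stated assumptions. The function name `field_completeness`, the field names, and the example records are hypothetical, and treating `None` and empty strings as missing is an assumption rather than an established standard.

```python
def field_completeness(records, fields):
    """Share of records (0.0 to 1.0) with a non-missing value for each field."""
    if not records:
        raise ValueError("no records supplied")
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in fields
    }


# Hypothetical example records; None or "" marks a missing value.
example_records = [
    {"age": 64, "stage": "II"},
    {"age": 71, "stage": None},
    {"age": None, "stage": ""},
    {"age": 58, "stage": "I"},
]

completeness = field_completeness(example_records, ["age", "stage"])
```

Reporting such a metric per field and per dataset would let researchers judge in advance whether a dataset is complete enough for a given primary care cancer research question.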

Conclusion

This study provides a comprehensive overview of Irish health data resources, with a focus on their application in primary care cancer research. By characterising datasets according to their size, theme, coverage, and access conditions, it lays the groundwork for improved data integration and accessibility in Ireland. Implementing the recommendations put forth can guide future research directions, inform policy decisions, and support enhanced cancer care practices in primary care settings.