Nationwide study on development and validation of a risk prediction model for CIN3+ and cervical cancer in Estonia

Tisler, Anna; Võrk, Andres; Tammemägi, Martin; Ojavee, Sven Erik; Raag, Mait; Šavrova, Aleksandra; Nygård, Mari; Nygård, Jan F.; Stankunas, Mindaugas; Kivite-Urtane, Anda; Uusküla, Anneli

doi:10.1038/s41598-024-75697-3

Article
Open access
Published: 19 October 2024

Scientific Reports volume 14, Article number: 24589 (2024)

Abstract

Transitioning to an individualized risk-based approach can significantly enhance cervical cancer screening programs. We aimed to derive and internally validate a prediction model for assessing the risk of cervical intraepithelial neoplasia grade 3 or higher (CIN3+) and cancer in women eligible for screening. This retrospective study utilized data from the Estonian electronic health records, including 517,884 women from the health insurance database and linked health registries. We employed Cox proportional hazard regression, incorporating reproductive and medical history variables (14 covariates), and utilized the least absolute shrinkage and selection operator (LASSO) for variable selection. Internal validation of the model used 10-fold cross-validation. The main outcomes were discrimination and calibration performance. Over the 8-year follow-up, we identified 1326 women with cervical cancer and 5929 with CIN3+, with absolute risks of 0.3% and 1.1%, respectively. The prediction models for CIN3+ and cervical cancer had good discriminative power and were well calibrated: Harrell’s C was 0.74 (0.73–0.74) with a calibration slope of 1.00 (0.97–1.02) for CIN3+, and 0.67 (0.66–0.69) with a calibration slope of 0.92 (0.84–1.00) for cervical cancer. The developed model, based on nationwide electronic health data, showed potential utility for risk stratification to supplement screening efforts. This work was supported through grants number PRG2218 from the Estonian Research Council, and EMP416 from the EEA (European Economic Area) and Norway Grants.

Introduction

Public health initiatives aim to eradicate cervical cancer by the year 2030. As of 2020, 22 (82%) European Union (EU) Member States had integrated population-based screening programs for cervical cancer into their National Cancer Control Plans¹. While organized screening programs have been instrumental in reducing cervical cancer incidence and mortality, they often rely on a standardized, age-based strategy applied uniformly across the population². The issues associated with a “one-size-fits-all” screening approach include suboptimal attendance, over-screening, and significant disparities in the utilization of cancer screening services³. These limitations underscore the urgent need for more effective, tailored screening programs that account for individual risk factors. Cervical cancer screening offers an excellent framework for personalized risk assessment due to well-established disease patterns and key risk factors: persistent high-risk HPV infection, age, sexual history, oral contraceptive use, smoking, and screening non-attendance⁴. Risk-stratified screening has emerged as a concept in which decisions to offer screening or the determination of screening frequency and modality (screening test(s)) are guided by accurate estimation of an individual’s risk of cancer⁵. Risk-based screening aims to optimize benefits (reducing cancer-related deaths) while minimizing potential harms (excess screenings, false positives, and overdiagnosis).

Over the past decade, numerous cervical cancer predictive models have been developed⁶, but a systematic review reveals persistent challenges, including methodological inconsistencies, limited population representativeness, and small sample sizes that hinder generalizability. Electronic health data-based models have the potential to address these issues. By offering a comprehensive and detailed view of individual health histories and current statuses, electronic health data could significantly enhance the accuracy of risk assessments. Additionally, the real-time availability of this data allows for frequent updates to models, ensuring they reflect the most current information and emerging health trends. The extensive scale of electronic health data also facilitates the inclusion of diverse patient populations in model training, improving generalizability and overcoming limitations related to sample size and population diversity. Thus, leveraging electronic health data could provide a valuable approach to overcoming the challenges faced by existing predictive models.

This study aimed to develop and validate a model that predicts and evaluates risk over an 8- and 5-year horizon of cervical intraepithelial neoplasia grade 3 or higher (CIN3+) and cancer in the adult female population using nationwide, linked electronic health data.

Setting

In Estonia, an organized cervical cancer screening program utilizing Pap tests (cytology) every five years was initiated in 2006, targeting women aged 30–55 years. The participation rate in this screening program has been suboptimal, with attendance consistently falling below 50%⁷. Over recent decades, there has been a shift towards diagnosing cervical cancer at more advanced stages⁸. Approximately 90% of cervical cancer cases in Estonia have been detected outside of routine screening, through testing of symptomatic women. The program has had a minimal effect on the estimated age-standardized incidence rate of cervical cancer, which was 14.4 cases per 100,000 women during the period 2014–2018⁹, roughly twice the rates estimated for Western Europe (6.8 per 100,000), Northern America (6.4 per 100,000), and Australia (6.0 per 100,000)¹⁰.

Despite nearly two decades of the national cervical cancer (CC) screening program, the CC incidence in Estonia remains one of the highest in Europe. Given the availability of comprehensive nationwide electronic health data, developing a risk prediction model for CC in Estonia is essential. Such a model could enhance the effectiveness of the screening program by identifying high-risk individuals who would benefit from more frequent and targeted screening, potentially improving early detection rates and reducing the overall burden of the disease. The HPV vaccination program commenced in 2018 for girls aged 12 to 14, and since 2024 it also includes boys.

Methods

Study design

In this retrospective modelling and internal validation study, data for model development and internal validation were derived from three Estonian health registries: the Estonian Health Insurance Fund (EHIF)¹¹, the Estonian Cancer Registry (ECR), and the Estonian Medical Birth Registry (EMBR) (Supplementary Table 1 and data source description). These are national health data sources that can be linked using unique personal identification codes. Data spanning 2005 to 2012 were used to develop a risk-based model from routinely collected electronic health data. The eight-year period was chosen to ensure sufficient time for collecting and evaluating relevant predictors. All women in the EHIF born in 1988 or earlier (aged ≥ 16 years on 1 January 2005) were followed from 1 January 2005 until 31 December 2012.

The model was then validated over an eight-year period (1 January 2013 to 31 December 2020), chosen to extend beyond the national screening recommendation interval of five years. The validation data excluded all women with a previous indication of cervical or uterine cancer.

Data on medical/health history were supplemented with sociodemographic and reproductive history (Supplementary Table 1) and were employed to create a prediction model for two outcomes: CIN3+ and cervical cancer (Fig. 1). Model development and validation were performed following the clinical prediction rules and guidelines¹².

Data sources

The Estonian Health Insurance Fund (EHIF)¹¹ has provided universal public health insurance since 2001 and covers > 95% of the Estonian population (95.8% of the female population aged 16 or older in 2021)¹³. EHIF maintains a comprehensive healthcare and prescription pharmaceutical database, including personal data (sex, year of birth) and healthcare utilization (services provided, date of service, primary and secondary diagnoses, inpatient and outpatient treatments). Diagnoses are coded using the International Classification of Diseases, Tenth Revision (ICD-10), while medical services are coded using the Nordic Medico-Statistical Committee (NOMESCO) codes. The prescriptions database contains detailed information about all prescribed and purchased medications and vaccines. Full electronic data from the EHIF have been available since 2005.

The Estonian health insurance system adheres to universal coverage principles, and individuals’ insurance status may change based on various factors, such as employment and residency status.

Estonian Cancer Registry (ECR) is a population-based registry in operation since 1978 containing complete and reliable registration of incident cancer cases. In Estonia, reporting cancer cases is compulsory for all physicians who treat and diagnose cancer. The validity of Estonian Cancer Registry data is at a favorable international level¹⁴.

The Estonian Medical Birth Registry (EMBR¹⁵) was established in 1991 to collect data on all births in Estonia. All maternity units in Estonia are obliged to notify births to the EMBR. The notification form includes the personal identity number of the mother, and information about maternal socio-demographics, health behaviour and health before and during pregnancy. Data on births before 1991 are not available. The study omitted the years 1991–1993 due to incomplete records resulting from the gradual issuing of personal identification codes (PICs) to individuals in the early 1990s. Consequently, the analysis used only complete records from 1994 onward.

In Estonia, a unique 11-digit PIC is assigned to every resident at birth or at the time of immigration. As a single unique identifier, the PIC is recorded accurately in all three data sources used in this study, enabling straightforward and complete linkage of study population information between the registries.
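The linkage described above amounts to a key-based join on the PIC. The following is a minimal sketch only, not the authors' pipeline: the PIC values, field names and records are all invented for illustration, and real registry extracts would need far more careful handling.

```python
# Toy registry extracts keyed by a hypothetical personal identification code (PIC).
# Every PIC, field name and value below is invented for illustration.
ehif = {
    "48001010001": {"birth_year": 1980, "pap_tests": 2},
    "47505050002": {"birth_year": 1975, "pap_tests": 0},
}
ecr = {"47505050002": {"cancer_dx_year": 2015}}   # cancer registry records
embr = {"48001010001": {"births": 2}}             # birth registry records


def link_records(pic, *sources):
    """Merge the record held by each source for one PIC into a single dict."""
    merged = {"pic": pic}
    for source in sources:
        # A woman absent from a registry simply contributes no fields from it.
        merged.update(source.get(pic, {}))
    return merged


# EHIF defines the study population; the other registries are joined onto it.
linked = [link_records(pic, ehif, ecr, embr) for pic in ehif]
for row in linked:
    print(row)
```

Using the insurance database as the left side of the join mirrors the text's description of EHIF as the source that identifies the study population; fields from the other registries are absent when no matching record exists.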

Study population

The study population consisted of all women born in 1988 or earlier identified from the EHIF data. Excluded from the risk model were women who died before 1 January 2013, women with no information on health insurance as a predictor during the development period (2005–2012) or the validation period (2013–2020), and women considered not at risk of cervical cancer, such as women with a history of cervical cancer, uterine cancer, or total hysterectomy.

For the analysis, two cohorts were established: Cohort 1, which included all women born ≤ 1988, and Cohort 2, consisting of women born between 1977 and 1988. The rationale for developing a separate model for the younger cohort was based on the timing of the validation period, where these individuals would be entering the screening age. Furthermore, the birth registry data starting from 1994 for Cohort 2 is more comprehensive than that available for women in Cohort 1.

Predictors (medical/health and reproductive history, socio-demographic characteristics)

The predictors incorporated were based on previous research on risk factors for cervical cancer¹⁶,¹⁷. For both cohorts (Cohort 1: all women; Cohort 2: younger women), data on socio-demographics (year of birth, health insurance status), cervical cancer screening participation (PAP tests), systemic hormonal contraception use, and diagnosed sexually transmitted infections (STIs) were derived from EHIF. Additional data on the number of births, smoking history (ever during the development period), and education were obtained from the EMBR. Variable definitions and data sources are provided in Supplementary Table 1.

Modelling outcomes

Cervical intraepithelial neoplasia grade 3 or more severe disease (CIN3+) was the primary outcome, as it is considered the most reliable surrogate marker for cervical cancer risk. Cervical cancer was the secondary outcome (Supplementary Table 1). We did not perform a formal sample size estimation as we utilized nationwide data on all CIN3+ and cervical cancer cases.

Our sample size (outcome counts) aligns well with the recommendation to have at least 10 events per variable, which minimizes bias and ensures predictive accuracy in Cox proportional hazards models¹⁸.
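As a rough check of the events-per-variable (EPV) rule, the outcome counts reported in the Results can be divided by the 14 covariates mentioned in the Abstract. This sketch assumes each covariate counts as a single parameter; categorical predictors contribute several coefficients, so the effective EPV would be lower than shown here.

```python
def events_per_variable(events, n_predictors):
    """Events per candidate predictor (EPV)."""
    return events / n_predictors


# Assumption: 14 covariates (from the Abstract), each treated as one parameter.
N_COVARIATES = 14

# Outcome counts over the 8-year validation period, as reported in the Results.
outcome_counts = {
    ("Cohort 1", "CIN3+"): 5929,
    ("Cohort 1", "cervical cancer"): 1326,
    ("Cohort 2", "CIN3+"): 2697,
    ("Cohort 2", "cervical cancer"): 172,
}

epv = {key: events_per_variable(n, N_COVARIATES) for key, n in outcome_counts.items()}
for (cohort, outcome), value in epv.items():
    print(f"{cohort}, {outcome}: EPV = {value:.1f}")
```

Under this simplification, the smallest margin is for cervical cancer in Cohort 2, where 172 events give an EPV of about 12.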

Model derivation

The Cox proportional hazards model was employed to predict study outcomes up to 8 years after the development period, with the index date set as January 1, 2013. Additionally, we predicted 5-year outcome risks using the same model and evaluated its performance within this interval. The rationale for these timeframes is twofold: the 5-year interval aligns with Estonia’s recommended cervical cancer screening interval, while the 8-year interval reflects a potential extension of screening intervals. To select the predictors for the Cox models, we employed the Least Absolute Shrinkage and Selection Operator (LASSO) method with 10-fold cross-validation (see results in Figs. 3 and 4). For every cohort and outcome combination, we retained the variables from the penalized Cox model whose penalty parameter lambda yielded the highest out-of-sample Harrell’s C statistic. For both outcomes, we present separate final Cox models for Cohort 1 (all women born in 1988 or earlier) and Cohort 2 (younger women born 1977–1988), with their coefficients and likelihood ratio test statistics (Supplementary Tables 4, 5); in total, four models are reported. Additionally, separate results for models fitted using all predictors and using only predictors derived from EHIF data are reported in Supplementary Table 4.
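The selection criterion described above, choosing the penalty whose out-of-sample predictions give the highest Harrell's C, can be illustrated with a small pure-Python sketch. The follow-up times, event indicators, candidate lambda values and risk scores below are invented toy data; the study itself fitted penalized Cox models with 10-fold cross-validation, which is not reproduced here.

```python
def harrell_c(times, events, risks):
    """Harrell's C: share of comparable pairs in which the subject with the
    earlier observed event has the higher predicted risk (ties count 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy follow-up data: years to event or censoring, and event indicator.
times = [2, 4, 5, 7, 8, 8]
events = [1, 1, 0, 1, 0, 0]

# Hypothetical out-of-sample risk scores for three candidate penalties.
risks_by_lambda = {
    0.01: [0.9, 0.4, 0.5, 0.6, 0.2, 0.1],
    0.1: [0.9, 0.7, 0.5, 0.3, 0.4, 0.1],
    1.0: [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # all coefficients shrunk to zero
}

c_by_lambda = {lam: harrell_c(times, events, r) for lam, r in risks_by_lambda.items()}
best_lambda = max(c_by_lambda, key=c_by_lambda.get)
print(c_by_lambda, best_lambda)
```

The heaviest penalty yields identical scores for everyone and therefore a C of 0.5, which is why an intermediate penalty can be preferred on out-of-sample discrimination.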

We did not impute missing data due to their non-random absence. The rationale behind this approach was to construct a model that mirrors the actual real-world scenario. Missing data are frequently encountered in the context of routinely collected health information¹⁹, and such missingness often carries informative implications. We addressed this issue by including specific predictor variables that incorporate a category for ’Not available’ as one of the values (Table 1).
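One common way to carry missingness into a regression model without imputation is to dummy-code the predictor with an explicit 'Not available' level. The sketch below assumes a hypothetical three-level smoking variable; the actual variable definitions and levels used in the study are those in Table 1 and Supplementary Table 1.

```python
# Hypothetical levels for a smoking-history predictor; "No" is the reference.
LEVELS = ["No", "Yes", "Not available"]


def encode_smoking(value, levels=LEVELS, reference="No"):
    """Dummy-code one value; None or any unrecognized value is treated as
    'Not available' rather than being imputed."""
    if value not in levels:
        value = "Not available"
    return {f"smoking={lvl}": int(value == lvl) for lvl in levels if lvl != reference}


print(encode_smoking("Yes"))   # smoker recorded in the registry
print(encode_smoking(None))    # no registry record: explicit missing category
print(encode_smoking("No"))    # reference level: all indicators are zero
```

With this coding the model estimates a separate coefficient for women whose status is unknown, so informative missingness can contribute to the predicted risk.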

Model performance

The statistical performance of the risk prediction models was assessed by discrimination, calibration and clinical utility²⁰. Our study employed 10-fold cross-validation for internal validation. This procedure allowed us to generate out-of-sample predictions for the linear index of a Cox model and for predicted risks, facilitating the assessment of discrimination and calibration. Discrimination (classification accuracy) was assessed using the time-dependent area under the receiver operating characteristic curve (AUROC), with inverse probability of censoring weighting, over 5- and 8-year timeframes. The developed model’s discriminatory performance was also measured by Harrell’s C-statistic (ranging from 0 to 1, with values higher than 0.75 demonstrating useful discrimination)²¹; 95% confidence intervals (CI) are provided. Calibration plots by deciles of predicted risk (Supplementary Figs. 3–14) were used to examine the agreement between model-predicted and observed probabilities, and calibration slopes are reported. Finally, we evaluated the clinical utility of the prediction models using decision curve analysis²². Net benefit serves as a metric to evaluate the pros and cons of using a model for clinical decision support and for conducting impact studies. We report the range of thresholds at which the model demonstrated a net benefit.
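The net benefit used in decision curve analysis is conventionally computed as TP/n − FP/n × pt/(1 − pt), where pt is the risk threshold at which a woman would be flagged. The sketch below applies that standard formula to invented binary outcomes and predicted risks; it ignores censoring, which a survival-time analysis such as the one in this study would need to account for.

```python
def net_benefit(y_true, predicted_risk, threshold):
    """Net benefit of flagging everyone whose predicted risk >= threshold."""
    n = len(y_true)
    flagged = [(y, p) for y, p in zip(y_true, predicted_risk) if p >= threshold]
    true_pos = sum(1 for y, _ in flagged if y == 1)
    false_pos = sum(1 for y, _ in flagged if y == 0)
    return true_pos / n - false_pos / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: flag every woman regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data: 10 women, 2 with the outcome; predicted risks are invented.
y = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
risk = [0.30, 0.05, 0.02, 0.20, 0.15, 0.01, 0.03, 0.02, 0.04, 0.01]

pt = 0.10
nb_model = net_benefit(y, risk, pt)
nb_all = net_benefit_treat_all(y, pt)
print(nb_model, nb_all)  # the "flag no one" strategy has net benefit 0
```

A model is useful at a given threshold when its net benefit exceeds both reference strategies, flagging everyone and flagging no one; repeating the calculation across thresholds traces out the decision curve.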

All analyses were conducted using SAS 9.3 (Cary, NC) and R 4.2.3 (https://cran.r-project.org/).

Ethics

This study was approved by the ethical review board at the University of Tartu (protocol number: 3320/M-7, 21.12.2020) which waived the requirement to obtain informed consent. We followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) checklist to ensure transparent reporting²³. The whole research was performed in accordance with the relevant guidelines and regulations.

Results

Study cohort and development period (2005–2012)

Using the EHIF database we identified 633 255 women born in 1988 or earlier. Of those, 18.2% (n = 115 371) were excluded from the analysis (Fig. 2). Our study sample contained data on 517 884 women born in 1988 or earlier (Cohort 1); of these, 109 009 (21.0%) born between 1977 and 1988 were identified as Cohort 2 (Fig. 2).

The participants’ characteristics during the development period 2005–2012 are presented in Table 1. The mean age of the study population on December 31, 2012, was 54.2 years (range 25–109 years) in Cohort 1 and 30.03 years (range 25–36 years) in Cohort 2. In Cohort 1, 41.8% had no PAP test for 8 years (PAP test coverage = 0). HIV, HPV (genital warts), and genital chlamydia infections were diagnosed in 0.2%, 0.9%, and 2.2% of women, respectively.

Table 1 Study population characteristics up to the validation period.

Full size table

Study outcomes during the validation period (2013–2020)

In Cohort 1 over the validation period of 8 years, a total of 1326 cervical cancer cases were diagnosed among 517,884 women (cumulative incidence of 0.26%). With 3,897,120 person-years of follow-up, the incidence rate was 34 per 100,000 person-years. The total number of CIN3+ cases identified was 5929, yielding a cumulative incidence of 1.14% and an incidence rate of 152 per 100,000 person-years (Supplementary Table 2). In the validation period, the mean age at diagnosis was 59 years for women with invasive cervical cancer and 46 years for women with CIN3+. In Cohort 2, during 857 439 person-years of follow-up, CIN3+ and cervical cancer were diagnosed in 2697 (2.5%) and 172 (0.16%) women, with incidence rates of 314 and 21 per 100,000 person-years, respectively (Supplementary Table 2). Similar data for the 5-year horizon are reported in Supplementary Table 3.
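The Cohort 1 figures above follow directly from the reported counts: cumulative incidence is cases divided by women at risk, and the incidence rate is cases divided by person-years. A short sketch of that arithmetic, using only numbers stated in the paragraph (the function names are illustrative):

```python
def cumulative_incidence_pct(cases, women_at_risk):
    """Cumulative incidence as a percentage of women at risk."""
    return 100 * cases / women_at_risk


def rate_per_100k(cases, person_years):
    """Incidence rate per 100,000 person-years."""
    return 100_000 * cases / person_years


# Cohort 1 counts reported for the 2013-2020 validation period.
WOMEN = 517_884
PERSON_YEARS = 3_897_120
CANCER_CASES = 1326
CIN3_PLUS_CASES = 5929

cancer_ci = cumulative_incidence_pct(CANCER_CASES, WOMEN)
cancer_rate = rate_per_100k(CANCER_CASES, PERSON_YEARS)
cin3_ci = cumulative_incidence_pct(CIN3_PLUS_CASES, WOMEN)
cin3_rate = rate_per_100k(CIN3_PLUS_CASES, PERSON_YEARS)

print(f"Cervical cancer: {cancer_ci:.2f}%, {cancer_rate:.0f} per 100,000 person-years")
print(f"CIN3+: {cin3_ci:.2f}%, {cin3_rate:.0f} per 100,000 person-years")
```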

Cohort 1

Adjusted hazard ratios (HRs) from multivariable models fitted separately for cervical cancer and CIN3+ are illustrated in Figs. 3 and 4. In the final model for Cohort 1, a higher risk of CIN3+ was observed for those previously diagnosed with HIV, HPV and genital chlamydia. In addition, long-term hormonal contraceptive use, younger age, smoking and previously diagnosed cervical neoplasias were significant predictors, with the strongest associations noted for those with previous CIN3 diagnoses (HR 13.33; 95% CI 12.14–14.64) and those born in 1983–1988 (HR 7.21; 95% CI 5.65–9.19). Having health insurance and a history of PAP testing were protective factors. The risk of invasive cervical cancer was significantly increased among those with previous cervical neoplasias (especially CIN3), those living with HIV, and those with a higher number of births. PAP test coverage, being insured and higher education were inversely associated with cancer risk.

Cohort 2

For the younger cohort, previous HIV, HPV and genital chlamydia infections, long-term contraceptive use, and smoking were associated with a significantly increased risk of CIN3+. Tertiary education and health insurance coverage were inversely associated with CIN3+. For cervical cancer as the outcome, the following risk predictors were identified: history of CIN3, number of births, and HIV. As with CIN3+, having health insurance and tertiary education were protective factors.