An open source statistical web application for validation and analysis of virtual cohorts

Ohmann, Christian; Khorchani, Takoua; Cracanel, Alexandru; Brüning, Jan; Verde, Pablo Emilio

doi:10.1038/s41598-025-99720-3


Article
Open access
Published: 06 May 2025

Christian Ohmann¹,
Takoua Khorchani²,
Alexandru Cracanel³,
Jan Brüning⁴ &
…
Pablo Emilio Verde⁵

Scientific Reports volume 15, Article number: 15744 (2025)


Abstract

The conventional approach to developing medical treatments and medical devices usually covers pre-clinical and in-vitro investigations, in-vivo animal studies and clinical trials with humans. In-silico trials and virtual cohorts present a promising avenue for addressing the challenges inherent in clinical research and improving its efficiency. Despite considerable advancements in the field of in-silico trials, several notable gaps and challenges still need to be addressed. One is the limited availability of open and user-friendly statistical tools to support the specific analysis of virtual cohorts and in-silico trials. In the EU-Horizon funded project SIMCor we have developed a web application providing an R-statistical environment that supports the validation of virtual cohorts and the application of validated cohorts in in-silico trials. It provides a practical platform for validating cohorts and implements existing statistical techniques that can be applied to compare virtual cohorts with real datasets. It is fully open, generic and menu driven and provides user guidance and help (https://github.com/ecrin-github/SIMCor, https://zenodo.org/records/14718597). The tool has been developed according to specified user requirements and has been extensively tested and validated. Important next steps are to gain more experience with the tool in other domains and research environments and to extend its functionality.


Introduction

An in-silico clinical trial, also known as a virtual clinical trial, is an individualised computer simulation used in the development or regulatory evaluation of a medicinal product, device, or intervention¹,². The conventional approach to developing medical treatments and devices typically commences with pre-clinical and in-vitro development, followed by in-vivo animal models, and then clinical trials to evaluate the product’s viability for human use. In-silico trials and virtual cohorts, which are de-identified virtual representations of real patient cohorts, present a promising avenue for improving the efficiency of clinical research and for addressing its inherent challenges, such as long durations, high costs, and ethical implications. Literature reports suggest that in-silico clinical trials can help to reduce, refine, and partially replace real clinical trials: they can reduce the size and duration of trials through better design, and refine them by providing clearer and more detailed information on potential outcomes and by enhancing the understanding of how the tested product works³,⁴,⁵. Under appropriate conditions, in-silico trials can also partially replace clinical trials⁶. In addition, modelling and simulation is used to reduce, refine and replace animal experimentation and even to replace bench tests². However, it must be noted that fully realising this potential will require changes in legislation in addition to efforts in academic and industrial research, as animal trials are, to this day, a fundamental aspect of the pathway towards regulatory approval of medical devices. That this potential can be realised has been demonstrated by the wider application of in-silico clinical trials in pharmacological research, where regulatory acceptance of this novel technology is already strong⁷. Here, in-silico trials have been shown to be able to predict toxicity and safety and even to be more efficient than animal trials¹.

Furthermore, considerable time and cost savings can be achieved with in-silico trials. This was demonstrated, for example, by the VICTRE study, where only one third of the resources were required to design and complete a comparative trial. The comparative trial took approximately 4 years to complete, versus 1.75 years for the design, implementation, and execution of the VICTRE trial⁸. Similarly, the FD-PASS trial investigating safety and efficacy of flow diverter devices has been successfully replicated using in-silico models. In addition to the above-mentioned benefits, the respective study found that the in-silico trial was able to provide more information and insights regarding treatment failure than its conventional counterpart⁵. Currently, proper quantification of potential cost savings is challenging, as information regarding the costs of clinical trials is often confidential. However, the demonstrated time saving, and thus the potential reduction in time to market, is hugely beneficial, as it will ultimately result in earlier generation of revenue, which is considered largely beneficial to the medical device industry⁹.

Despite the promising advancements in the field of in-silico trials, several notable gaps and challenges still need to be addressed, hindering their wider adoption in healthcare and clinical research. These include technological limitations and advances, need for clarity on model evaluation, the unmet need for regulatory guidance, and poor communication between stakeholders.

For performing in-silico trials, adequate and powerful statistical tools are needed for planning and analysis. Several toolboxes and platforms are currently available. Some of them are commercial (e.g., InSilico trial platform with many services to support drug development¹⁰), while others are open-source and (partially) freely available (e.g., The QSP (quantitative systems pharmacology) Toolbox¹¹, Universal Immune System Simulator (UISS)¹², Simulo¹³). In addition, clinical trial simulators, such as the Highly Efficient Clinical Trials Simulator (HECT), may be valuable for designing in-silico trials¹⁴. While these tools are promising, certain limitations could reduce their applicability and usefulness.

The importance of virtual cohorts and in-silico techniques was investigated in the SIMCor project. SIMCor is a three-and-a-half year (January 2021–June 2024) EU-Horizon 2020 research and innovation action developing a computational platform for in-silico development, validation and regulatory approval of cardiovascular implantable devices¹⁵. The platform, composed of a virtual cohort generation and validation domain, a device implantation and effect simulation domain, and equipped with a variety of in-silico modelling resources, represents an open environment for collaborative R&D (research & development) among device manufacturers, researchers, medical authorities and regulatory bodies. In the project, in-silico technologies are applied to two clinical use cases: the simulation of a Transcatheter Aortic Valve Implantation (TAVI), and a Pulmonary Arterial Pressure Sensor (PAPS)¹⁶.

In this project, a general methodological framework for assessing the impact of in-silico methods and technologies on clinical and preclinical trials was developed. An essential part of this framework involves a statistical environment for planning and analysing virtual cohorts and in-silico trials. After careful evaluation, it was concluded that existing tools and platforms only partly fulfill the needs of the project, leading to the decision to develop and implement a more generic and open statistical environment for analysis of in-silico trials. The tool is based on multi-functionality, adequate software and hardware architecture, and high levels of automation. It is designed to support the use cases in SIMCor but also to be applicable to other use cases and domains. This paper is dedicated to the development, testing/validation, and application of this tool in the SIMCor project.

Methods

Survey on existing tools

To prepare the development of the R-statistical environment for application in virtual cohorts and in-silico trials, a survey of existing R-tools for computational modelling was performed. Only R-packages that were listed in CRAN (Comprehensive R Archive Network) were included. The search was restricted to R-packages from CRAN for the following reasons:

Statistical software freely available and open source
Structured and proven quality management in software development
Flexible integration with low-level computer languages (e.g., C++, Fortran) and scripting languages for data science (e.g., Python, MATLAB, Julia)
Support of graphical user interfaces like web applications
Scalability to dynamically implement further complex statistical procedures into the software

The search was performed in October 2023 using the list of CRAN packages by name (https://cran.r-project.org/web/packages/available_packages_by_name.html). CRAN packages referring to “clinical trials” (with “platform trials” and “adaptive trials” as subcategories) were filtered (n = 94) and then manually searched for “simulation” in the title. In total, 8 applications were identified as relevant for computational modelling (supplementary material S1). The packages were assessed by two co-authors of the manuscript (CO, PEV), who reached consensus on the information recorded for name, short description, URL, open source, licence and target. Close consideration of the existing tools revealed that currently no R-package exists that fully covers the needs for analysing data related to virtual cohorts and in-silico trials. These needs were discussed in a workshop described below. For that reason, the decision was taken to develop a web application and templates for planning and analysis of in-silico trials, allowing validation of virtual cohorts as well as application of validated cohorts in in-silico trials.

Statistical environment and requirements

A virtual workshop entitled “Statistical environment for in-silico trials” was organised by SIMCor and took place on 6 December 2021 with around 50 participants, including representatives from other in-silico projects (In-silico World¹⁷, SimInSitu¹⁸, SimCardioTest¹⁹). The objective of the workshop was to discuss and specify the requirements for the implementation of a statistical analysis environment for planning and managing in-silico trials and to explore implementation strategies together with experts from in-silico research as well as experienced biostatisticians, data managers, and data scientists (Supplementary Material S2).

Following the workshop, the decision was taken to develop the statistical environment with R, R Markdown and Shiny. The combination of R with the R Markdown and Shiny packages provides a widely used and user-friendly ecosystem for planning, analysing, and reporting in-silico trials within a replicable research environment. Detailed arguments for using this environment are given in the supplementary material S3 and in the section on “Survey on existing tools”. R-packages in CRAN are issued under the GNU-2 licence. The GNU General Public Licenses (GNU GPL or simply GPL) are a series of widely used free software licences, or copyleft licences, that guarantee end users the freedom to run, study, share, and modify the software. In the project, other packages from CRAN have been used, which are under the GNU-2 licence. The R-statistical environment developed is also issued under the GNU-2 licence.

In the following, the web application developed with Shiny is referred to as the R-statistical environment.

Practical software development started with the provision of user stories. User stories are tokens representing “atomic” needs and functionalities expected from the system. They include information on how the stakeholder describes the steps or workflow to solve each part of the problem addressed. In our project, user stories were defined based on the requirements identified in the initial workshop, and the implementation was done in iterations/sprints, in which the development team delivered incremental features and received constant feedback from the main stakeholders (Agile methodology, supplementary material S4).

Statistical algorithms to be implemented

The analytical techniques to be implemented in the R-statistical environment are described in detail in a document “general model” (version 2, December 2023, supplementary material S5). An overview is given in the figure below.

Two major areas are covered: validation of virtual cohorts on real datasets and application of validated cohorts in in-silico trials.

Results

Implementation

The resulting interactive web application was developed in the R programming language, based on the Shiny package, and is openly available as version 0.1.0 (https://github.com/ecrin-github/SIMCor). It is intended to support proof-of-validation for virtual cohorts and computer-based simulations. It offers a series of standard analytic techniques that can be applied to compare virtual cohorts with real datasets, thus supporting the process of validation of a virtual cohort (uni-, bi- and multivariate comparisons). Furthermore, it provides different options to apply validated virtual cohorts in in-silico trials. This covers a one-group assessment (no control) and two-group comparisons together with options on statistical design and sample size calculations (Fig. 1). The tool is generic and menu driven and it provides user guidance and help. Additional information on software development is documented in S6 (Development history of the application and R-packages integrated in the R-statistical environment).

General aspects

Validation and application of virtual cohorts are related to the Context of Use (CoU) and the Question of Interest (QoI). Definitions for these terms are taken from the FDA Guidance on “Assessing the credibility of computational modelling and simulation in medical device submissions”²⁰. For models used in in-silico device testing or in in-silico trials, the CoU should describe how the model will be used in a simulation study to address the QoI. The QoI defines the specific and concrete question related to the CoU. As such CoU and QoI are prerequisites for any kind of validation or application activity directed at virtual cohorts or in-silico trials. CoU and QoI are essential elements for the scope and role of the computational model and the specific question addressed. Therefore, documentation of this information is necessary for the application of the R-statistical environment.

Implemented functionality

Validation of virtual cohorts

To apply the module “validation of virtual cohorts”, the CoU and the QoI must first be specified as free text and submitted (by clicking the boxes).

The next step deals with the upload of datasets for validation including both the virtual dataset and the real dataset to be compared to. For the import, any accessible computer can be browsed, and a specific file selected. Virtual and real datasets must be uploaded as a CSV file with the same variable structure. After uploading, the following analysis can be performed (see: Fig. 2 with screenshots and Table 1):

Table 1 Analytical techniques implemented in the R-statistical framework.

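The requirement that the virtual and the real dataset share the same variable structure can be checked before any analysis is run. The following is a minimal Python sketch of such a check, not the tool's own R/Shiny code; the column names and the tiny in-memory datasets are hypothetical and serve only as an illustration.

```python
import csv
import io

def read_header(csv_text):
    """Return the list of column names from the first row of CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader)

def same_structure(real_csv, virtual_csv):
    """True if both CSV texts have identical column names in identical order."""
    return read_header(real_csv) == read_header(virtual_csv)

# Hypothetical miniature datasets (column names are illustrative only).
real_csv = "annulus_diameter,valve_area\n23.1,0.8\n25.4,0.7\n"
virtual_csv = "annulus_diameter,valve_area\n24.0,0.9\n22.7,0.6\n"
mismatched_csv = "annulus_diameter,lvot_angle\n24.0,12.0\n"

structure_ok = same_structure(real_csv, virtual_csv)
structure_bad = same_structure(real_csv, mismatched_csv)
print(structure_ok, structure_bad)
```

A stricter check could also compare the data type of each column, but the source only states that the variable structure must match.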

Univariate comparison

From the imported datasets, all variables, several variables or one specific variable can be selected. Descriptive statistics are then calculated and presented separately for the virtual and the real dataset:

Mean value, standard deviation, minimum and maximum
Scatter plots of combinations of variables

The results are presented as tables with variables as rows and separately for virtual and real data. In addition, the results are shown as box plots (see Fig. 2a).
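The descriptive statistics listed above can be sketched in a few lines. This Python example is an illustration under assumed toy data, not the implementation used in the tool; the function name `describe` and the values are hypothetical.

```python
import statistics

def describe(values):
    """Mean, sample standard deviation, minimum and maximum of one variable."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 in the denominator)
        "min": min(values),
        "max": max(values),
    }

# Hypothetical values of a single variable in each cohort.
real = [2.0, 4.0, 4.0, 6.0]
virtual = [1.0, 3.0, 5.0, 7.0]

# One summary per cohort, mirroring the separate presentation described above.
summary = {"real": describe(real), "virtual": describe(virtual)}
for cohort, stats in summary.items():
    print(cohort, stats)
```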

Bivariate comparison

Here, bivariate correlations between the variables are calculated separately for the real and the virtual dataset. The idea is to compare the correlations within the two cohorts, in order to assess not only whether the parameter distributions but also whether the relations between parameters are adequately mimicked by the virtual cohort.

The results are graphically displayed as so-called heatmaps. Correlation heatmaps are a type of plot that visualise the strength of relationships between numerical variables. Correlation plots are used to understand which variables are related to each other and illustrate the strength of this relationship (see Fig. 2b).
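The comparison of correlation structures can be sketched as follows. The example assumes Pearson correlation, which the source does not specify, and uses hypothetical two-variable data; a heatmap would simply colour-code the entries of each matrix.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical cohorts: same marginal values, opposite relation between x and y.
real = {"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]}
virtual = {"x": [1.0, 2.0, 3.0, 4.0], "y": [8.0, 6.0, 4.0, 2.0]}

r_real = correlation_matrix(real)
r_virtual = correlation_matrix(virtual)
# A large difference signals that the virtual cohort does not mimic the relation.
diff = r_real["x"]["y"] - r_virtual["x"]["y"]
print(r_real["x"]["y"], r_virtual["x"]["y"], diff)
```

The toy data are chosen so that the univariate summaries of `y` are identical in both cohorts while the correlation with `x` has the opposite sign, which is exactly the kind of discrepancy a purely univariate comparison would miss.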

Multivariate comparison

To evaluate the compatibility of the virtual cohort with the real data, a multivariate comparison between the n-dimensional distributions of the features of both cohorts can be performed in the R-statistical environment (see Fig. 2c). The following test is used:

A quantile–quantile plot between the synthetic and the real data is produced after multivariate standardisation of each dataset. Multivariate standardisation is performed by (1) subtracting the vector of means from each data vector and (2) scaling with the inverse of the variance–covariance matrix. The resulting standardised quantity is a quadratic form that, under the assumption of multivariate normality, follows a chi-squared distribution with degrees of freedom equal to the number of variables minus one. Histograms of these quadratic forms are displayed for the real and virtual data. The methodology is explained in the supplementary material S5.
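The quadratic form described here is the squared Mahalanobis distance of each observation from its cohort mean. The following Python sketch is restricted to two variables so that the covariance inverse can be written in closed form; the data points and function names are hypothetical, and the tool itself may compute this differently (see supplementary material S5).

```python
def squared_distances(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample variance-covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    result = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^T S^{-1} d with the closed-form inverse of a 2x2 matrix.
        result.append((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)
    return result

def quantiles_at(values, probs):
    """Simple empirical quantiles (nearest lower order statistic)."""
    ordered = sorted(values)
    return [ordered[int(p * (len(ordered) - 1))] for p in probs]

# Hypothetical cohorts with two variables each.
real_points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]
virtual_points = [(0.1, 0.2), (0.9, 0.1), (0.2, 1.1), (1.2, 0.8), (1.8, 2.9), (1.0, 1.5)]

d2_real = squared_distances(real_points)
d2_virtual = squared_distances(virtual_points)

# A Q-Q plot would draw these paired quantiles against each other.
probs = [0.0, 0.25, 0.5, 0.75, 1.0]
qq_pairs = list(zip(quantiles_at(d2_real, probs), quantiles_at(d2_virtual, probs)))
print(qq_pairs)
```

Points lying close to the diagonal of the resulting plot indicate that the two cohorts have a similar multivariate spread after standardisation.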

Variability assessment

In this validation approach, the results from the model (virtual dataset) and the experiments (real dataset) are plotted together using a nonparametric density function for a variable of interest. The uncertainties associated with the model (arising from input uncertainties and numerical uncertainties) and the experiment (stemming from measurement system uncertainty and specimen-to-specimen variability) are represented in the two distributions. Any discrepancies between the two curves are thus interpreted as uncertainty in the model that generated the virtual data.

The uncertainty in the model is assessed by a bootstrap technique. The real data is taken as a reference distribution, and bootstrap samples are generated from the virtual dataset. For each bootstrap sample, a nonparametric density function is calculated, and 95% confidence bounds for the density function are calculated. If the real data is not covered by the 95% confidence bounds created from the virtual dataset, then a deviation of the model has been detected.

In addition, a bootstrap p-value can be calculated from the number of times that the density of the real data is out of the 95% confidence bounds of the bootstrap analysis. The results are graphically displayed as density functions (see Fig. 2d). The methodology is described in supplementary material S5.
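The bootstrap procedure can be sketched as follows. This is an illustration under several assumptions that are not taken from the source: a Gaussian kernel with a normal-reference bandwidth, pointwise percentile bounds on a fixed grid, and the fraction of grid points at which the real-data density falls outside the bounds as a simple summary. The exact definition of the bootstrap p-value used by the tool is given in supplementary material S5 and may differ.

```python
import math
import random
import statistics

def reference_bandwidth(sample):
    """Normal-reference rule of thumb for a Gaussian kernel bandwidth."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-0.2)

def kde(sample, grid, bandwidth):
    """Gaussian kernel density estimate of `sample` evaluated on `grid`."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in sample)
            for g in grid]

def bootstrap_bounds(virtual, grid, n_boot=200, seed=1):
    """Pointwise 95% percentile bounds of the density of bootstrap resamples."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_boot):
        resample = rng.choices(virtual, k=len(virtual))  # sampling with replacement
        curves.append(kde(resample, grid, reference_bandwidth(resample)))
    lower, upper = [], []
    for j in range(len(grid)):
        column = sorted(curve[j] for curve in curves)
        lower.append(column[int(0.025 * (n_boot - 1))])
        upper.append(column[int(0.975 * (n_boot - 1))])
    return lower, upper

def fraction_outside(real, grid, lower, upper):
    """Share of grid points where the real-data density leaves the bounds."""
    density = kde(real, grid, reference_bandwidth(real))
    outside = sum(1 for d, lo, hi in zip(density, lower, upper) if not lo <= d <= hi)
    return outside / len(grid)

# Hypothetical data: one real sample resembling the virtual cohort, one shifted.
rng = random.Random(0)
virtual = [rng.gauss(0.0, 1.0) for _ in range(200)]
real_close = [rng.gauss(0.0, 1.0) for _ in range(200)]
real_shifted = [x + 3.0 for x in real_close]

grid = [-2.0 + 0.1 * i for i in range(41)]
lower, upper = bootstrap_bounds(virtual, grid)
frac_close = fraction_outside(real_close, grid, lower, upper)
frac_shifted = fraction_outside(real_shifted, grid, lower, upper)
print(frac_close, frac_shifted)
```

In this sketch a real dataset that is shifted relative to the virtual cohort leaves the bounds over most of the grid, which corresponds to the detected deviation of the model described above.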

Application of validated cohorts

The following analytical techniques are implemented in the R-statistical environment (see Fig. 3 with screenshots and Table 1):

One-group design

For the 1-group design (i.e. only one validated cohort and no control), the dataset for the analysis should be imported as a CSV file (currently the only supported format). For the import, any accessible computer can be browsed, and a specific file selected. Then the variable of interest should be specified by selecting it from the list of all variables of the imported dataset. In the next step, the type of variable needs to be specified (discrete, continuous, time-to-event) (see Fig. 3a).

If “perform analysis” is clicked, two options are available:

Data (default)
Plot

Under “Data” the individual records belonging to the dataset can be browsed.

“Plot” provides a figure reporting frequency and a chi-square test for a discrete variable, a boxplot for a continuous variable, and a Kaplan–Meier curve for a time-to-event variable.
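For the time-to-event case, the Kaplan–Meier curve is the product-limit estimate of the survival function. The following Python sketch shows the standard estimator on hypothetical data; it is not the tool's R implementation, and the function and variable names are illustrative.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate at each distinct event time.

    events[i] is 1 if the event was observed at times[i] and 0 if censored.
    Returns a list of (time, survival probability) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0  # events at time t
        removed = 0   # events plus censored observations at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed > 0:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Hypothetical follow-up times with censoring indicators (1 = event, 0 = censored).
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print(curve)
```

The curve drops only at observed event times; censored observations reduce the number at risk without causing a step.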

Two-group design

For the 2-group design, the two datasets for the analysis should be imported as CSV files (currently the only supported format). For the import, any accessible computer can be browsed, and two datasets should be selected sequentially. The structure of the two datasets should be the same. As in the 1-group design, the variable of interest should be specified by selecting it from the list of all variables of the imported datasets. In the next step, the type of variable needs to be specified (discrete, continuous, time-to-event) (see Fig. 3b).

If “perform analysis” is clicked, three options are available:

Data (default)
Plot
Analysis

Under “data” (default option), the individual datasets can be browsed.

For a discrete variable, “plot” provides a figure reporting frequencies for the two datasets. For a continuous variable, boxplots of the two datasets are presented. For a time-to-event variable, the two Kaplan–Meier curves are shown in one figure.

The function “analysis” covers a chi-square test for a discrete variable, comparing the two datasets. For a continuous variable, a t-test is presented and for a time-to-event variable a Kaplan-Meier test.
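The test statistics behind the two-group comparison of a continuous and a discrete variable can be sketched as follows. The example assumes Welch's unequal-variance t statistic and an uncorrected Pearson chi-square statistic; the source does not state which variants the tool applies, and the device data are hypothetical. Only the statistics are computed here, since p-values would require the t and chi-square distribution functions.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical continuous outcome for two devices.
device_a = [5.1, 4.9, 5.6, 5.8, 6.0]
device_b = [4.2, 4.8, 4.5, 5.0, 4.4]
t_stat = welch_t(device_a, device_b)

# Hypothetical discrete outcome: rows are devices, columns are outcome categories.
counts = [[30, 20], [20, 30]]
chi2 = chi_square_statistic(counts)
print(t_stat, chi2)
```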

Sample size estimation

The module on sample size calculation is currently restricted to a two-group comparison and a continuous outcome variable with a common standard deviation between the groups. For application of this function, the mean for group 1, the mean for group 2, the common standard deviation, the significance level, the power, and the hypothesis type (one- or two-sided) need to be entered.

For taking uncertainty in the estimation of effect size into consideration, a lower bound (minimum 0) and a higher bound (maximum 1) can be defined. Finally, the number of scenarios can be specified with the slider. If all information is entered, a summarising statistic for the calculated sample size with minimum, median, mean and maximum is given. Probabilities for achieving a power less than 90% for different sample sizes are calculated and presented (see Fig. 3c).
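A sample size calculation of this kind can be sketched with the usual normal-approximation formula for two groups with a common standard deviation, n per group = 2σ²(z₁₋α/₂ + z₁₋β)² / (μ₁ − μ₂)². The sketch below also draws effect sizes uniformly between a lower and an upper bound to mimic the scenario analysis; the uniform sampling, the function names and the default values are assumptions, not a description of the tool's algorithm.

```python
import math
import random
import statistics
from statistics import NormalDist

def n_per_group(mean1, mean2, sd, alpha=0.05, power=0.9, two_sided=True):
    """Normal-approximation sample size per group for a two-group comparison."""
    delta = abs(mean1 - mean2)
    if delta == 0:
        raise ValueError("the two means must differ")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (sd * (z_alpha + z_beta) / delta) ** 2)

def scenario_summary(sd, low, high, n_scenarios, seed=1):
    """Sample sizes for effect sizes drawn uniformly from [low, high] (assumption)."""
    rng = random.Random(seed)
    sizes = [n_per_group(rng.uniform(low, high), 0.0, sd) for _ in range(n_scenarios)]
    return {
        "min": min(sizes),
        "median": statistics.median(sizes),
        "mean": statistics.mean(sizes),
        "max": max(sizes),
    }

# Hypothetical inputs: difference in means 0.5, common SD 1, alpha 0.05, power 90%.
n_two_sided = n_per_group(0.5, 0.0, 1.0)
n_one_sided = n_per_group(0.5, 0.0, 1.0, two_sided=False)
summary = scenario_summary(sd=1.0, low=0.3, high=0.7, n_scenarios=100)
print(n_two_sided, n_one_sided, summary)
```

The lower bound is kept away from zero in this example because the required sample size grows without limit as the assumed difference between the groups approaches zero.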

For any individual statistical algorithm applied in an analysis within the R-statistical environment, a report in pdf-format is available.

Testing and validation

Testing of the R-statistical environment covered inspection of the Shiny code, module testing, and integration testing of the full application. A senior statistician was involved in this process (PEV).

In addition, the tool application was validated against concrete examples:

Validation of virtual cohorts

Verstraeten et al. developed a virtual cohort generator that creates anatomically plausible, synthetic geometries of stenosed aortic valves for use in virtual TAVI trials²¹. This generator was constructed utilising an approach that combines non-parametric statistical shape modelling with sampling from various distributions. It was able to produce 500 synthetic aortic valve stenosis geometries, which were then validated by comparison with a real-life cohort of 97 patients, confirming the validity of the virtual cohort. With the R-statistical environment, the original results could be reproduced and the study’s analysis could be extended further. The data used in this study, including the examples, are publicly accessible on 4TU.ResearchData²² and on ECRIN’s GitHub repository²³, specifically under the filenames “shapeFeatures_real.csv” and “shapeFeatures_synthetic.csv”.

Application of validated cohorts

The data used for validation was generated through simulations for a study comparing two devices, labelled as Device A and Device B, across three types of outcome variables: continuous, discrete, and time-to-event. The data were independently analysed with separate R-scripts and the results were compared with those from the R-statistical environment.

The datasets needed for these two examples are included in ECRIN’s GitHub repository (data_A_with_censor.csv, data_B_with_censor.csv)²³.

Implementation as open source and in the VRE

The R-statistical environment is fully open-source, and all necessary information for its implementation, including the source code, can be found in ECRIN’s GitHub repository²³. It is registered in ZENODO (https://zenodo.org/records/14718597). In addition, it can be deployed on-premises for local use or accessed as a Software-as-a-Service (SaaS) via the Virtual Research Environment (VRE) developed under the EU-funded SIMCor project. Its modules are accessible at:

Validation module: https://simcor.unitbv.ro/shiny/validation-module/
Application of validated cohorts’ module: https://simcor.unitbv.ro/shiny/application-module/

The SIMCor VRE is the computational platform designed and implemented during the project for the design, conduct, and analysis of in-silico clinical trials, with a focus on modelling and simulation to assess the conformity, safety, and efficacy of cardiovascular devices²⁴. In its current online version, it comprises: the VRE Drive, a data repository and interoperability environment for the exchange of data among the various VRE components; the virtual cohort generator, which generates a new population of virtual patients based on a set of characteristics provided as input parameters; a module for virtual TAVI device implantation; and the R-statistical environment.

The VRE is also referenced in an SOP on virtual cohort generation and validation²⁵.

Application of the R-statistical environment in SIMCor

In the SIMCor-project three concrete in-silico trials were performed:

TAVI-1: Effects of convergent and divergent Left Ventricular Outflow Tract (LVOT) shapes on paravalvular leakage (PVL) after virtual TAVI
TAVI-3: Effects of calcification on PVL after virtual TAVI
PAPS-1: Effects of pulmonary artery side branches on haemodynamics and fixation safety of PAPS, to assess whether detailed assessment of the landing site during device implantation is necessary.

The procedure started with the formulation of application scenarios. Thereafter the study protocol was developed, including a detailed description of the datasets. Then, the virtual cohorts for the in-silico trials were generated and finally, the data analysis was performed.

Owing to time constraints, the analyses in the three in-silico trials could not be performed in the R-statistical environment and were instead done separately with RStudio. It was, however, assessed whether the analysis procedures applied in the in-silico trials could have been performed within the R-statistical environment. The results of this comparison are summarised in Table 2:

Table 2 Statistical analysis of in-silico trials in SIMCor.


It turned out that the basic techniques could have been applied in the R-statistical environment; however, more sophisticated techniques (e.g., linear mixed-effects models) are still missing from the application.

Discussion

In biomedical, pharmaceutical and toxicology research, the safety and efficacy of biomedical products is ultimately tested on humans via clinical trials after prior laboratory testing in-vitro and/or in-vivo on animals. The complete development chain of a new biomedical product and its introduction to the market is very long and expensive. Alternative methodologies to reduce animal and human testing are needed to address the safety and efficacy issues of clinical human trials, the ethical issues, and the imperfection of predictions from laboratory and animal studies when applied to humans²⁶. Regulatory bodies, such as the FDA, increasingly recognise that computational (in-silico) modelling and simulation (M&S) are powerful tools that complement traditional methods for gathering evidence, including bench-top (in-vitro) testing and animal or clinical (in-vivo) studies, about products regulated by the FDA or for developing FDA policy²⁰,²⁷,²⁸. In-silico medicine may significantly speed up the design process and reduce costs in medical device and medicinal product R&D by enabling virtual prototyping and testing through computational models²⁹. As a consequence, standards, frameworks and guidelines have been developed that characterise and promote in-silico trial technologies (e.g.,¹,³⁰,³¹,³²,³³,³⁴).

Similarly, increasing uptake of in-silico trials will require ensuring the replicability and reproducibility of findings and model outcomes, and thus their credibility. Even though the ongoing “replication crisis” does not affect only computational biomedical research, this field is particularly sensitive because of the use of complex in-silico models and of data, usually used for model parameterisation, which are often not freely available owing to intellectual property or data privacy-related issues³⁵,³⁶. To address this issue, robust tools for planning and evaluation of in-silico trials that facilitate systematic, complete and automatic documentation of all individual steps performed are considered strongly beneficial. This aspect was specifically targeted in our study: the R-statistical environment integrates a report function that fosters systematic and complete documentation of all individual steps performed in planning and analysis. This can be considered a significant contribution to reducing the replicability crisis in the field of computational modelling and simulation, although it covers only a small part of the full development and application chain for virtual cohort generators.

A general issue to be solved in the area of virtual cohorts and in-silico trials is the provision of standardised and harmonised metadata characterising not only computational models but also the statistical algorithms applied in the analysis of the related data. That there are still issues with reproducibility/replicability has been demonstrated in several publications (e.g.,³⁷). The implementation of reproducible research for in-silico analyses requires extensive metadata to describe both scientific concepts and the underlying computing environment. Metadata provide context and provenance to raw data and methods and are essential to both discovery and validation³⁸,³⁹. Metadata standards should be applied across the full “analytical stack” consisting of input data, tools, reports, pipelines and publications³⁶. An adequate metadata model serves to support discoverability of a virtual cohort by describing the underlying model and the creation of the virtual cohort (FAIR data⁴⁰). In addition, metadata on the data upload should be included. Similar information could be included for the data usage and analysis. Glossaries of terms for computer modelling and simulation, such as the second release of the Avicenna glossary, are a first step in the right direction but are not sufficient⁴¹. Different ontologies for modelling and simulation are available but are difficult to apply and far from being standardised⁴². Metadata for statistical tools are data (information) used to describe statistical objects⁴³. Statistical metadata are best understood as structured information. One of the next steps should be to extend the R-statistical environment with standardised metadata. Another area of relevance would be to link the R-statistical environment to the CDISC (Clinical Data Interchange Standards Consortium) standard for clinical trials. First attempts in this direction have been made in the SIMCor project, but much more work is needed⁴⁴.

To avoid redundancy and unnecessary work, the development of a new tool should only be triggered if the functionality needed is not covered by existing tools. For that reason, we performed a survey to assess the status of existing computer tools supporting the analysis of virtual cohorts and in-silico trials. Only R packages listed in CRAN, which are known for their widespread use, were included in the search, as these packages are validated by the R foundation⁴⁵. In total, 8 applications relevant for computational modelling were identified. Closer consideration revealed that the target of these tools was limited to specific aspects, such as simulation of specific trial types (e.g., platform trials, pharmacokinetic-pharmacodynamic studies, adaptive trials) or application of Bayesian techniques (supplementary material S1). Thus, some of the tools cover aspects relevant for supporting the validation of virtual cohorts or applying virtual cohorts in in-silico trials, but none of them seemed to be generic enough to cover the functionality needed in the SIMCor project.

The choice to use the programming language R and the Shiny package to develop the web application was influenced by their popularity, open-source nature, and the extensive international community that supports them. One might question why other options were not considered for developing the tool and whether this decision significantly limits its applicability. An alternative approach would be to use the Python programming language, which is versatile and commonly used for data science projects as well as for building websites, automating tasks, and statistical processing of images and text. Like R, Python supports a functional programming style, and integrating both languages is straightforward. For instance, the reticulate package offers an R interface to Python modules, classes, and functions⁴⁶. Conversely, it is also possible to run R code from Python using the rpy2 package⁴⁷. Additionally, RStudio allows users to work with both languages within an integrated development environment. Therefore, choosing R as the statistical tool for this project should not be viewed as a limitation on its usage or applicability.

An R-statistical environment alone is certainly of benefit for potential users, but realistically it covers only part of the pipeline for developing and applying virtual cohorts and in-silico trials. Its strengths lie in the possibility to upload any dataset from any domain (currently only as CSV) and to apply statistical algorithms that are of value for the validation of virtual cohorts and the application of validated cohorts in in-silico trials. It would, however, be a major step forward if the statistical techniques implemented in the R-statistical environment could be coupled with computational models for generating virtual cohorts (so-called virtual cohort generators). This would support the workflow from generating virtual cohorts with computational models to analysing the generated data in one environment. This approach was followed in the SIMCor project by integrating virtual cohort generators with the R-statistical environment in a VRE²⁴. The VRE still has some limitations: its use is currently restricted to specific users, and the full pipeline of computational modelling is not yet integrated, so local environments still need to be connected for complex simulations. Nevertheless, the VRE is a necessary and promising step towards an open and powerful computational environment for developing and applying virtual cohorts.

The R-statistical environment developed in SIMCor is a first and promising step, but it needs considerable further improvement. The availability of statistical algorithms for the analysis of virtual cohorts and of tools to plan in-silico trials is still limited. The three in-silico trials of the SIMCor project showed that many more statistical techniques need to be covered by the tool for it to be useful. For example, linear mixed-effects models had to be applied in two of the three in-silico trials performed in SIMCor (TAVI-1, PAPS-1), and this was done outside the R-statistical environment. Many more techniques could be named that would benefit potential users dealing with virtual cohorts and in-silico trials (e.g. multivariate logistic regression, sample size estimation for in-silico trials based on the odds ratio).
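As an illustration of one of the techniques named above, sample size estimation based on the odds ratio can be sketched as follows. This is not functionality of the R-statistical environment. It is a textbook normal approximation for comparing two independent proportions with equal group sizes, written in Python for illustration, and the function names `proportion_from_odds_ratio` and `sample_size_per_group` are assumptions of this sketch. Given an event probability p₀ in the control group and a target odds ratio, the event probability in the comparison group is first derived from the odds, and the number of subjects per group is then computed for a two-sided test.

```python
from math import ceil, sqrt
from statistics import NormalDist


def proportion_from_odds_ratio(p_control: float, odds_ratio: float) -> float:
    """Event probability in the comparison group implied by the odds ratio."""
    odds = odds_ratio * p_control / (1.0 - p_control)
    return odds / (1.0 + odds)


def sample_size_per_group(
    p_control: float,
    odds_ratio: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate subjects per group for a two-sided two-proportion z-test.

    Textbook approximation with equal allocation; an illustrative sketch,
    not the method implemented in any particular tool.
    """
    if not 0.0 < p_control < 1.0:
        raise ValueError("p_control must lie strictly between 0 and 1")
    if odds_ratio <= 0.0 or odds_ratio == 1.0:
        raise ValueError("odds_ratio must be positive and different from 1")
    if not (0.0 < alpha < 1.0 and 0.0 < power < 1.0):
        raise ValueError("alpha and power must lie strictly between 0 and 1")

    p_treat = proportion_from_odds_ratio(p_control, odds_ratio)
    p_bar = (p_control + p_treat) / 2.0  # pooled proportion under H0

    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)

    numerator = (
        z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
        + z_beta * sqrt(p_control * (1.0 - p_control) + p_treat * (1.0 - p_treat))
    ) ** 2
    return ceil(numerator / (p_treat - p_control) ** 2)


# Example: 20% events in the control group, target odds ratio of 2.
print(sample_size_per_group(p_control=0.20, odds_ratio=2.0))
```

For an in-silico trial, the control-group probability could be taken from the real reference data and the odds ratio from the clinically relevant effect. More refined approaches, for example methods based on logistic regression with covariates, would require dedicated statistical software.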

Finally, the R-statistical environment was developed within the SIMCor project, and thus experience with the application is so far limited to the domain investigated and to the project partners. The next step should be to gain more experience by having other researchers apply the tool in other domains and to other questions. With the code and supporting information (implementation and user guide) openly available on GitHub and the underlying statistical techniques fully described in the model description, this should be feasible. A concrete next step is to make the software available on CRAN. CRAN is a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. Co-author PEV has major experience with CRAN and has published several packages there (jarbes, bamdit). Furthermore, the entire SIMCor Virtual Research Environment, including its modules such as the R-statistical environment, has been included in the EDITH catalogue. The EDITH project is establishing the foundation of the European Virtual Human Twin infrastructure, which will be implemented subsequently. It is envisaged to include the R-statistical environment within this framework together with its application examples. In the shorter term, the application scenarios described in this work will be published individually, demonstrating the methods and findings of each respective in-silico trial. These publications will refer directly to the use of the R-statistical environment for the design and evaluation of the in-silico trial and will therefore provide comprehensive application examples for the tool.

Conclusions

In this project, an open, generic, and menu-driven web application has been developed in Shiny to support the validation and application of virtual cohorts in specific use cases related to cardiological medical devices. The tool is applicable to other use cases, research environments, and scientific domains for the development or regulatory evaluation of a medicinal product, device, or intervention.

Data availability

Code and examples are available at ECRIN GitHub: SIMCor (https://github.com/ecrin-github/SIMCor) and Zenodo (https://zenodo.org/records/14718597). A detailed description of the R-statistical environment (including survey, user stories) is available at https://drive.google.com/drive/folders/1_cRFCkLn1mYbUjy_yrOxOwdtgmkNQXCM (SIMCor Google Drive). This file will be published in Zenodo in case the manuscript is accepted for publication. The example for “validation of a virtual cohort” (Verstraeten et al.) is available at https://data.4tu.nl/datasets/3f6a3788-96e6-4b81-b37b-f07eeec85965

References

Pappalardo, F., Russo, G., Tshinanu, F. M. & Viceconti, M. In silico clinical trials: Concepts and early adoptions. Brief Bioinform. 20, 1699–1708. https://doi.org/10.1093/bib/bby043 (2019).
Viceconti, M. et al. In silico trials: Verification, validation and uncertainty quantification of predictive models used in the regulatory evaluation of biomedical products. Methods 185, 120–127. https://doi.org/10.1016/j.ymeth.2020.01.011 (2021).
Davies, M. R. et al. An in silico canine cardiac midmyocardial action potential duration model as a tool for early drug safety assessment. Am. J. Physiol. Heart Circ. Physiol. 302, H1466–H1480. https://doi.org/10.1152/ajpheart.00808.2011 (2012).
Clermont, G. et al. In silico design of clinical trials: A method coming of age. Crit. Care Med. 32, 2061–2070. https://doi.org/10.1097/01.ccm.0000142394.28791.c3 (2004).
Sarrami-Foroushani, A. et al. In-silico trial of intracranial flow diverters replicates and expands insights from conventional clinical trials. Nat. Commun. 12, 3861. https://doi.org/10.1038/s41467-021-23998-w (2021).
Magnusson, M. O. et al. Dosing and switching strategies for paliperidone palmitate 3-month formulation in patients with schizophrenia based on population pharmacokinetic modeling and simulation, and clinical trial data. CNS Drugs 31, 273–288. https://doi.org/10.1007/s40263-017-0416-1 (2017).
Musuamba, F. T. et al. Scientific and regulatory evaluation of mechanistic in silico drug and disease models in drug development: Building model credibility. CPT Pharmacometrics Syst. Pharmacol. 10, 804–825. https://doi.org/10.1002/psp4.12669 (2021).
Badano, A. In silico imaging clinical trials: Cheaper, faster, better, safer, and more scalable. Trials 22(1), 64. https://doi.org/10.1186/s13063-020-05002-w (2021).
Czypionka, T., Eisenberg, S., Kraus, M., Reiss, M., Rösler, D. & Zech, C. Deliverable 10.4 “Industry and market impact report”, Institute for Advanced Studies (IHS), M42. Available at: https://ec.europa.eu/research/participants/documents/downloadPublic?documentIds=080166e50ebc449c&appId=PPGMS (2024).
InSilicoTrials: Hyper-Accelerate Digitalization of Pharma. https://insilicotrials.com/platform/ (2024)
Cheng et al. QSP Toolbox: Computational Implementation of Integrated Workflow Components for Deploying Multi-Scale Mechanistic Models. AAPS J. 19: 1002. https://doi.org/10.1208/s12248-017-0100-x (2017)
Russo, G. et al. In silico trial to test COVID-19 candidate vaccines: A case study with UISS platform. BMC Bioinform. 21, 527. https://doi.org/10.1186/s12859-020-03872-0 (2020).
Abdelmonein, A., Khalid, S. I., Fadl, H. A. O., Rewane, A. & Elbager, S. G. Exploring the power and promise of in silico clinical trials with applications in COVID-19 infection. Sudan J. Med. Sci. 16, 355. https://doi.org/10.18502/sjms.v16i3.9697 (2021).
Highly Efficient Clinical Trial Simulator (HECT). https://mtek.shinyapps.io/hect/ (2024)
SIMCor (In-Silico testing and validation of Cardiovascular IMplantable devices). https://www.simcor-h2020.eu/ (2024)
SIMCor: Virtual Research Environment: https://simcor.unitbv.ro/vre/ (2024).
In Silico World. https://insilico.world/ (2024)
SimInSitu. https://www.siminsitu.eu/ (2024)
SimCardioTest. https://www.simcardiotest.eu/wordpress/ (2024)
FDA: Successes and opportunities in modeling & simulation for FDA. A report prepared by the Modeling & Simulation Working Group of the Senior Science Council. https://www.fda.gov/media/163156/downloa (2022).
Verstraeten, S. et al. Generation of synthetic aortic valve stenosis geometries for in silico trials. Int. J. Numer. Methods Biomed. Eng. 40, e3778. https://doi.org/10.1002/cnm.3778 (2023).
Verstraeten, S. et al. Data underlying the publication: Generation of synthetic aortic valve stenosis geometries for in silico trials. 4TU.ResearchData. https://doi.org/10.4121/3f6a3788-96e6-4b81-b37b-f07eeec85965.v1 (2023)
ECRIN github: SIMCor. https://github.com/ecrin-github/SIMCor (2024)
SIMCor Virtual Research Environment. https://www.simcor-h2020.eu/the-simcor-vre-is-online (2024).
Huberts, W. et al.: SIMCor. Deliverable 4.3 - SOPs for virtual cohorts generation and validation (TUE, M36). Zenodo. https://doi.org/10.5281/zenodo.10932377 (2023).
European Commission: CORDIS - EU research results. In-silico trials for developing and assessing biomedical products: https://cordis.europa.eu/programme/id/H2020_SC1-PM-16-2017 (2024)
Aycock, K. I. Toward trustworthy medical device in silico clinical trials: A hierarchical framework for establishing credibility and strategies for overcoming key challenges. Front. Med. 12, 1433372. https://doi.org/10.3389/fmed.2024.1433372 (2024).
Pathmanathan, P. Credibility assessment of in silico clinical trials for medical devices. PLoS Comput. Biol. 20, e1012289. https://doi.org/10.1371/journal.pcbi.1012289 (2024).
Medical Devices: Use computational modelling and simulation (CM&S) to design, test, and validate medical devices, https://www.mathworks.com/solutions/medical-devices/in-silico-medicine.html#:~:text=In%20silico%20medicine%20significantly%20speeds,and%20testing%20through%20computational%20models (2024)
Viceconti, M. et al. POSITION PAPER: Credibility of in silico trial technologies: A theoretical framing. IEEE J. Biomed. Health Inform. 24, 4–13. https://doi.org/10.1109/JBHI.2019.2949888 (2020).
Bodner J. & Kaul, V. A framework for in silico clinical trials for medical devices using concepts from model verification, validation, and uncertainty quantification (VVUQ). Proceedings of the ASME 2021 Verification and Validation Symposium. https://doi.org/10.1115/VVS2021-65094 (2021)
Viceconti, M., Henney, A. & Morley-Fletcher, E. In silico clinical trials for developing and assessing biomedical products. Int. J. Clin. Trials 3, 37–46. https://doi.org/10.18203/2349-3259.ijct20161408 (2016).
Horner, M., Reiterer, M. & Rousseau, C. F. Avicenna Alliance Position Paper. Ensuring the quality of in silico evidence: Application to medical devices. Avicenna Alliance. https://www.avicenna-alliance.com/upload/avicenna-alliance-position-paper-in-silico-evidence-application-to-medical-devices-28-may-2021_64bfd88f420a0.pdf (2021).
The American Society of Mechanical Engineers (ASME): Assessing Credibility of Computational Modeling through Verification and Validation: Application to Medical Devices. VV-40 – 2018. https://www.asme.org/codes-standards/find-codes-standards/assessing-credibility-of-computational-modeling-through-verification-and-validation-application-to-medical-devices (2018)
Tiwari, K. et al. Reproducibility in systems biology modelling. Mol. Syst. Biol. 17, e9982. https://doi.org/10.15252/msb.20209982 (2021).
Blinov, M. L. et al. Practical resources for enhancing the reproducibility of mechanistic modeling in systems biology. Curr. Opin. Syst. Biol. 27, 100350. https://doi.org/10.1016/j.coisb.2021.06.001 (2021).
Sandve, G. K., Nekrutenko, A., Taylor, J. & Hovig, E. Ten simple rules for reproducible computational research. PLoS Comput. Biol. 9, e1003285. https://doi.org/10.1371/journal.pcbi.1003285 (2013).
Stodden, V. et al. Enhancing reproducibility for computational methods. Science 354, 1240. https://doi.org/10.1126/science.aah6168 (2016).
Leipzig, J., Nüst, D., Hoyt, C. T., Ram, K. & Greenberg, J. The role of metadata in reproducible computational research. Patterns 2, 100322. https://doi.org/10.1016/j.patter.2021.100322 (2021).
Wilkinson, M. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018. https://doi.org/10.1038/sdata.2016.18 (2016).
Avicenna Alliance Glossary. https://www.avicenna-alliance.com/glossary.html (2024)
May, M. C., Kiefer, L., Kuhnle, A. & Lanza, G. Ontology-based production simulation with OntologySim. Appl. Sci. 12(3), 1608. https://doi.org/10.3390/app12031608 (2022).
National Academies. Science Engineering Medicine: Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies. https://nap.nationalacademies.org/catalog/26360/transparency-in-statistical-information-for-the-national-center-for-science-and-engineering-statistics-and-all-federal-statistical-agencies (2022)
Aydin, B., Kiely, A. & Ohmann, C. Feasibility assessment of using CDISC data standards for in silico medical device trials. J. Soc. Clin. Data Manag. 4(1). https://doi.org/10.47912/jscdm.230 (2023).
The Comprehensive R Archive Network (R-CRAN). https://cran.r-project.org/ (2024)
Ushey, K., Allaire, J. & Tang, Y. reticulate: Interface to ‘Python’. R package version 1.39.0, https://github.com/rstudio/reticulate, https://rstudio.github.io/reticulate/ (2024).
Gautier, L. rpy2 (Version 3.5.14). Retrieved from https://github.com/rpy2/rpy2/releases/tag/RELEASE_3_5_14 (2023)

Acknowledgements

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101017578. The authors wish to thank Michael Stiehm, Jan Oldenburg, Finja Borowski (Institut für Implant Technologie und Biomaterialien e.V., Rostock) and Wouter Huberts, Sabine Verstraeten (Department of Biomedical Engineering, Eindhoven University of Technology) for providing information from the in-silico clinical trials performed in SIMCor. In addition, the authors wish to thank the SIMCor project manager Anna Rizzo (Lynkeus, Rome) and Sergio Contrino and Léopold Cudilla (ECRIN) for support with GitHub restructuring.

Author information

Authors and Affiliations

European Clinical Research Infrastructure Network (ECRIN), Kaiserswerther Strasse 70, 40477, Düsseldorf, Germany
Christian Ohmann
European Clinical Research Infrastructure Network (ECRIN), 30 Bd Saint-Jacques, 75014, Paris, France
Takoua Khorchani
Automation and Information Technology, Transilvania University of Brasov, Mihai Viteazu nr. 5, 5000174, Brasov, Romania
Alexandru Cracanel
Institut für Kardiovaskuläre Computer-Assistierte Medizin, Charité - Universitätsmedizin Berlin, Augustenburger Pl. 1, 13353, Berlin, Germany
Jan Brüning
Coordination Centre for Clinical Trials, Heinrich Heine University Düsseldorf, Moorenstrasse 5, 40225, Düsseldorf, Nordrhein-Westfalen, Germany
Pablo Emilio Verde

Contributions

CO coordinated the project and provided the initial and final version of the manuscript. TK developed and tested the Shiny application. PEV provided the underlying general statistical model, developed R-scripts for testing and supervised the Shiny development. AC implemented the Virtual Research Environment with integration of the R-statistical environment. JB was the liaison person to the SIMCor-project and provided feedback with respect to computational models and simulation for the cardiological use cases. All authors reviewed the manuscript and approved the final version.

Corresponding author

Correspondence to Christian Ohmann.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Ohmann, C., Khorchani, T., Cracanel, A. et al. An open source statistical web application for validation and analysis of virtual cohorts. Sci Rep 15, 15744 (2025). https://doi.org/10.1038/s41598-025-99720-3

Received: 04 November 2024
Accepted: 22 April 2025
Published: 06 May 2025
Version of record: 06 May 2025
DOI: https://doi.org/10.1038/s41598-025-99720-3