Introduction

Multimodal data integration of radiological and histological imaging, clinical data, and molecular diagnostics has the potential to advance medicine beyond the current standard of care1. Artificial intelligence (AI), machine learning (ML), and big-data analytic technologies are expected to play a critical role in this advancement, yet these techniques are data hungry and require large amounts of multimodal data to train and validate AI/ML models2. Fortunately, there are now many data commons and repositories that are publicly available thanks to significant efforts by many organizations including the National Institutes of Health (NIH)3, which has supported several repositories, such as The Cancer Imaging Archive (TCIA, https://www.cancerimagingarchive.net/)4,5, The Imaging Data Commons (IDC, https://datacommons.cancer.gov/repository/imaging-data-commons)6, the National Clinical Cohort Collaborative (N3C, https://ncats.nih.gov/research/research-activities/n3c/overview)7, the BioData Catalyst (https://biodatacatalyst.nhlbi.nih.gov/)8, The Database of Genotypes and Phenotypes (dbGaP, https://www.ncbi.nlm.nih.gov/gap/)9,10, and the Medical Imaging and Data Resource Center (MIDRC, https://data.midrc.org/)11. Each was initiated and sustained for a different purpose. Some, like TCIA, IDC, and MIDRC, contain de-identified medical images. Others, such as N3C and dbGaP, contain medical record information related to medical images, such as clinical measurements taken near or at the time of imaging. The theme of this paper is the interoperability of different data commons with MIDRC.

Interoperability is one of the key guiding principles for scientific data management and stewardship outlined in the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), in which interoperability is defined as “the ability of data or tools from non-cooperating resources to integrate or work together with minimal effort”12. To be interoperable, according to the FAIR principles, the following must be true:

  • The (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

  • The (meta)data use vocabularies that follow FAIR principles.

  • The (meta)data include qualified references to other (meta)data.

The importance of interoperability in digital healthcare systems has been well recognized. Lehne et al.13 stressed the importance of interoperability for AI and big-data analytics, medical communication, medical research, and international cooperation. Perlin14 emphasized the central role of interoperability in realizing the economic and clinical benefits of big data. The authors14 suggested improving patient identification and data matching as one of the priorities in advancing health information technology interoperability, as errors in patient data matching can result in suboptimal care and medical errors.

Despite its well-recognized value, interoperability has not been commonly implemented in practice due to both technical and governance challenges. Lack of interoperability has resulted in isolated data silos and incompatible systems that prevented the linking of data from multiple sources, which is particularly critical in multimodal data applications where data are scarce. Intentional interoperability efforts are needed to create multimodal datasets to address this scarcity.

MIDRC is a multi-institutional collaborative initiative driven by the medical imaging community that was initiated in late summer 2020 to help combat the global COVID-19 health emergency. Leveraging the existing and developing infrastructure provided by the participating organizations, MIDRC serves as a linked-data commons that coordinates access to data and harmonizes data management activities. MIDRC was designed to follow the FAIR principles, including interoperability. MIDRC continues to expand by ingesting imaging studies acquired for applications beyond COVID-19, such as oncology. Since its inception, MIDRC has also fostered FAIR principles by hosting a simple-to-use data portal (http://data.midrc.org) for exploration and cohort building and by sharing data and associated algorithms openly and freely.

The purpose of this study was to demonstrate the interoperability between MIDRC and two other data repositories, BioData Catalyst (BDC, Fig. 1) and N3C (Fig. 2), by describing the datasets that result from interoperability efforts between these repositories. The focus was to create cohorts for two example use cases: (1) a cohort for multi-omics association studies and data fusion (demonstrated on BDC) and (2) a cohort for developing AI computer vision models for medical images (demonstrated on N3C).

Fig. 1

Overview of interoperability between MIDRC and BioData Catalyst. Relative sizes of the datasets are not shown to scale.

Fig. 2

Overview of interoperability between MIDRC and N3C. Relative sizes of the datasets are not shown to scale.

Results

We developed methods of interoperability between MIDRC and BDC and between MIDRC and N3C (see “Methods” for details) and used these methods to curate two multimodal datasets. The first dataset, the Repository of Electronic Data COVID-19 Observational Study (RED CORAL, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002363.v1.p1, dbGaP Study Accession: phs002363.v1.p1) dataset15, was collected by the Prevention and Early Treatment of Acute Lung Injury (PETAL) Network for investigation of demographics, clinical characteristics, risk factors, care practices, outcomes and resource utilization of patients hospitalized with severe acute COVID-19. The clinical data of RED CORAL (demographics, laboratory test results, medical history and blood pressure tests) are hosted by BDC with 1,480 unique patients, and the imaging data are hosted by MIDRC. Via interoperability, we identified 1,477 unique patients with matches between BDC and MIDRC, with 1,223 patients having images acquired from March 1 to April 1, 2020 that are currently hosted on MIDRC. The images include chest X-ray images for 1,200 patients and chest CT images for 226 patients. Table 1 summarizes the clinical and demographic characteristics of the dataset of 1,223 unique patients with matches between BDC and MIDRC.

Table 1 Characteristics of patients in the RED CORAL dataset with clinical data in BDC matched with imaging data in MIDRC.

The second dataset, the National Clinical Cohort Collaborative (N3C)7, is a collection of data from over 80 institutions originally created to expand knowledge and treatment of COVID-19. N3C holds a wide range of clinical data, including clinical observations, lab results, medication records, procedure descriptions, and visits. As of March 2024, N3C holds data for 22.1 million unique persons, including over 8.7 million COVID-19 positive cases. Via interoperability, we identified 2,124 unique patients with matches between N3C and MIDRC, with images having been acquired between March 2020 and June 2021. Table 2 summarizes the demographic information of the 2,124 unique patients in the matched dataset. Additional patient characteristics, including history of COVID and smoking status, were identified using the N3C Logic Liaison tools, which serve as value sets that map concepts with values (https://covid.cd2h.org/dashboard/concept-sets).

Table 2 Characteristics of patients with clinical data in N3C matched with imaging data in MIDRC.

To measure how well the cohorts represent the actual COVID-19 positive patient population, we used the Jensen-Shannon Distance (JSD) metric16 to compare each dataset with the cumulative COVID-19 positive patient case counts from the Centers for Disease Control and Prevention (CDC)17 (Tables 3–5). Measuring representativeness enables users to assess how similar patients in the data subset are to the broader population, and the JSD provides a summary measure that takes into account multiple subgroups. The JSD indicated varying levels of similarity between the CDC case counts and the demographics of the patients in RED CORAL for whom imaging data exist in MIDRC and of the patients in N3C for whom imaging data exist in MIDRC (Fig. 3). The two datasets appear to represent the patient population (as defined by the CDC statistics) well in terms of sex distribution, with both JSD values below 0.2. In contrast, the raw JSD values for the race and ethnicity distributions are in the range of 0.45–0.6 for both datasets, and the race JSD value for the N3C dataset remains above 0.5 after adjustment assuming missing at random. A close examination of the race distributions indicates a substantial difference between the curated MIDRC-N3C overlapping subset and the CDC population statistics. For example, the Black and White subpopulations account for 76.3% and 15.9%, respectively, of the N3C dataset, whereas the corresponding figures in the CDC population are 8.9% and 46.1% (or 15.1% and 78.6%, respectively, after adjustment for missing at random). Such a characterization has important implications for the development and assessment of AI/ML models. Subgroup analysis would be necessary to examine the effect of the mismatch, and mitigation may be necessary to avoid biased performance18 on the two subgroups.

Table 3 Proportion of COVID-19 positive case counts by sex calculated based on CDC data over the comparison periods used in this study.
Table 4 Proportion of COVID-19 positive case counts by race over the comparison periods used in this study calculated based on CDC data.
Table 5 Proportion of COVID-19 positive case counts by ethnicity over the comparison periods used in this study calculated based on CDC data.
Fig. 3

Representativeness of patient cohorts relative to the CDC cumulative COVID-19 positive case counts over their respective time periods. Results are shown when calculating the JSD using (1) the raw CDC data and (2) all data adjusted under the missing-at-random assumption. Lower JSD values indicate more similarity between the dataset and its comparison group (cumulative COVID-19 positive patient case counts from the CDC).

Discussion

The curation of these datasets was conducted with the goal of demonstrating interoperability between data repositories funded and created with separate mechanisms and aims. Through collaboration and cooperation between governance organizations, the datasets demonstrate the potential gains that can be made through multimodal research by identifying cohorts of patients with data held at different repositories. Such multimodal datasets may enrich research initiatives, with the potential for greater information compared to using single modality datasets on their own.

The multimodal interoperability datasets described here were intentionally crafted following the FAIR principles, which notably include the goal of “minimal effort.” One outstanding factor that increases the effort required for interoperability between these repositories is the difference in governance systems. By design, the data held in MIDRC undergo extensive de-identification as they are ingested and thus are freely available to all who register for access. In contrast, N3C places restrictions on data access, does not allow data download, and requires an application process for data uploads, due to the nature of the controlled data it holds. This means that users who wish to interoperate between the two repositories are limited to using N3C as the computational enclave for interacting with the connected data. On the other hand, BDC users can download and compute on the data locally (on their own computer or in cloud-based computational spaces) after authorization is granted by dbGaP, and can then access the data during the active period of the authorized project following a Data Use Certification Agreement, which specifies terms such as “for research use”, user responsibilities, non-identification, and so on. Implementing interoperability therefore takes considerable administrative effort. Technologically, interoperability between MIDRC and BDC can be automated to a large extent; for this use case, matching of data and linking of identifiers across the two repositories leveraged the Gen3 crosswalk service. However, as implemented here, the Gen3 crosswalk service was not fully automatic because the governance of the two data commons is different. It is possible that early planning of coordination between data repositories with different governance models could broaden the availability of multimodal data through interoperability, reducing the time and effort needed.

In the future, interoperability among the many data commons may be facilitated by further development and adoption of multimodal healthcare data standards, which refer to “methods, protocols, terminologies, and specifications for the collection, exchange, storage, and retrieval of information associated with health care applications, including medical records, medications, radiological images, payment and reimbursement, medical devices and monitoring systems, and administrative processes”19. An example of a data standard is the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), which is an open community data standard designed to standardize the structure and content of observational data and to enable efficient analyses that can produce reliable evidence (https://www.ohdsi.org/data-standardization/). Moreover, the use of common data elements, defined as data collection units comprising one or more questions together with a set of valid values, can play a valuable role for widescale interoperability20. These efforts are especially important for addressing the challenges of combining imaging data with other clinical information.

Methods

Interoperability between MIDRC and BDC

The first example use case benefitting from interoperability between data repositories is to acquire multimodal data from multiple sources in order to enable association, correlation, and fusion analyses. For example, for the PETAL RED CORAL dataset, the BDC repository has clinical metadata of COVID-19 patients whereas MIDRC has the medical images of the same patients. Here the use case is to investigate the associations of multi-modality data (e.g., medical images and clinical laboratory testing) or integrate the multimodal data using AI/ML fusion models for a certain clinical task such as the characterization of COVID severity to tailor patient treatments (Fig. 4).

Fig. 4

Overview of example use case of interoperability between MIDRC and BioData Catalyst for multimodal data fusion.

The demonstration of interoperability between MIDRC and BDC consists of matching patients with clinical data in BDC to the corresponding patients with imaging data in MIDRC. To implement this, we followed the steps illustrated (in a simplified presentation) in Fig. 5.

Fig. 5

Workflow for interoperability between MIDRC and BioData Catalyst.

Both MIDRC and BDC operate upon the Gen3 Data Platform (https://gen3.org/). Gen3 is an open-source data platform for managing, analyzing, and sharing biomedical data. It supports open Application Programming Interfaces (APIs) so that the data it manages are findable, accessible, interoperable and reusable (FAIR)12. The FAIR APIs that both MIDRC and BDC provide are the essential foundation for the interoperability described in this study. Gen3 supports data objects, such as image files; structured data, such as clinical data; and semi-structured data, such as JSON-based metadata. Gen3-based platforms, including MIDRC and BDC, support a variety of identity providers for authenticating users, including InCommon (https://incommon.org/) and ORCID (Open Researcher and Contributor Identifier, https://orcid.org/). Gen3-based platforms also support several methods for managing authorization information specifying which users are authorized to access which data. In particular, Gen3 can interoperate with the NIH dbGaP system21, which is used by BDC to manage authorization information.

Data linkage across MIDRC and BDC is performed through Gen3’s opaque Globally Unique IDentifiers (GUIDs) that follow the GA4GH Data Repository Service (DRS) standard22. A DRS identifier has a prefix specifying a particular data platform, such as MIDRC or BDC, followed by a GUID. Opaque in this context means that the GUID does not contain a name, medical record number or any other string that has semantic information.
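The prefix-plus-GUID structure described above can be illustrated with a minimal parsing sketch. It assumes, purely for illustration, that the prefix and GUID are separated by a slash and that the GUID is a standard UUID; the prefix `dg.EXAMPLE`, the function `parse_drs_id`, and the type `DrsId` are hypothetical and are not taken from MIDRC, BDC, or the Gen3 code base.

```python
import uuid
from typing import NamedTuple


class DrsId(NamedTuple):
    prefix: str  # identifies the data platform (commons)
    guid: str    # opaque identifier carrying no semantic information


def parse_drs_id(identifier: str) -> DrsId:
    """Split '<prefix>/<guid>' and confirm the GUID part is a well-formed UUID."""
    prefix, sep, guid = identifier.partition("/")
    if not sep or not prefix or not guid:
        raise ValueError(f"expected '<prefix>/<guid>', got {identifier!r}")
    uuid.UUID(guid)  # raises ValueError if the GUID is not a valid UUID
    return DrsId(prefix, guid)


# Hypothetical identifier; 'dg.EXAMPLE' is a made-up platform prefix.
example = parse_drs_id("dg.EXAMPLE/123e4567-e89b-12d3-a456-426614174000")
print(example.prefix, example.guid)
```

Because the GUID is opaque, nothing about the patient can be read from the identifier itself; only the prefix reveals which platform resolves it.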

Accessing data from both MIDRC and BDC requires authentication. In addition, the BDC clinical data used for this study are controlled access and require authorization through dbGaP. The required authentication and authorization are handled by MIDRC and BDC through the Gen3 Fence service. Once a user is authenticated and authorized and a cohort or dataset is specified, the DRS identifiers for the data in the cohort or dataset can be accessed in a workspace or downloaded. In particular, a list of DRS identifiers for the MIDRC PETAL RED CORAL Imaging dataset can be obtained in this way and a list of DRS identifiers for the BDC PETAL RED CORAL dataset can also be obtained. Note that for a user to be authorized to access the RED CORAL clinical dataset on BDC, the user must register on dbGaP, create a project, submit a Data Request Form, and have the Data Request Form approved by the Data Access Review Committee21. After approval, the dbGaP authorization information for the user is updated, and this information is available for systems that are approved for interoperating with dbGaP, such as Gen3.

Gen3 has a privacy preserving record linkage (PPRL) service called the Crosswalk Service that, given DRS identifiers for images from a data commons, can provide the associated DRS identifiers for matching data in another commons. For example, given a list of DRS identifiers for images in MIDRC, the Crosswalk service can provide the DRS identifiers for associated clinical data in BDC, and vice versa. Note that since both MIDRC and BDC use privacy preserving opaque identifiers, the Crosswalk service must be provided with a mapping or cross linking of these opaque identifiers. This mapping is usually provided when data are submitted, but can be done at any time. It is important to note that even with this cross linking of identifiers, all the information is still private since all the identifiers are opaque and contain no PII.
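The lookup behavior described above can be sketched with an in-memory table. This is only a conceptual stand-in and not the Gen3 Crosswalk Service API: the identifiers, the `CROSSWALK` table, and the `crosswalk` function are all hypothetical.

```python
from typing import Dict, Iterable, Optional

# Hypothetical cross-linking table, of the kind supplied at data submission:
# opaque imaging-commons identifier -> opaque clinical-commons identifier.
CROSSWALK: Dict[str, str] = {
    "img.EXAMPLE/0001": "clin.EXAMPLE/a17f",
    "img.EXAMPLE/0002": "clin.EXAMPLE/c93b",
}

# The reverse direction ("and vice versa") is the inverted table.
REVERSE: Dict[str, str] = {clin: img for img, clin in CROSSWALK.items()}


def crosswalk(ids: Iterable[str], table: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return the linked identifier for each input, or None when no link exists."""
    return {identifier: table.get(identifier) for identifier in ids}


forward = crosswalk(["img.EXAMPLE/0001", "img.EXAMPLE/9999"], CROSSWALK)
backward = crosswalk(["clin.EXAMPLE/c93b"], REVERSE)
print(forward)
print(backward)
```

Both sides of the table hold only opaque identifiers, which is why the mapping itself exposes no personally identifiable information.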

With these lists of DRS identifiers, the image data and corresponding clinical data can be easily exported from the commons and imported into any analysis environment that is approved for managing controlled access data and is authorized to interoperate with the commons23. Sometimes these are called authorized environments, computational enclaves, or freeports23. For this study, the image and corresponding clinical data were imported into workspaces that were part of Gen3’s Biomedical Research Hub24, and which are approved analysis environments for analyzing controlled access BDC data.

In summary, interoperability between MIDRC and BDC illustrates the process for collecting patient data from both an imaging and a non-imaging data source.

Interoperability between MIDRC and N3C

The second example use case benefitting from interoperability between data repositories is to aggregate data in one to create cohorts in another in order to enable development of AI models that incorporate data across modalities. In this example use case, the goal is to develop an algorithm based on medical images to predict severity of COVID-19 disease, defined as the admission of a COVID-19 positive patient to the intensive care unit (ICU) or intubation within 24 hours of chest radiography. The chest radiographs are to be collected from MIDRC, while the clinical data (i.e., information on ICU admission and/or intubation or lack thereof) are to be collected from N3C, which ingests data and transforms the associated data models to a harmonized Observational Medical Outcomes Partnership (OMOP) analytics dataset (https://ncats.nih.gov/research/research-activities/n3c/covid-enclave/data-overview; https://www.ohdsi.org/data-standardization/). Therefore, a MIDRC user developing the algorithm would aim to create two cohorts of images: (1) patients with severe COVID and (2) patients with mild COVID (Fig. 6).
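The severity definition above can be written as a small labeling rule. The sketch below assumes a symmetric 24-hour window around the radiograph, which the text does not specify (a one-sided window after imaging would be an equally plausible reading); the function name `is_severe` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

WINDOW = timedelta(hours=24)  # window taken from the severity definition


def is_severe(
    cxr_time: datetime,
    icu_time: Optional[datetime] = None,
    intubation_time: Optional[datetime] = None,
) -> bool:
    """True if ICU admission or intubation falls within 24 h of the radiograph.

    Assumption: the window is symmetric (before or after imaging).
    """
    for event in (icu_time, intubation_time):
        if event is not None and abs(event - cxr_time) <= WINDOW:
            return True
    return False


cxr = datetime(2020, 4, 1, 8, 0)
print(is_severe(cxr, icu_time=datetime(2020, 4, 1, 20, 0)))  # True: 12 h later
print(is_severe(cxr, icu_time=datetime(2020, 4, 3, 8, 1)))   # False: over 48 h later
print(is_severe(cxr))                                        # False: no event
```

Patients labeled `True` would form the severe cohort and the remaining COVID-19 positive patients the mild cohort.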

Fig. 6

Overview of example use case for cohort building between MIDRC and N3C for developing AI algorithms that incorporate image-based data.

The first step for interoperability is to identify and characterize the subjects with relevant data in each data repository, based upon the task for which the AI is to be developed. MIDRC and N3C have different governance (including models of access), which impacts the interoperability workflow. Conducting this use case requires that users have separate log-in accounts at MIDRC Open Data Commons and at N3C. Any registered user at the MIDRC Open Data Commons can download images, while registered users at N3C must complete a Data Use Request (DUR) for the specific study they are conducting and agree to keep the individual N3C data within an N3C computational enclave (freeport). At this time, the matching of patients with data in both N3C (clinical data) and MIDRC (imaging data) is conducted using Privacy Preserving Record Linkage via an honest broker. These matches are produced on request and held as a table within N3C, which for now operates as the freeport enclave. Additionally, it is recommended that users use this list as the starting point and then download from MIDRC only those images relevant for the study and associated with the patients in the match table.
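The recommendation to download only the relevant images for matched patients amounts to a filter over an image manifest. The following sketch assumes a manifest represented as a list of dictionaries; the field names `patient_id`, `modality`, and `file`, the patient identifiers, and the function `select_study_images` are hypothetical. The DICOM modality codes CR and DX are used here to stand for chest radiographs.

```python
from typing import Dict, Iterable, List, Sequence


def select_study_images(
    manifest: Iterable[Dict[str, str]],
    matched_ids: Iterable[str],
    modalities: Sequence[str] = ("CR", "DX"),  # radiograph modality codes
) -> List[Dict[str, str]]:
    """Keep only rows for matched patients and for the modalities of interest."""
    matched = set(matched_ids)
    return [
        row for row in manifest
        if row["patient_id"] in matched and row["modality"] in modalities
    ]


# Hypothetical manifest rows and match list.
manifest = [
    {"patient_id": "P-001", "modality": "CR", "file": "a.dcm"},
    {"patient_id": "P-001", "modality": "CT", "file": "b.dcm"},
    {"patient_id": "P-002", "modality": "DX", "file": "c.dcm"},
    {"patient_id": "P-003", "modality": "CR", "file": "d.dcm"},
]
selected = select_study_images(manifest, ["P-001", "P-002"])
print([row["file"] for row in selected])  # ['a.dcm', 'c.dcm']
```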

Subsequently, the user can calculate an imaging-derived measure (such as the severity index25) on the images from the MIDRC cohort using their local computer, and then import the results into the freeport enclave (which is for now limited to the N3C enclave) via an upload request. Figure 7 outlines the steps required for interoperability between N3C and MIDRC.

Fig. 7

Workflow for interoperability between MIDRC and N3C.

In summary, interoperability between MIDRC and N3C illustrates the process for creating patient cohorts of imaging data based upon characteristics from clinical data.

Characterization of the representativeness of the curated datasets

To characterize the representativeness of the two datasets we curated via interoperability, we evaluated the demographic characteristics of patients in each of the two datasets for the categories of sex, race, and ethnicity. Similar to our previous work in this area26, we compared each of these demographic characteristics to the cumulative COVID-19 positive case counts as reported by the Centers for Disease Control and Prevention17 over the period of image collection, using the Jensen-Shannon Distance (JSD)16 as a measure of similarity. The JSD is bounded between 0 and 1 when log2 is used within the analytical expression for this measure, where JSD = 0 indicates complete similarity between two distributions and JSD = 1 indicates no similarity.
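For reference, the standard definition of the measure can be written as follows, using the symbols $P$ and $Q$ (introduced here for illustration) for the dataset and population distributions over the same demographic categories $i$:

```latex
\mathrm{JSD}(P, Q) = \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2}\, D_{\mathrm{KL}}(Q \,\|\, M)},
\qquad M = \tfrac{1}{2}(P + Q),
\qquad D_{\mathrm{KL}}(P \,\|\, M) = \sum_{i} p_i \log_2 \frac{p_i}{m_i}
```

Terms with $p_i = 0$ contribute zero to the sum. With the base-2 logarithm, the quantity under the square root lies between 0 and 1, which gives the bounds stated above.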

A practically important issue is that demographic characteristics are commonly missing from collected data. As a hypothetical example, assume that the sex distribution of a curated dataset is 45% female, 45% male, and 10% missing, while the population distribution is 35% female, 35% male, and 30% missing. A raw JSD score using the three categories (female, male, missing) is 0.18. However, if the missing information can be assumed to be distributed at random (i.e., not associated with sex), then the adjusted distribution would be 50% female and 50% male for both the curated dataset and the population, thereby yielding a JSD score of 0. This means that, if missing data are randomly distributed, the raw JSD in this example is driven entirely by the difference in the proportion of missing data rather than by any difference in sex distribution. In this study, we provided both the raw and adjusted JSD values.
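The hypothetical example can be reproduced with a short, dependency-free sketch. The function names `jsd` and `drop_missing` are illustrative and do not come from the study's code. One observation from the arithmetic, offered as an inference rather than a statement of the authors' procedure: the raw value evaluates to about 0.22 with the base-2 logarithm and about 0.18 with the natural logarithm, so the 0.18 quoted in the example appears to correspond to natural-log scaling.

```python
import math
from typing import Dict


def jsd(p: Dict[str, float], q: Dict[str, float], base: float = 2.0) -> float:
    """Jensen-Shannon distance (square root of the JS divergence)."""
    divergence = 0.0
    for key in sorted(set(p) | set(q)):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = (pk + qk) / 2.0  # mixture distribution
        if pk > 0:
            divergence += 0.5 * pk * math.log(pk / mk, base)
        if qk > 0:
            divergence += 0.5 * qk * math.log(qk / mk, base)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


def drop_missing(dist: Dict[str, float], missing_key: str = "missing") -> Dict[str, float]:
    """Missing-at-random adjustment: remove the missing category and renormalize."""
    kept = {k: v for k, v in dist.items() if k != missing_key}
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}


dataset = {"female": 0.45, "male": 0.45, "missing": 0.10}
population = {"female": 0.35, "male": 0.35, "missing": 0.30}

print(round(jsd(dataset, population, base=2.0), 2))      # 0.22 (log base 2)
print(round(jsd(dataset, population, base=math.e), 2))   # 0.18 (natural log)
print(jsd(drop_missing(dataset), drop_missing(population)))  # 0.0 after adjustment
```

After the adjustment both distributions are 50% female and 50% male, so the distance is zero regardless of the logarithm base.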