Introduction

Advances in microbial genomics have fundamentally reshaped our understanding of infectious diseases and their transmission. The COVID-19 pandemic has amplified the uptake of microbial sequencing for communicable disease control1 and international microbial genomics data sharing has become the foundation for surveillance to thwart future pandemics and epidemics2,3,4. Infectious disease control programmes include genomics-based surveillance as an integral strategy towards the control and elimination of these diseases5. Initiatives such as WHO Pandemic Treaty and the International Health Regulations have promoted international data sharing for global health6,7. However, disparities in access to high throughput sequencing and concerns about governance of shared genomics data remain major roadblocks for international data mobilisation4,8,9. Unsurprisingly, up to 37% of countries with the capacity for reporting on SARS-CoV-2 variants uploaded less than half of their sequences of SARS-CoV-2 variants of concern to international public repositories10.

Genome sequencing data has been recognised as an important intangible asset and the potential for a unique value added by the aggregation of international pathogen genomics data is immense11. Benefits of surveillance data linkage include more accurate data less prone to selection bias, the ability to study health in population-based cohorts, and improved case finding and contact tracing12,13. The utility of microbial genomics data has been gradually expanded from applied research to public health response, modelling of diseases and preventive interventions as well as the design of drugs and vaccines2,3,11,14. This new value extracted from timely and quality sequencing data and metadata is also critical for artificial intelligence (AI)-ready markets for genomic diagnostics and therapeutics15. However, a lack of contextual metadata accompanying shared sequences significantly limits the utility of pathogen genomic data in open databases and their impact on knowledge discovery and global health. Furthermore, the growing amount of sequencing data existing as “UJAD” (i.e. “unpublished in journals and available in databases”) remains internationally unseen and underutilised.

With the exponential growth of sequencing data, more attention is placed on improving data quality, especially for collaborative deep learning and other AI pipelines16. Here we describe the main features that define the value of microbial genomics data for secondary use which would facilitate and optimise the international mobilisation of genomic data. We employed Vibrio cholerae genomes shared via international repositories as an illustrative representation of a pandemic disease which is a target for elimination (see Methods and Supplementary Table S1). This pathogen remains endemic world-wide, causes sporadic outbreaks elsewhere and is responsible for at least 2.9 million cases of acute diarrheal disease each year17. There are thousands of strains of V. cholerae ranging from low-virulence environmental strains to those capable of causing pandemic cholera. Over the past 200 years, seven pandemics of cholera have been recorded. The current seventh pandemic started in 1961 and has been associated with the evolution of the pathogen that migrated throughout Asia and gradually acquired important virulence-associated elements18. While only the O1 and O139 serogroups of V. cholerae have been responsible for pandemics, other serogroups can cause sporadic cases and outbreaks of diarrhoea and systemic disease in vulnerable populations and act as evolutionary intermediaries in virulence gene acquisition17. The genomics of V. cholerae have been examined by many international groups and have been instrumental in the development and evaluation of vaccines.

The value of pathogen sequencing data can be defined by the nature of disease caused by this pathogen and by the availability of associated clinical, epidemiological and other contextual data (i.e. ‘metadata’)3,19. The larger the sets of shared genome sequences are, the more representative they can be of current disease activity and the more likely clusters of infections with a common and removable source can be identified and acted upon. For example, each additional 1,000 genomes of foodborne bacteria added to a national database in the USA could lead to a reduction of ~6 cases per pathogen per year20.

Each sequenced genome of a medically relevant microorganism is expected to have immediate value for the case management or disease control realised at the site of testing in order to justify the cost of sequencing. The focus of this paper, however, is on the deferred value of genome sequences and their associated metadata which can be realised nationally and internationally following data sharing. This deferred (sometimes also referred to as latent) value has usually been demonstrated by the re-analysis or secondary analysis of openly shared genomes. Some examples of deferred value extracted from microbial genomics data are listed in Table 1 and illustrate the importance of relevant contextual metadata for extracting such value as well as current trend of sequencing data reuse which is independent of extensive metadata provisions.

Table 1 Utility of secondary use of pathogen genomic data.

Results

We define the deferred value or reusability of shared data within a single genome as a function of four variables: data quality, novelty, associated metadata – which includes provenance and phenotype data for the microbial strain and the infected host – and the timeliness of genomic data release after the sample collection. The deferred value score was designed to identify anticipated utility for data aggregation and re-use among large sets of microbial genomic data available for secondary analysis (see Methods). This combined score allows a comparison of the value of individual sequences. It should minimise the utility of highly similar genomes from the same location with no contextual metadata and highlight genomes of potential value for data aggregation.

The value index of a collection of genomes represents their aggregated deferred value. It is based on the average value of the individual genomes in the collection, with a correction for the genomic diversity (see Methods). The diversity correction recognises that genomic datasets are typically of greater deferred value when they encompass a representative sample of a larger microbial population rather than focussing on closely related strains from a single outbreak.

Evolution of the nature of international pathogen genomic data sharing

We collected metadata and sequences of all 10,110 assembled genomes of V. cholerae submitted to NCBI GenBank between 2010 and 11 April 2024. We observed exponential growth of genomic data (Fig. 1) reflecting the gradual improvements in sequencing capability in cholera endemic countries (Fig. 2) over this period.

Fig. 1
figure 1

The timeline and volumes of NCBI GenBank submissions of 10,110 assembled V. cholerae genomes from different countries.

Fig. 2
figure 2

Choropleth map of V. cholerae assembled genome submissions to NCBI GenBank. Countries with high cholera burden estimates in each WHO region are labelled56. The number of assembled genomes on NCBI GenBank is reflected by colour, and the proportion of these that belong to the pandemic O1 or O139 serogroups based on in silico serotyping in our analysis is indicated.

The majority of genomic data was provided by countries with advanced economies, and the regions where cholera is endemic remained under-represented (Figs. 2, 3A). A total of 9,213 genomes could be in silico serogrouped, of which 65% (5,970/9,213) were O1 and 6.1% (567/9,213) were O139. The remainder constituted less virulent serotypes, which might be less relevant for epidemic disease control but are nevertheless important for understanding diversity and ecological niches. Cholera remains endemic (i.e. over 10 cases per 10,000 population) in 49 countries of Africa and Asia (Ending Cholera Report, 2017)21 but only 25 of them were represented in the dataset as sample origins, and only six of these (i.e. People’s Republic of China, Bangladesh, Mozambique, Ghana, India, and Sudan) as sequence submitters. Importantly, while academic institutions (i.e. universities and research institutes) as early adopters of the sequencing technologies were dominant suppliers of genomic data at the start of the period, microbiology service providers (i.e. public health and diagnostic laboratories) have significantly increased their input as genomic data providers. In 2023, the most recent complete year, microbiology service providers accounted for over 80% (532/655) of submitted V. cholerae genomes and over 50% (12/23) of submitting institutions (Fig. 3B).

Fig. 3
figure 3

The context of NCBI GenBank V. cholerae genome submissions. (A) The origin of V. cholerae producing samples subjected to the next generation sequencing and data sharing over period of the analysis. (B) Temporal trends in relative contributions of genome sequences by academic institutions and public and private microbiology service providers. (C) Upset plot describing patterns of provenance and phenotype data included in the sequence data submissions by academic researchers and microbiology service providers. (D) Box plots per year of submission describing yearly sample collection to genome submission time lags. The red horizontal line is the overall median (i.e. 8 years).

There was also significant variability of data features determining genomic data quality and the availability of key metadata. Only 36% (2,411/6,774) of assemblies submitted by academic institutions and 73% (2,445/3,336) of assemblies submitted by microbiology service providers included raw sequencing data (i.e. linked to entries on the NCBI Sequence Read Archive [SRA]). For academic institutions, the proportion of submissions without appropriate metadata about the source of samples remained relatively large at 62% (4,172/6,774), as did the proportion of sequences missing the year or country of sample origin at 28% (1,890/6,774). This compares to 17% (549/3,336) without sample source and 18% (601/3,336) without year or country of sample origin for microbiology service providers (Fig. 3C). We note that 3,224 of the BioSamples originated from a single academic institution and 1,062 originated from one reference microbiology service provider. Excluding these samples, the fraction missing source information changed to 27% (948/3,550) for academic and 23% (529/2,274) for microbiology service providers, and the fraction missing year or country changed to 22% (784/3,550) for academic and 18% (415/2,274) for microbiology service providers. The time lag between the collection of original samples which eventually produced V. cholerae sequences and the subsequent submission of an assembled genome to an international database for sharing was substantial (i.e. median 8 years) and has not improved over the years. There were also multiple examples when this collection to submission time lag exceeded 50 years (Fig. 3D).

Dissecting aggregated data value

The estimated value of genomic data for V. cholerae in NCBI GenBank has significantly increased in the last fifteen years. Figure 4 outlines key features of this data and shows the upward trend in the annual data value. The value index of aggregated submissions from microbiology service providers i.e. public health and clinical laboratories, appears to be higher than one of submissions from academic institutions (0.24 vs 0.04, respectively, Fig. 4A). While the annual value of genomes submitted by academic institutions remained relatively stable over the study period, the cumulative value of genomes provided by microbiology laboratories gradually increased (Fig. 4B). A large submission from a single academic institution in early 2024 with scarce metadata skewed the profile for that year. High, medium and low value genomes were dispersed across different sequence types of V. cholerae (Fig. 4C).

Fig. 4
figure 4

The aggregated value of V. cholerae genomes. (A) Value distribution and value index (lower right of each panel) with the horizontal scale proportional to number of sequences. Grey vertical lines are quartiles. (B) Temporal trends in the value index of annual submissions of V. cholerae genomes shared by academic institutions and microbiology service providers. (C) Minimum spanning tree representing the network distribution of value. The edges between nodes reflect number of allelic differences between related MLST types. Node sizes are proportional to the number of genomes associated with particular MLST. Edges of less than two allelic differences in length are collapsed and the number of genomes within each category is listed in square brackets.

When we focused on all genomes that could be in silico serotyped as either O1 or O139, similar trends were also observed in this subset (Fig. S3). Sharing of V. cholerae O1/O139 genomes during the current 7th pandemic may however offer more global health value for secondary use than sharing genomes of non-toxigenic strains.

Discussion

With growing demand for representative genomic data for secondary analysis, model validation, and to train AI models, it can be expected that producers of genomic data will take a keen interest in the value they are providing to other users when they decide whether to openly share their datasets. There are incentives to develop and maintain a collaborative environment for collective and responsible data value creation and extraction. However, genomic data is a non-rivalrous good (i.e. easy to copy and difficult to protect), making its value difficult to assess22. There is an acknowledgement that there is no direct correlation between the cost of sequencing and the true value of sequenced genomes. In addition, diminishing returns on investment when more closely related microbial genomes are sequenced from similar locations and points of time have been recognised14. The custodians of ‘raw’ genomic data are often disenfranchised for the international release of their potential assets and there is a risk of unequitable reuse of such data23. Our suggested framework helps to estimate the deferred value of shared microbial genomes and potential benefits for both data providers and recipients. It should assist in identifying high-value cases for genome sequencing and sharing, encourage their timely sharing for further value extraction and recognise the redundant data. It can also serve as an enabler for the urgent and sustained investment in microbial genomics and surveillance to support pandemic preparedness, global health and innovation7.

Importantly, an increasing proportion of microbial genomes are generated by a growing number of microbiology service providers implementing sequencing technologies. This reflects the maturation of the sequencing technologies and bioinformatics pipelines and their expansion from academic settings into diagnostic and public health laboratories. This shift has significant implications for microbial sequencing data sharing and stewardship. While research funded sequencing data are made accessible with publication, there is, in general, no such expectation for data generated by health service providers. These volumes of important but otherwise untapped UJAD data are likely to significantly expand in the coming years.

Both restricted availability of timely data and incompleteness of metadata limit the utility of genomic datasets24,25. It is concerning that the prolonged collection to submission time lag in association with V. cholerae genomes has not much improved since 2010. This is in contrast with 78% reduction in time lag achieved during the COVID-19 pandemic when it moved from a medial time of 85 days in 2020 to 19 days in 2021 based on 7 million SARS-CoV-2 submissions to GISAID by providers in 208 countries26. The scientific community and custodians of sequence databases should encourage provision of the date of sequencing by uploaders, as this information would allow more nuance in interpreting delays between sample collection and sequence sharing. Our study set includes a pandrug-resistant V. cholerae strain isolated from a cholera patient in Bangladesh – an example of a high-value sequence that would be of benefit to others if shared promptly – that was uploaded to GenBank one year after the sampling date, and not formally published until a further year later27. While there is insufficient information to determine why a given sequence was not promptly shared, we suggest that an appropriate measure of deferred value could incentivize greater sharing of such high-value sequences. The mechanism to incentivize sharing might take the form of an evidence-driven argument to funding agencies about the benefit of sharing data openly, or a data marketplace with some amount of compensation whether in a financial or reputational sense. Further work is warranted to develop governance structures that would support these approaches28.

Our example dataset indicated significant under-representation in public databases of relevant pathogen genomes from areas of their endemicity. It resonates with the COVID-19 pandemic experience when over 85% of SARS-CoV-2 genomes in GISAID have been provided by the laboratories from North America, Europe and Australia10. Such significant gaps in genomic data from low and middle-income countries disproportionally affected by epidemic diseases can decrease the aggregated value of genomic data. Genomic data from low-income but high disease prevalence countries may have an intrinsically higher value to the international data users as it would reduce bias in more accessible datasets from high-income low prevalence countries. Sequencing and sharing of archived samples collected before the routine availability of sequencing technologies, as is evidently common with V. cholerae, can enrich genomic datasets when they contribute genomic diversity that is not otherwise present in the dataset or provide historical context for the long-term evolution and genesis of lineages.

These findings reiterated the importance of equitable and timely sharing of sequences with appropriate metadata29. The consensus metadata schema should utilise suitable interoperable ontologies such as the Genomic Epidemiology Ontology (GenEpiO), which was developed according to the principles and practices of the Open Biological and Biomedical Ontology Foundry19,30, as well as other data interoperability standards31. The consensus metadata schema should also capture the key applications, such as public health and drug resistance surveillance, drug and vaccine discovery research, investigation of healthcare transmission events, as well as benchmarking and training of AI models32. The contemporary approach by international databases of accepting sequencing data with only minimal meta-data to overcome the governance barriers can be increasingly counterproductive as it a priori diminishes the value of data and the opportunity for shared value creation. It additionally imposes a significant burden on secondary users of the data to synthesize sources, interpret metadata and seek clarification or elaboration from contributors33. We need systems that can work alongside diverse cultural and legal traditions contributing to differences in health data protection legislation, to facilitate genomic data value creation and extraction, and reward value creators29.

Some limitations of our approach in the valuation of genomic data available to aggregation should be acknowledged. First, we applied a relatively narrow view of genomic data quality focusing primarily on the data accuracy and completeness rather than on data credibility and consistency. The value scores utilised in our exemplar dataset provide only a basic estimation of the non-monetary deferred value of shared microbial genomes irrespective of disease-specific attributes. They are focused on reusability of data and might be applied as the first step to identify genome uploads of potentially high or low value for data aggregation and secondary use including machine learning. Second, the value of genomic data release and aggregation can heavily depend on the application scenario and the stakeholder’s perspective. There could be inherent quality differences between data in user submitted public datasets and carefully curated databases for specific organisms23. The value of the same data in different scenarios can vary and stakeholders may have disparate views on how to measure the success of their data sharing. Further work to characterise secondary applications of sequencing datasets would help to refine important elements of the value score and their weighting. Third, we have explored the domain of V. cholerae sequences within NCBI GenBank. The added value of microbial genomics in other infectious diseases and datasets collated by more specialised databases like PubMLST34 might be different. The subset of assembled genomes employed in our analysis may have a bias towards higher quality submissions. However, we argue that the cholera dataset has provided a representative example reflecting globally relevant trends in data sharing and allowing scalable estimation of the day-after-tomorrow value proposition which will likely shape the value chains of pathogen genomics data for data mining, meta-analyses, collective intelligence and machine learning. We note that our framework does not assess the immediate value of microbial genomics for clinical management, disease surveillance, public health interventions, research, drug development, and other such applications realised by teams organising sequencing experiments.

In conclusion, with the looming risks of new epidemics and increasing value of data driven by advances in AI, the case for promoting and utilizing the deferred value of microbial genomic data is more compelling than ever before. Existing inequalities in genomic data production and analysis should be rebalanced and the collective value creation recognised and adequately supported in order to promote and sustain genomic data sharing. The current mismatch between growing the number of publicly available sequenced genomes and the lack of associated metadata significantly limits the opportunity for extracting deferred value from these genomes in full. The deferred value framework described here intends to incentivise prompt data sharing by maximising the value of sequencing accompanied with appropriate metadata with minimal time lag between sample collection and genome submission. The data value assessment should pave the way for the international mobilization of quality microbial genomic data for global health and knowledge discovery where shared data value creation is appropriately acknowledged and rewarded.

Methods

Data extraction

The NCBI databases were accessed through the Entrez interface using the R package “rentrez” version 1.2.3 (https://docs.ropensci.org/rentrez/). Candidate BioSamples were selected using the organism query “Vibrio cholerae[ORGN]” on the BioSample database, which returned 26,792 matches. The metadata was downloaded on 24 May 2024 and processed with our analysis scripts35. These were discarded unless they had a database link from their BioSample record to the assembly database. 10,110 BioSamples remained after this step and were retained for analysis, as listed in Supplementary Table S1. Genomic assemblies of these 10,110 Vibrio cholerae were downloaded using the NCBI dataset command line tool version 16.11.0 (https://github.com/ncbi/datasets).

Data normalization

The Entrez interface was used to extract all attributes associated with the analysis BioSamples. The attribute names were taken from the “harmonized_name” field to normalize conventions from different source databases. Several non-standard attributes were normalized where appropriate. Where a “sample derived from” attribute existed, any missing sample attributes were filled in from the source sample’s record. When the BioSample owner was recorded as “EBI”, the original owner’s name was taken from the “INSDC center name” attribute. A set of non-standard values denoting missing data (e.g., “Unknown” and “not provided”) were replaced with empty values across all attributes.

All attributes describing owners of the BioSample that appeared in the analysis set were annotated to distinguish between academic institutions (e.g., universities, biomedical research institutes and organizations) and microbiology service providers (e.g., hospital and public health laboratories). The list of all country names appearing in the “geo_loc_name” BioSample attribute in the analysis set was categorized according to cholera endemicity, defined as a reported disease burden in excess of 10 cases per 10,000 population21 (Supplementary Table S6), and membership on the International Monetary Fund’s World Economic Outlook advanced economies list (https://www.imf.org/en/Publications/WEO/weo-database/2023/April/groups-and-aggregates#ae, Supplementary Table S7). Note that the two categories of countries are mutually exclusive. The R code used to curate the data, and the categorizations of institutions and of countries, are reproduced in our analysis scripts35.

Sequence quality sub-score

The Entrez interface was queried to identify links from BioSamples to the SRA and assembly databases. Where multiple such links existed for a BioSample, the most appropriate entry was determined manually. The reported assembly length was compared to the length of the reference genome Vibrio cholerae strain RFB1636. The reported total base count from SRA was divided by the reference genome length (4,138,412 bases) to obtain the theoretical average sequencing depth. These values were then used to determine the sequence quality sub-score according to Supplementary Table S2, using genome depth recommendations for pathogen categorizations37 and the limits utilized by the genome size check metrics of NCBI GenBank for size assessment.

Contextual metadata sub-score

Metadata items were mapped to specific attribute names from standard upload schema or from freeform values appearing in the analysis dataset. For example, the sample source metadata item was mapped to attributes including “isolation_source” and “host_tissue_sampled”. A metadata item was considered to be provided for a BioSample if at least one of the corresponding attributes had a value. The presence of a link to PubMed was determined by checking for database links from BioSample to PubMed, and from BioProject to PubMed for all BioProjects linked to a given BioSample. These metadata items were then used to compute the contextual metadata sub-score (Supplementary Table S3). The details of this analysis are recorded in the R scripts in supplementary data.

Novelty sub-score

The novelty of sequences was determined by assessing BioProject attributes according to Supplementary Table S4. The “name,” “title,” and “description” attributes were used along with the number of linked BioSamples and the proportion of these that were reported as V. cholerae. Where a BioSample was linked to multiple BioProjects, the highest novelty sub-score was used.

Timeliness sub-score

The timeliness sub-score was computed according to Supplementary Table S5 using the difference between the year of the BioSample submission date and the year of the “collection_date” BioSample attribute.

Value score computation

We defined the deferred value of shared data within a single genome (V) as a function of the sequence quality sub-score (S), novelty sub-score (O), contextual metadata sub-score (M), and the inverse of the timeliness sub-score (T). Each component takes an integer value between 1 and 3, with higher scores for S, O, and M indicating “better” genomes, and lower values for T indicating more timely submissions. We estimated V as

$$V=\frac{3}{80}\left(\frac{{SOM}}{T}-\frac{1}{3}\right)$$

The constant terms adjust the range so that V takes values between zero and one.

Dataset value index computation

The economics approach to scientific data depends on the availability of data value estimates38. The value index (\(\bar{V{\prime} }\)) of a collection of \(N\) genomes represents their aggregated value. It is the average value score of the individual submissions, adjusted by genomic diversity:

$$\bar{V{\prime} }=\mathop{\sum }\limits_{i=1}^{n}\,\frac{{V}_{i}}{N}\left(1-\mathop{\sum }\limits_{j=1}^{R}{\left(\frac{{n}_{j}}{N}\right)}^{2}\right)$$

The sum term inside the parentheses is the Simpson index of diversity: for each of the R distinct multi-locus sequence types (MLST) in the dataset, nj is the number of sequences with that type. The term in parentheses therefore represents the probability that two sequences drawn from the dataset at random (with replacement) have different sequence types. The value index accordingly penalizes datasets with low genomic diversity.

MLST and in silico serotyping

Multi-locus sequence typing (MLST) was performed on the 10,110 assemblies as input using mlst version 2.23.0 (https://github.com/tseemann/mlst) with default settings and the Vibrio cholerae scheme (--scheme vcholerae). A minimum spanning tree was generated from the allele calls of 10,106 genomes using GrapeTree version 1.5.0 with the MSTree V2 algorithm (https://github.com/achtman-lab/GrapeTree). Four V. cholerae genomes were omitted because they had no MLST allele calls.

Serotypes were inferred from the same genomic assemblies using ABRicate version 1.0.1 (https://github.com/tseemann/abricate) with the CholeraFinder database (https://bitbucket.org/genomicepidemiology/choleraefinder_db). We considered hits only to three nucleotide sequences from this database: the ompW, rfbV-O1 (O1 serogroup specific genes) and wbfZ-O139 (O139 serogroup specific) gene markers. ABRicate was configured to use a minimum DNA identity threshold of 95% and minimum DNA coverage threshold of 60%. Genomes were inferred to be either O1 or O139 if they had a hit for the corresponding serogroup-specific gene, whether or not there was a hit for ompW. Other genomes with only a hit for ompW were inferred to be non-O1/O139 V. cholerae, while genomes with no hits to any of the three markers were undetermined.