Main

The study of the history of life on Earth is inherently multidisciplinary and conducted at various scales from local to global. This scientific inquiry draws from geology, biology, chemistry, archaeology and mathematics, among others, to extend beyond human temporal perspectives, reconstruct ancient ecosystems, investigate the drivers of biodiversity and forecast how life will respond to today’s changing environments1,2,3. The fossil record is essential for understanding biodiversity and Earth system processes operating at timescales beyond the twentieth and twenty-first century window of instrumental observations. The past also provides examples of Earth system states with instructive analogies to the societally novel climates that are now emerging4,5. From their very beginning, palaeontological databases (see ‘Glossary’ in Supplementary Table 1) have played pivotal roles in enabling the field to scale up from site-level studies to global-scale research. These databases were founded by scientists seeking to address questions beyond the scope of any individual palaeontological dataset, including identifying global mass extinctions and their roles in macroevolution6 and the earliest evidence of climate-driven species’ range shifts and ecosystem transformations7,8. The subsequent migration of palaeontological databases to open-access online platforms and data systems (encompassing the database, its system for community governance and data curation, and any associated software services) increased their accessibility and amplified their impact by enabling new questions to be explored, broader collaboration and reproducibility9,10.

Today, openly accessible, community-run data systems function as collective archives for scientific data and knowledge about the history of life on Earth11 (Fig. 1). These databases are invaluable for quantitatively reconstructing ancient ecosystems12, tracing evolutionary pathways13, studying climate- and human-driven eco-evolutionary dynamics at continental to global scales5,14,15,16, and predicting future biological and geological changes—or at least assessing the limits to predictability in an increasingly novel world17,18. By integrating these palaeontological databases with other open data systems, scientists can tackle increasingly complex, multifaceted questions that are top priorities in global change research3,19,20.

Fig. 1: Palaeontological information in an Earth system context.

From left to right, planetary or global-level information can be used to understand tectonic processes, climate and landscape evolution, and eco-evolutionary processes across timescales ranging from the present to billions of years. Outcrop- or borehole-level data provide local- to regional-scale time series that can be used to reconstruct climate, geochronology (age), sea level fluctuations and community dynamics. Finally, specimen-level data are the foundational unit in palaeobiology for analyses of, for example, taxonomy, biotic interactions, geochemistry, functional ecology, behaviour and taphonomic processes. Credit: Science Graphic Design.

Representing developers, leaders, curators and users of 15 community-run palaeontological databases (Supplementary Table 2), we review the current landscape of palaeontological data systems to assess the volume, variety and value of data held in these community-curated, openly accessible databases, their diversity dynamics and longevity, the challenges faced and the opportunities for sustainable growth and scientific discovery. We close by providing recommendations for continued investment from researchers, maintainers, developers and funders.

An overview and history of palaeontological data and databases

Key concepts

Palaeontology aims to reconstruct the history of life across the broadest possible range of spatiotemporal scales and throughout the geological record (Fig. 1). Here, palaeontology encompasses closely related fields, including but not limited to palaeobiology, biostratigraphy and palaeoecology. As our collective understanding of geological processes evolves, new scientific questions emerge, and our interpretation of the fossil record is updated. This, in turn, affects our understanding of the processes that we infer from it and drives new primary-data collection campaigns (for example, fieldwork) and the reinterpretation and reanalysis of existing data. Examples include taxonomic reidentification of old fossils following new finds21, redating of core samples and refinement of the geological timescale using newer and improved methods and data (for example, ref. 22), re-interpretation of environmental/depositional contexts (for example, ref. 23) and incorporation of palaeobiogeographic patterns into tectonic models (for example, ref. 24).

Palaeontologists work with two primary forms of data: ‘fundamental data’ and ‘processed data’. Fundamental data are direct observations and sampling of the sedimentary record and fossil specimens within these sediments. Examples of fundamental data include geospatial locations, physical samples, multimedia recordings, counts and geochemical analyses. When these fundamental data are subject to further interpretation, such as through taxonomic study and analyses of morphology, preservation and biotic associations, they are translated into processed data. For example, within database structures, age controls (for example, radiocarbon dates) are fundamental data, whereas age-depth models (used to estimate the age of different depths within a sediment core or stratigraphic profile) are processed data and are frequently revised. Although fundamental and processed data exist on a continuum, whenever possible, palaeontological databases should maintain the strongest links to fundamental data and the associated physical samples or specimens. A focus on fundamental data and well-established provenancing is essential for reproducibility and resampling efforts (for example, when palaeontologists remeasure a fossil or reassess its taxonomic identity). A focus on fundamental data also reduces database maintenance costs, because of the frequent revisions associated with processed data. Lastly, good provenancing can guard against corollary risks such as ‘data cannibalism’25,26, which arises when databases are used as data sources for other, secondary databases without proper attribution and dataset-level provenancing, and can violate the standard CC-BY licences that accompany most open-access data resources.
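The fundamental/processed distinction can be made concrete in a data model. The sketch below (Python; all field names are illustrative and not drawn from any specific database schema) stores age controls as immutable fundamental records, while the derived age-depth model keeps explicit links back to its inputs so that it can be re-derived whenever the method is revised:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgeControl:
    """Fundamental datum: a direct measurement tied to a physical sample."""
    sample_id: str      # provenance link to the physical sample
    depth_cm: float
    age_bp: float       # e.g. a calibrated radiocarbon age
    error_bp: float

@dataclass
class AgeDepthModel:
    """Processed datum: an interpretation derived from age controls.
    Frequently revised, so it retains links to the fundamental records."""
    controls: list      # the AgeControl records the model was fitted to
    method: str         # e.g. 'linear interpolation'

    def age_at(self, depth_cm: float) -> float:
        """Estimate the age of a depth by linear interpolation
        between the bracketing age controls."""
        pts = sorted((c.depth_cm, c.age_bp) for c in self.controls)
        for (d0, a0), (d1, a1) in zip(pts, pts[1:]):
            if d0 <= depth_cm <= d1:
                return a0 + (a1 - a0) * (depth_cm - d0) / (d1 - d0)
        raise ValueError("depth outside the modelled range")
```

Because the processed object carries its `controls`, a revised age-depth method can regenerate every estimate without touching the fundamental records, which is the provenance property argued for above.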

Database development history

The past—first-generation research databases

First-generation compilations of palaeontological data were usually launched with a specific research question or other purpose in mind and focussed on the collation of processed data: for example, the inferred temporal (that is, stratigraphic) distributions of fossil taxa, harmonized into taxonomic lists across sites, which are the minimum requirement for assessing the history of biodiversity (for example, refs. 27,28) and the shifting distribution of taxa across space and time8,29. These were initially collated as physical repositories (for example, the John Williams Index of Palaeopalynology30) or as offline digital entities (for example, Sepkoski’s Compendium28,31 and the first version of the Neptune Sandbox database (NSB)30). These first-generation databases were often built either by individual scientists over their careers or by small research teams.

The present—second-generation multipurpose and community data systems

As the field advanced, palaeontologists gained further understanding of the various factors that distort the structure of the fossil record (for example, ref. 32), new research questions emerged (for example, reconstruction of past biomes and terrestrial carbon sequestration33), and palaeontologists developed new quantitative methods to address emergent questions. These, in turn, led to new efforts to reanalyse existing databases. For example, in deep-time biodiversity studies, the field progressed from recording observed first- and last-appearance dates of taxa31 to the recording of fossil occurrences from the entire stratigraphic record (for example, refs. 13,34). As the breadth of questions increased, second-generation data systems (for example, Paleobiology Database (PBDB)34 and Neotoma10) also began to store an increasing variety of fundamental data types, including the geographic coordinates of fossil sites, taxon abundance and traits, stratigraphic position, lithological characteristics and environmental covariates.

In parallel to this expansion of database scope, the leadership and development of these databases increasingly shifted from a few individual experts to community-curated data systems. For example, in Quaternary palynology, individual efforts to build databases and map continental-scale plant distributions for North America and Europe8,29 expanded to continental-scale databases around the world, each with their own data leaders and stewards10,35,36,37.

The data structures of current, second-generation databases vary substantially, reflecting their founding aims and user communities. As examples, PBDB was originally developed to enable, among other things, sampling-standardized estimates of Phanerozoic diversity13; NOW (New and Old Worlds database of fossil mammals) focused on Cenozoic mammal macroevolution38; Neotoma was designed to study species range shifts during the Quaternary glacial–interglacial cycles across multiple taxonomic groups10,36,37,38; and the Geobiodiversity Database (GBDB) was designed to support high-resolution stratigraphic data by linking fossil occurrences to detailed geological sections39.

All these databases continue to grow in scope and incorporate new kinds of data. As new questions emerge and data continue to diversify and increase in accessibility25,40, the range of scientific applications of these second-generation databases far surpasses their original scope, yielding input for thousands of scientific studies (see ‘Database use’ in Supplementary Data). For example, PBDB occurrence data have been used for climatic modelling41,42, landscape evolution43 and palaeogeographic models24. Similarly, NOW data have been used to study macroevolutionary expansion44 and Neotoma data for reconstructing past climates45, constraining past land cover dynamics and the terrestrial carbon cycle46, and documenting cross-continental species invasions47. The scientific utility and applications of these databases thus continue to grow and diversify, as do the databases themselves.

The near future—from databases to third-generation data systems

Palaeontology is poised for its next transformative phase, in which second-generation databases will better integrate with each other and with other components of the palaeontological, Earth and life sciences data infrastructures, to address more integrative, cross-disciplinary and multiscalar questions. The transition to third-generation data systems has already begun, with cross-database integration a core focus of backend development, using, for example, the linking capabilities provided by new data structures such as LinkML (https://linkml.io/). Other efforts are focusing on improved efficiency and more sustainable codebases through modular design48,49,50. The development of integrative platforms, such as Deep-time Digital Earth51, and the continued growth of existing databases to support new data types, such as ancient environmental DNA52, are significant steps towards third-generation databases.

These efforts towards integration and efficiency will enable new scientific questions to be addressed with increasing power. For example, in the Big Questions in Paleontology Project53, representative questions include ‘How do external environmental drivers (for example, plate tectonics, global temperature and sea level) influence the structure of biological systems at different spatiotemporal scales?’, ‘How does the prevailing climate state experienced by species and communities influence their response to perturbation?’ and ‘To what extent are the phases of events (for example, collapse and recovery) during extinctions consistent across different biotic crises?’ Addressing these integrative questions requires scalable, connected data that capture, for example, phenotypic variation among individuals in a population, in conjunction with high-stratigraphic-resolution, palaeoenvironmental and specimen-level information. These scientific needs demand further advances in how palaeontological data are reported, structured, integrated, managed and sustained. Cross-institutional aggregation of museum specimen information into iDigBio54 and the Global Biodiversity Information Facility (GBIF55) provides a model of how biodiversity databases can grow and be enhanced by ever-improving biodiversity data standards, such as the Darwin Core56 and ABCDEFG57, featuring a stronger focus on available metadata58.

With the growth of these databases has come an awareness of their importance and impact beyond simply answering scientific research questions. Careers of an entire generation of scientists are now influenced by publicly accessible, interoperable data, and access to international, high-quality data has led to the rise of quantitative subdisciplines in palaeobiology59,60. Similarly, in allied fields such as geochemistry, the advent of open scientific databases PetDB and GEOROC has enabled the rise of statistical geochemistry61. At the same time, new concerns have arisen about whether these databases encode and perpetuate past and present inequities, such as parachute science62, and how best to reduce these inequities to truly fulfil the deeper mission of these databases to ensure democratized data access for all61,62,63,64,65. To address these issues, the concept of community governance of palaeontological databases must be further broadened to include additional voices and to develop more effective, context-sensitive strategies that address issues of access, reciprocal research and data equity51,53,66,67,68.

Landscape survey of the current state and valuation of palaeodata

Survey and assessment of palaeodata

An online survey of available palaeontological and Earth science databases was conducted using search terms in multiple languages to identify ‘community-run’ databases (Supplementary Table 4). Community-run databases were required not to be affiliated with any state governing body, including state-funded museums or geological surveys, and were considered ‘open access’ if they make their data freely available for general use. The start of each database’s period of activity was taken as its first publication in the peer-reviewed scientific literature, and its endpoint as the last update to the web service, data repository and/or the latest published article. Among the 171 palaeontological and Earth systems databases identified, 118 were open access and community-run, and their extinction rate, origination rate and diversity dynamics were assessed (Fig. 2).

Fig. 2: Diversity dynamics of 118 community-developed palaeontological databases from the 1970s to 2024.

a, The range-through richness of databases by year. b, The origination rate of databases through time, indicating areas of peak activity for novel database development between 1995 and 2005. c, Diversity of databases as a function of years active (that is, database survivorship) showing the loss of >80% of database diversity by 10 years of activity. d, The rolling mean per-capita extinction rate of databases as a function of years active since inception, with peaks at 5, 15 and 25 years of activity.
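The metrics in Fig. 2 can be computed from nothing more than each database’s first and last year of activity. The Python sketch below is a minimal illustration, assuming invented activity spans rather than the surveyed dataset; databases still active at the survey date (right-censored) are counted as survivors:

```python
# Invented (first_year, last_year) activity spans, for illustration only
spans = [(1995, 2001), (1998, 2024), (2003, 2008), (2010, 2024)]

def range_through_richness(spans, year):
    """Databases counted as active in every year between their first
    and last recorded activity (the 'range-through' assumption)."""
    return sum(first <= year <= last for first, last in spans)

def survivorship(spans, age):
    """Fraction of databases still active `age` years after inception;
    spans reaching the survey year are treated as still surviving."""
    return sum((last - first) >= age for first, last in spans) / len(spans)
```

Per-year origination and extinction rates follow the same pattern, dividing counts of originating or terminating spans by the standing richness in each year.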


We assessed the replacement value of the data stored in three databases: PBDB, GBDB and Neotoma. Following Thomer et al.69, we conservatively estimated a value of US$3,000 per collection (Methods). This valuation is clearly an underestimate: it does not cover all costs, and it assumes that the sample localities are still accessible and have not been destroyed by human land use or by natural processes such as earthquakes. Data from inaccessible localities are therefore irreplaceable and priceless70,71,72. Our estimates exclude research effort, storage, maintenance, curation and expertise, as well as the labour, server hosting and infrastructure development that go into setting up and sustaining databases. They also exclude the article processing costs of publishing a scientific paper (mean US$2,300)73 and the value of the papers published (estimated at over US$5,000 per item)74.
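The order of magnitude of these valuations follows from simple arithmetic on the per-collection figure. The sketch below (Python) applies the US$3,000 lower bound; the collection count used in the comment is illustrative, not any database’s actual holdings:

```python
COST_PER_COLLECTION_USD = 3_000  # conservative per-collection estimate (after Thomer et al.)

def replacement_value_usd(n_collections: int) -> int:
    """Lower-bound field-recollection cost only; excludes the curation,
    storage, labour, infrastructure and publication costs noted above."""
    return n_collections * COST_PER_COLLECTION_USD

# e.g. an illustrative 500,000 collections already implies US$1.5 billion
```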

Historical trends in results

Based on our web search, database origination rates peaked in the 1970s and 1990s, with a tertiary peak in the 2010s (Fig. 2). Nearly 50% of databases (n = 118) became inactive within just 5 years, fewer than 15% survived a full decade, and only 5% remained active for over 15 years (Fig. 2). This 5-year interval coincides with the standard competitive funding cycles of many large research grants from wealthy international unions or countries with a high gross domestic product, such as those awarded by the European Research Council or the US National Science Foundation, respectively. This means that, after 5 years, up to 65% of value-added data effort, representing years of data aggregation, data harmonization and cleaning, technical development and scientific labour, is left unmaintained and sometimes inaccessible.

For example, recent attacks on the cyber infrastructure of the Museum für Naturkunde Berlin resulted in the loss of community access to the NSB. The NSB was intermittently funded (16 funded years since 1990) and maintained by an individual expert (see ‘Curator review’ in Supplementary Data). The NSB held hundreds of thousands of records of marine plankton microfossil species, used to research marine community responses to climate change, and has been a data contributor to other databases, including BioDeepTime75, Microtax76, GBDB39 and Triton77, among others. The attack not only impacted a key resource for microfossil taxonomists, evolutionary (palaeo)biologists and palaeoceanographers, but also compromised the data provenance of the dependent databases. The lack of funding and dedicated technical support resulted in insufficient failsafes at the museum. Instead, through community activity, external versions of the NSB (for example, those hosted on Zenodo78) are contributing to database recovery, further highlighting the value of community contributions in sustaining data resources.

Although some database development efforts are intended for short-term use and do not assume database longevity, the loss of these databases is not just a scientific concern; it also represents a substantial economic waste (Fig. 2 and Table 1). The cost of allowing valuable data infrastructure to degrade is not conceptual but quantifiable and substantial. The best-case scenario for at-risk databases, illustrated by the second strategy described below, involves integration with larger data systems; an example is the ongoing incorporation of 34 constituent databases into Neotoma. The data protected and expanded by Neotoma were recently estimated to cost over US$1.5 billion to replace69. Despite this valuation and proven utility to the community59,68,79, even long-lasting success stories such as Neotoma are at risk owing to their reliance on grant-based funding. Consequently, sustainable data infrastructure requires treating data contribution not as an obligation but as a scholarly practice, and databases should be thought of not as products but as commons sustained by collective stewardship. Volunteer labour in data contribution and backend development is often invisible and rarely credited, yet it is what has kept these long-lived databases active.

Table 1 The value of the samples and collections (sites) stored within three active palaeo databases in US dollars

Databases achieving longevity exceeding 20 years tended to use one or more of three distinct strategies. First, some databases have relied on dedicated volunteer maintenance by one or two individuals with free or inexpensive institutional hosting support (for example, NSB). This solution can extend database longevity and sustainability through ties to the careers of individuals but faces challenges when those individuals retire or shift positions. Second, some databases have enhanced their sustainability and achieved economies of scale by joining together and integrating into larger cooperative structures. For example, Neotoma was first formed as the union of FAUNMAP, the Latin American Pollen Database and other continental-scale databases, and new Constituent Databases continue to form and join Neotoma to leverage its data model and services10,37. Third, some databases rely on direct community-driven data contribution: PBDB has grown through primary data uploads from hundreds of volunteer contributors (see ‘Curator review’ in Supplementary Data). Both the cooperative-database model and the direct-volunteer model leverage international research communities to build and grow their databases, and all three strategies also rely on competitive grant funding to sustain and develop their data infrastructure. This community effort was further supported by the introduction of novel funding systems, such as the US National Science Foundation (NSF) Geoinformatics programme, which shifted its support for scientific databases from a traditional 3-year model to a development-dependent model. This new model includes a 3-year ramp-up stage for new resources, a primary database support stage lasting up to 10 years (divided into 3- to 4-year competitive awards) and a 3-year ramp-down stage.
Database longevity is thus linked to both sustained community investment in volunteered time, experience and data contributions and to new funding models that support sustained, community-led database growth.

Towards the third generation of palaeontological databases

We present here a series of actionable recommendations to address the existing structural and community challenges within the palaeontological and Earth science data landscape (Table 2). To address data fragmentation and structural redundancy in databasing effort, the immediate priority is to maximize the value of existing services while laying the groundwork for long-term solutions.

Table 2 A roadmap to sustainable funding

Modular, interoperable and community-led data systems

The scientific community and governing bodies (for example, funders) must move away from the current trend of creating standalone databases that are not interoperable and either too small or too disconnected from their communities to achieve longer-term sustainability. Instead, they must design for integration and community engagement, to break the cycle of effort and loss. While broader challenges around data infrastructure are often shaped by political and institutional forces beyond the control of individual researchers, the scientific community can take meaningful action through improved data practices64,80,81,82. Examples such as Neotoma, NOW and PBDB, which have remained active for over 15 years and continue to serve global communities across disciplines, demonstrate the efficacy and resilience of collaborative stewardship9,37,83. Notable features of long-lived databases include international collaboration in data stewardship, critical community contributions by way of volunteered data, and efficient data ingestion (see ‘Historical trends in results’ section; see also ‘Curator review’ in Supplementary Data). However, databases and related resources such as the Biodiversity Heritage Library (https://www.biodiversitylibrary.org/) remain vulnerable to ‘extinction’, such as through cyberattacks, but more commonly due to funding termination.

By prioritizing interoperability, modularity and close engagement between databases and their supporting communities, we can build a resilient and pluralistic community of data systems that safeguard multiple dimensions of scientific data and ensure its continued relevance to scientists, external stakeholders and the general public66,84. To this end, we recommend the transition to a decentralized modular data network (Fig. 3), where core components, such as those responsible for taxonomy, stratigraphy and specimen provenance, are built with a flexible scope and in a way that minimizes duplicative curatorial effort. In this vision, each part of the scientific community would be responsible for curating a specific area of data and knowledge (for example, age models and time inferences; stratigraphy and lithology; taxonomies and phylogenies; organismal abundance and occurrence; fossil morphologies; and ecological traits), and these modular, interconnected systems would integrate data and knowledge across these domains. This system would function by transitioning from the fragmented and uncoordinated data landscape (Fig. 3a) to pooled, pluralistic frameworks66 (Fig. 3b). Pluralistic approaches to data pooling maintain domain independence and flexibility, permitting field-specific misalignment (for example, the unit differences in terminology and grouping seen between core-based micropalaeontology and global macrofossil biogeography in terms of spatial and temporal binning; Fig. 3b). Modules within this system serve as interlocking elements, offering researchers the foundation to develop extension structures necessary for addressing novel scientific questions within a broader, connected data landscape (Fig. 3b). For example, to answer questions about fossil biotic interactions (for example, BITE85), a new data structure is required, developed specifically to tie one or more biotic interactions and the organisms to a rock specimen. 
This novel database element would then be integrated with pre-existing core elements such as taxonomy, stratigraphy and geography (Fig. 3b), meaning that the only new element to be constructed is the one that explicitly captures biotic interaction data. This approach saves time on database construction, reduces duplicative effort, ensures interoperability and safeguards against the loss of data from novel databases. In this way, the data from ‘extinct’ databases can be conserved and reintegrated, either by adding them to existing core modules or by creating new modules. The suggested solution mirrors the shift by many corporations from large monolithic applications to interconnected microservices to meet the demands of scalability and fast development cycles84,86, and it is particularly suited to scientific research, which is globally distributed in nature48,49,50,80.
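Such an extension module can be sketched as a record type that stores only its own domain data and resolves everything else through the shared core modules. In the Python sketch below, every identifier scheme, registry and field name is a hypothetical placeholder, not any database’s actual schema:

```python
from dataclasses import dataclass

# Hypothetical registries maintained by shared core modules
CORE_TAXONOMY = {"txn:0001", "txn:0002"}   # taxonomy module
CORE_SECTIONS = {"sec:A-17"}               # stratigraphy module

@dataclass(frozen=True)
class BioticInteraction:
    """Extension record: stores only interaction-specific data and
    references core modules by identifier instead of duplicating them."""
    kind: str                # e.g. 'predation', 'drilling'
    subject_taxon_id: str    # resolved by the taxonomy core module
    object_taxon_id: str
    section_id: str          # resolved by the stratigraphy core module
    specimen_igsn: str       # provenance link to the physical specimen

def is_valid(rec: BioticInteraction) -> bool:
    """A record is valid only if every cross-module reference
    resolves in the relevant core registry."""
    return (rec.subject_taxon_id in CORE_TAXONOMY
            and rec.object_taxon_id in CORE_TAXONOMY
            and rec.section_id in CORE_SECTIONS)
```

The design choice is that the extension never re-implements taxonomy or stratigraphy; invalid references are rejected at ingestion, which is what keeps curatorial effort in one place per domain.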

Fig. 3

Graphical representation of the current database landscape and a possible idealized structure for palaeobiological data systems. a, The current data landscape consists of disparate databases of varying size, scope and resolutions that are connected through limited links. This means that core elements (such as fossil localities) each have independent and therefore repeated solutions for each database. b, An idealized database system consists of a decentralized network of interconnected independent subsystems (nodes) that benefit from collective standards and potential sharing of financial and technical support. Core data elements, such as fossil localities, are accessible as a standardized module that each database uses when developing domain-specific data structures (for example, for phylogenetic matrices or stratigraphy). A central support system, without compromising data sovereignty, can be put in place to decrease funding volatility and ensure network-level standards and integrity. The maintainers of individual databases can then apply for specific external grants and technical support to develop tailored data solutions to address novel scientific questions. Credit: Science Graphic Design.

To realize their full potential, third-generation databases must, whenever possible, maintain direct links to physical specimens and samples (for example, through International Generic Sample Numbers; https://ev.igsn.org/), originators and users (for example, persistent identifiers through ORCID; https://orcid.org/), and usage (for example, DataCite for DOI mining; https://datacite.org/), while also improving linkages to other databases (see ‘The near future—from databases to third-generation data systems’ section). Museums, research institutes and public collections are a foundation of this system, providing crucial metadata that tie scientific conclusions based on digital data to the primary physical evidence68,87,88. Strengthening links between physical specimens and their digital representations will ensure long-term data accessibility, foster interdisciplinary research and empower the next generation of large-scale palaeontological analyses by upholding scientific transparency and rigour48,65,89.

Developing application programming interfaces (APIs)—which enable one software program to request services or data from another without needing to know the other system’s internal workings and that adhere to open science standards19,90,91,92—is crucial for ensuring seamless exchange of information between data systems, regardless of their underlying technologies. In addition, data harmonization tools93 can streamline data integration and scientific workflows by automatically reconciling differences in data formats, units of measurement and terminologies. For example, the FOSSILPOL workflow94 (hope-uib-bio.github.io/FOSSILPOL-website/en/index.html) pulls data from Neotoma, harmonizes the age-depth models and builds harmonized taxonomic name lists. These workflows create opportunities to distribute effort, allowing scientists beyond the database leaders and curators to add value while establishing strong provenance between these downstream research analyses and the databases. Furthermore, such tools can leverage artificial-intelligence and machine-learning algorithms to help data stewards identify and merge duplicate records, standardize taxonomic names and align stratigraphic information, reducing the manual effort required for data integration. Because of the complexity of fossil data and the implicit knowledge often embedded in palaeontological datasets, we recommend that analytical and curatorial workflows use human-in-the-loop approaches rather than fully automated systems, to avoid ‘garbage in, garbage out’ situations and ensure accuracy and reliability26.
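A human-in-the-loop harmonization step can be as simple as routing uncertain matches to a curator queue instead of merging them automatically. In the Python sketch below, the standard-library `difflib` stands in for more sophisticated fuzzy matching, and the accepted-name list and thresholds are purely illustrative:

```python
from difflib import SequenceMatcher

# Illustrative list of accepted taxon names (not a real authority file)
ACCEPTED = ["Picea abies", "Pinus sylvestris", "Betula pendula"]

def match_taxon(name, accepted=ACCEPTED, auto=0.995, review=0.85):
    """Return (status, best_match): only exact or near-exact names are
    mapped automatically; plausible matches are queued for a curator."""
    def score(candidate):
        return SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
    best = max(accepted, key=score)
    s = score(best)
    if s >= auto:
        return ("accept", best)
    if s >= review:
        return ("review", best)   # human-in-the-loop: a curator decides
    return ("reject", None)
```

For example, a transposition such as ‘Picea abeis’ lands in the review queue rather than being silently merged, which is exactly the ‘garbage in, garbage out’ failure mode the main text warns against.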

Financial support

Using palaeontological data as an example, we propose a path forward for sustainable development, funding and stewardship to safeguard community-built scientific data systems for future generations. While we focus here on open digital resources for the democratization of science, investment in such resources must be accompanied by stronger linkages to, and explicit support for, museums and physical repositories64,65,95,96,97.

Long-lived databases have been developed and maintained through a combination of sporadic funding, international support and unfunded volunteer/service work61,69,98. The persistence of these databases through all this financial precarity is a testament to their importance and the work of many scientists to keep them going. Investing in sustainable, modular data infrastructure not only enhances the longevity, accessibility and utility of scientific data, but also protects the immense financial and intellectual investment already made61,98. Funding is essential for ensuring that community-curated data continue to inform cutting-edge science well into the future.

Besides optimizing the use of already acquired funding, long-term sustainability hinges on moving beyond short-term, project-driven funding models (Tables 2 and 3), such as the NSF Geoinformatics programme. Advocating for policy support at institutional, national and international levels is required to create an environment in which these systems can thrive97. Network-level integration provides a means to ensure continued relevance, usability and return on investment beyond the end of a research project’s funding cycle69,98.

Table 3 Recommendations for the sustainable development of community-developed data resources and the related benefits derived from their implementation

Engaging policymakers and funding agencies in discussions about the importance of Earth science and palaeontological community data networks can help to secure the necessary support and resources (for example, the USA’s Geosciences Congressional Visits Days). Core infrastructure funding, akin to utilities for the scientific community, should be secured through national and international bodies, ensuring that databases are treated as essential research infrastructure (for example, the Australian Research Data Commons (https://ardc.edu.au/) and the National Natural Science Foundation of China’s Key Basic Research Infrastructure programme (nsfc.gov.cn/english/site_1/funding/E1/2024/06-12/364.html), which supports geo-data infrastructure). Within our proposed funding roadmap (Table 2 and Fig. 3), we recommend demonstrating the economic, societal and scientific value of open data through public–private partnerships and cost–benefit analyses, approaches already proven effective in initiatives such as Ozboneviz97. Ultimately, we recommend the establishment of a dedicated international non-profit organization, akin to the European Organization for Nuclear Research (CERN) or the Global Biodiversity Information Facility (GBIF), which would advance the financial sustainability of the life and Earth science data landscape.

These two organizations, among others, provide a useful template for palaeontology and Earth system science more broadly. The success of GBIF, for example, lies in its clear governance, strategic coordination and stable funding model—elements that palaeontological and Earth science databases have yet to fully achieve (Fig. 3). GBIF operates as a community-governed, multinational consortium supported by member countries, each contributing financially and through data provision, underpinned by a robust strategic framework that ensures long-term stability, interoperability and open access55,99,100. Its structure—from local and national nodes to global coordination—promotes accountability and sustained collaboration, while its standards (for example, Darwin Core56) have become a foundation for data integration across the life sciences. The proposed framework for palaeontology (Table 2 and Fig. 3) builds upon this foundation while acknowledging that pluralist approaches to establishing and formalizing collaboration are essential. Stable financial support is needed to bridge the domain-specific knowledge and data structures required to hold information from the multiple subdisciplines of palaeontology (for example, palaeobiology, biostratigraphy, ichnology and palaeobotany) and related Earth sciences (for example, sedimentology, geodynamics, geochemistry and climatology). This inclusivity ensures that the framework not only supports technical interoperability but also fosters equitable participation and long-term sustainability across the full spectrum of palaeontological and Earth science research (Fig. 3). For example, a recent review of geochemical databases identified similar trends in data lifecycles, while noting that geochemical data require tailor-made data structures to host and develop them61.
If grounded in an understanding of these data requirements, and led by those who create and use the data, a GBIF-like framework (that is, formalized international partnerships, strategic cooperative leadership, modular infrastructure and clear attribution systems) can secure sustainable data management, enhance interoperability and ensure the long-term preservation and growth of palaeontology and Earth science’s collective digital resources.

Community governance and goals

We propose a phased, community-guided transition towards a sustainable, interconnected and explicitly modular data infrastructure—one that is grounded in equitable practice and ensures proper attribution40,64,91,101,102,103. As artificial intelligence, large-scale web scraping and automated data aggregation become increasingly common tools, the palaeontological community must actively shape how its openly accessible data are structured, cited and reused25,40,98,102. A modular and well-governed framework will allow us to respond nimbly to these technological developments while preserving the integrity and provenance of our data. Central to this vision is strong, inclusive community governance—led by the researchers, data stewards and institutions who know the data and the needs of researchers best66. By harmonizing efforts and redistributing responsibilities through open consultation, we can build an equitable and future-ready infrastructure that supports both innovation and accountability in palaeoscience.

Promising steps are already underway. Initiatives such as the ARC Centre of Excellence for Australian Biodiversity and Heritage (CABAH; epicaustralia.org.au) exemplify how community-led, transdisciplinary frameworks can successfully balance Indigenous knowledge systems, biodiversity and palaeodiversity data, and open infrastructure. In 2023, CABAH produced 127 journal articles and welcomed over 60,000 attendees to its public programmes and events. CABAH’s approach is collaborative, bringing together researchers, Indigenous communities, industry and policy partners. This momentum is furthered by ensuring that decisions around standards, attribution and data validation are made through inclusive consultation with a broad cross-section of the community, including historically underrepresented groups and the global majority.

Community buy-in for data attribution and validation will facilitate community trust in open data resources90,91,92. True integration goes beyond technical aspects and requires active collaboration between scientists and technical experts from varied disciplines (Table 3 and Fig. 3). Establishing interdisciplinary data standards, training programmes, research teams and projects can facilitate this collaboration (Table 2). Through this effort, we can develop common research frameworks and questions to guide data integration efforts, aligning the objectives of different disciplines63.

Conclusions

Palaeontological data systems are critical resources for the advancement of Earth system research and the training and development of Earth scientists. By committing to the development and maintenance of decentralized, interconnected, modular data systems, we can address pressing questions about the history of life on Earth, ensure the longevity of our shared resources and create a more interconnected scientific community. This effort is already underway, building on the success of first- and second-generation data systems that have advanced our understanding and technical capacity. Developing integrated support systems will protect, sustain and enhance these valuable community-driven data resources. Together, these recommendations align structural reforms with scientific needs and community values. The path forward requires a collective effort, sustained funding and a commitment to collaboration, ensuring that palaeontological data remain valuable resources for future generations.

Methods

Database survey

To assess the temporal dynamics and sustainability of palaeontological databases, we systematically searched Web of Science and Google Scholar between November 2024 and March 2025. Search terms combined multilingual instances of ‘database’ (Supplementary Table 4) with ‘palaeontology’, ‘geology’, ‘fossil’ and ‘Earth science’. Web of Science searches were restricted to the ‘Physical’, ‘Chemical & Earth Sciences’ and ‘Life Sciences’ categories. Search languages were selected on the basis of their distribution as official or co-official languages: English, Spanish, Arabic and French, following country counts from the South Australian Government104. Additional major languages (for example, Mandarin) were searched for up to five pages, with searches terminated when no new non-governmental databases were identified.

We inspected the first 10 result pages per aggregator (100 results for Google Scholar; 250 for Web of Science). Each result was examined to distinguish presentations of new databases from studies citing existing databases. Results were recorded following standardized definitions (Supplementary Table 5).

Temporal and funding data

Inception or start date (Supplementary Table 5) was defined as the year a database became publicly available, determined by the earliest of associated publication date or website launch. End date (last reference/update; Supplementary Table 5) was the most recent documented update, identified hierarchically from: (1) database website update information, (2) publications documenting database state, (3) most recent citation in scientific literature or (4) last confirmed year of public accessibility. Databases with identical start and end dates were included in diversity metrics but excluded from longevity and extinction analyses as they represent point occurrences rather than temporal spans. Funding information was recorded from database websites or associated publications when available. Records lacking start or end dates were omitted from analyses.
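The hierarchical end-date rule and the exclusion of point occurrences can be sketched as follows. This is an illustrative re-statement of the rule in code, not the script used in the study, and the field names are hypothetical.

```python
# Priority order for documenting a database's end date, per the hierarchy
# described above: (1) website update info, (2) publications documenting
# database state, (3) most recent citation, (4) last confirmed accessibility.
END_DATE_SOURCES = ["website_update", "state_publication",
                    "last_citation", "last_accessible"]

def end_date(record):
    """Return the highest-priority documented end date, or None if undated."""
    for field in END_DATE_SOURCES:
        year = record.get(field)
        if year is not None:
            return year
    return None

def has_temporal_range(record):
    """True only for databases spanning more than a point occurrence.

    Same-year databases (start == end) are kept for diversity metrics but
    excluded from longevity and extinction analyses.
    """
    end = end_date(record)
    return (end is not None and record.get("start") is not None
            and end > record["start"])
```

For example, a database first published in 2005 whose only documented trace is a 2012 citation receives an end date of 2012 and qualifies for the longevity analyses.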

Database analysis

We identified 171 palaeontological and Earth science databases. After removing governmental databases (see rationale in Supplementary Table 5), 125 remained, of which 118 met the inclusion criteria (see ‘Richness’ in Supplementary Data). Summary statistics on database duration excluded same-year databases (n = 30), yielding 88 databases with temporal ranges (Supplementary Tables 6 and 7). An additional analysis excluded the top 15% longest-lived databases (those active for approximately 25 years or more; Supplementary Table 6) so that the diversity dynamics examined would be representative of typical community-maintained databases, because the number and diversity of databases through time, and changes therein, affect the data landscape and its stability (Fig. 2).

The per-capita extinction and origination rates were analysed using a rolling mean of year-to-year database activity, while the sampled-in-bin diversity used an extended decadal time series to account for boundary conditions.

To mitigate edge effects, we extended end dates of active databases to 2027 and truncated analyses at 2024. This approach addresses the pull-of-the-recent bias affecting terminal rates. The time series start was extended to the 1970s to include early static datasets that predate digital database proliferation.

Analyses were conducted in R (4.5.0)105,106 using the divDyn package107. To address the question of database diversity dynamics, we calculated:

  • Richness: total number of databases active within a time bin (divSIB);

  • Diversity by duration: distribution of databases by years active (divRT);

  • Origination rate: the rate at which new databases are established per year (2-year rolling mean; oriPC);

  • Extinction rate: the rate at which databases cease activity per year (2-year rolling mean; extPC).

Rolling means used a 2-year window to smooth interannual variation. All raw data, including point occurrences (same-year databases), were included in rolling mean calculations to capture complete database origination dynamics. The following metrics were considered in both raw and rolling mean treatments for origination, extinction and diversity used in Fig. 2: sampled-in-bin diversity, range-through diversity, per-capita extinction and per-capita origination. A full list of 12 metrics (Supplementary Table 8) was calculated.
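As an illustration, the sampled-in-bin richness, the per-capita rates (Foote’s method, as implemented in divDyn) and the 2-year rolling mean can be sketched in Python. The published analysis used the divDyn R package; this simplified re-implementation treats each database as a (start year, end year) range, with years as time bins.

```python
import math

def sib_diversity(ranges, year):
    """Sampled-in-bin richness: number of databases active in a given year."""
    return sum(start <= year <= end for start, end in ranges)

def per_capita_rates(ranges, year):
    """Foote's per-capita origination and extinction rates for one year bin.

    Through-rangers cross both bin boundaries; bottom-only rangers exit in
    the bin; top-only rangers originate in it. Singletons (start == end)
    contribute to richness but not to the rates.
    """
    n_through = sum(s < year < e for s, e in ranges)
    n_bottom_only = sum(s < year and e == year for s, e in ranges)
    n_top_only = sum(s == year and e > year for s, e in ranges)
    if n_through == 0:
        return float("nan"), float("nan")
    ori = -math.log(n_through / (n_through + n_top_only))
    ext = -math.log(n_through / (n_through + n_bottom_only))
    return ori, ext

def rolling_mean(values, window=2):
    """Rolling mean used to smooth interannual variation (2-year window)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Toy example: four databases, one of which is a same-year point occurrence.
ranges = [(2000, 2005), (2000, 2003), (2001, 2005), (2003, 2003)]
ori, ext = per_capita_rates(ranges, 2003)
```

In this toy example, all four databases are active in 2003, one database ceases that year (giving a non-zero extinction rate) and no surviving database originates that year (giving an origination rate of zero).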

Author survey on database maintainers, curators and data contributors

The authors of this paper, who were also database maintainers and/or developers, volunteered information about the backend, data volume and support structures (Supplementary Table 2). The authors presented data across 68 categories, including ‘History and funding management’ (7 categories), ‘Scope’ (3 categories), ‘Software and maintenance’ (16 categories), ‘Data contained’ (5 categories), and ‘Entity feature’ coverage (37 categories; see ‘Curator review’ in Supplementary Data). These descriptions informed benefits and recommendations (Tables 2 and 3) and present a clear synthesis of the variability in database structure and maintenance. The provided database ages were incorporated into the database survey (Fig. 2), in addition to funding and technical support summaries.

The citation count for each database was also requested from the database maintainers. PBDB was selected for full consideration, while Neotoma, the Geobiology Database, Triton, Neptune and NOW are included for completeness (see ‘Palaeodatabase publication products’ in Supplementary Data). Each of the published papers (>1,800) that cited PBDB as a data contributor was tagged using 15 categories (palaeobiogeography, diversity, taxonomy, morphology, phylogeny, palaeoecology, environment, taphonomy, palaeoclimate, conservation, geochemistry, sedimentology, stratigraphy, evolution and other) to capture the diversity of topics for which PBDB data are used. Owing to citation practices, the number of formal citations gained by PBDB is notably lower than the number of papers citing or ‘mentioning’ the database returned by aggregators (>34,000 from Google Scholar, February 2025).

Methods for financial valuation

We applied a financial valuation framework69 to the data volumes provided either by the database maintainers (see ‘Curator review’ in Supplementary Data) or taken from the most recent version of each database as of June 2025. The valuation rationale centres on the cost of replacing the data, assuming that only labour, expertise and institutional overhead are required69. It also assumes that the data can be collected again, that the sites are still accessible and that specimens of equal quality can be obtained; within palaeontology and the Earth sciences, this is often not the case. We elected to focus on only two of the options: sample value (US$150) and site value (US$3,000).

Additional costs for publication, data hosting, hiring database maintainers and developers, and curatorial labour were not included in the valuation (and were also not part of the valuation formula of ref. 69).
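Under these assumptions, the valuation reduces to simple replacement-cost arithmetic using the two unit values adopted above; the counts in this sketch are illustrative only, not those of any real database.

```python
# Unit replacement values adopted in the analysis (US$).
SAMPLE_VALUE_USD = 150
SITE_VALUE_USD = 3_000

def replacement_cost(n_samples, n_sites):
    """Labour-only replacement cost of a database's holdings, in US$."""
    return n_samples * SAMPLE_VALUE_USD + n_sites * SITE_VALUE_USD

# A hypothetical database holding 10,000 samples from 500 sites:
cost = replacement_cost(10_000, 500)  # 1,500,000 + 1,500,000 = US$3,000,000
```

Even this deliberately conservative arithmetic makes clear how quickly the replacement value of a community database exceeds typical project-grant budgets.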

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.