Background & Summary

Olfaction is critical to the enjoyment of food, the avoidance of danger, emotional memory, and social interaction. Yet 150 years after Helmholtz and Young first developed a working theory of color vision1, and 100 years after Alexander Graham Bell asked, “Can you measure the difference between one kind of smell and another?”2 we do not yet have a theory that relates chemical features of molecules to neural responses or perception. The design and interpretation of future olfactory neuroscience experiments would greatly benefit from a better understanding of olfactory psychophysics. What is the relationship between the physical properties of the stimulus and the percept it evokes? How do the perceptual properties of monomolecular odorants blend when mixed? What is the size, shape, and structure of odor space? What neural features represent olfactory perceptual properties? There are many datasets and models probing these questions, and yet these are still all open questions in olfaction. By curating and aligning datasets, Pyrfume enables a wide range of specific hypotheses about olfactory perception to be tested against a variety of experimental data. By illustrating the strengths, weaknesses, and blind spots in past research, the results of these tests can motivate, guide, and constrain the next generation of experimental and theoretical investigations into the olfactory system.

The tremendous success of data-driven approaches for problems in visual coding and scene analysis has been propelled by the wide availability and accessibility of imaging data, as well as the communal adoption of key datasets for testing and benchmarking3,4. In the computer vision community, for example, the MNIST5 and ImageNet6 datasets are understood to be essential proving grounds for any newly proposed algorithm or coding principle. Additionally, new visual coding theories can be quickly tested and prototyped across a large number of heterogeneous datasets that expose them to different contexts and edge cases, making models more robust through out-of-sample testing7. Large datasets have likewise been curated to advance machine learning algorithms in the molecular domain8,9,10. Here, we describe a newly curated collection of >40 olfactory datasets and a new suite of data fetching, management, and curation tools that we believe can create a set of benchmarks for olfactory theories and models and help stimulate a new era of data-driven inquiry in the olfaction community.

In the remainder of this paper, we introduce Pyrfume, an integrated data archive which aims to accelerate inquiry in olfactory science. While the importance of data aggregation in olfaction has been recognized by others, and there have been other efforts on this front, Pyrfume is notable for its breadth and coverage, spanning >40 odorant-linked datasets in mammalian olfaction including human psychophysics and perception, as well as animal psychophysics, behavior, brain imaging, physiology, and pharmacology. Altogether, it contains information about more than 20,000 identified odorants.

There are many archives, papers, and search engines with data that are useful to olfactory scientists, but it has been difficult to coordinate structured queries across these, and each alone has limitations pertaining to size, coverage, or accessibility. PubChem11, for example, has detailed chemical information for over 10 M molecules, but has very little olfactory data. The well-studied Dravnieks atlas12 data set has molecules whose perceptual qualities are described with a structured vocabulary amenable to machine learning, but contains data for only 138 unique molecules. The National Geographic Smell Survey13 has odor intensity ratings collected from an impressive 1.4 million people, but only for a total of six odorants. Even for investigators with the motivation and technical acumen to integrate olfactory data from different sources, there are still the additional challenges of scraping and cleaning datasets that are effectively siloed in separate repositories, and which employ idiosyncratic formats for organizing data.

Pyrfume aims to overcome these limitations, and is premised on the simple ideas that: 1) most olfaction experiments are straightforward to index (on the unique identifier (ID) of an olfactory stimulus, e.g. molecules, substances, or mixtures at a given concentration), and 2) any olfactory experiment can be generically described as a machine-readable pairing of such stimulus IDs, the task performed with the corresponding stimuli, and the observed individuals and their behaviors. Linking these experimental components allows for a robust data-formatting standard which is flexible enough to accommodate a wide array of experimental designs and data types, and which conforms to principles of good database design. Note that behavior, as used here and throughout the paper, refers to any experimental measurement. These could be human perceptual ratings applied to a given chemical, glomerular responses observed in mouse physiology experiments, or measured sensitivities of receptors in pharmacology experiments, etc. (Fig. 1).

Fig. 1

Overview of the Pyrfume ecosystem. Under the Pyrfume standard, data are always linked to the odor stimuli comprising a given experiment. The features, or experimental measurements, will depend on the particular experiment, but could include data such as amplitudes of glomerular calcium transients, vectorized perceptual descriptors, receptor sensitivities, etc. Any given data archive on Pyrfume (>40 to date) can be easily fetched by using the Pyrfume Python and R packages or downloaded directly from GitHub as a zip file. The orange rows of the odorant x feature matrices indicate odorant molecules, identified with CIDs, that are common across the experiments. The ability to easily extract data for a given molecule across experimental modalities and model systems is a unique strength of Pyrfume. See Supplementary Figures S1, S2 for examples of this.

Methods

Data collection

To overcome the limitations of existing resources in size, coverage, and accessibility, the Pyrfume repository was created to integrate various sources of olfactory data and format them around a subject-object distinction. Data were derived from online databases and several academic datasets. Table 1 provides a sample of the data currently available through Pyrfume.

Table 1 Sample of data inventory currently available in Pyrfume.

Reformatting and organization of collected data

Each data source curated in Pyrfume is standardized to conform to a subject/object design framework, separating the odor objects from the behavior of the subject(s) under study. At the object level, the most essential file, called stimuli.csv, is indexed on a stimulus ID and maps this ID to the chemical/molecular details of the odorants used. A stimulus could represent a single molecule, substance, or mixture, the applied concentration(s), and potentially other experimental conditions. In (typical) cases where at least one stimulus is a single molecule with known structure, the Pyrfume archive will also contain a file, molecules.csv, that lists all molecules used in that dataset, with columns providing PubChem Compound IDs (CIDs), SMILES14, common names, and IUPAC names. This file is useful for indexing the usage of each kind of molecule across datasets, and also for computing physicochemical features for each such molecule (software packages such as RDKit15 and Mordred16 compute these directly from SMILES). When stimuli correspond to mixtures of unknown composition, e.g. “cloves”, a unique stimulus ID is generated but, in such cases, it may not be possible to link it back to specific compounds in molecules.csv. In cases where a dataset includes calculated or experimentally measured physicochemical properties of molecules, these may be included in physics.csv, typically indexed on CID. Collectively these files describe the odorant “object”. The subject side of the data is principally described in behavior.csv, a long-format dataframe, indexed on stimulus ID. This standard, widely referred to as ‘Tidy’, or sometimes ‘Third normal form’17, can easily accommodate more complex, multi-level designs, and reduces both redundancy in data representation and the unnecessary proliferation of files and tables.
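As a minimal sketch of how molecules.csv can index the usage of a molecule across datasets, the following example uses two small in-memory tables in place of real archives. The archive names, the column headers (CID, name), and the rows are illustrative assumptions and are not excerpts from Pyrfume.

```python
import csv
import io

# Hypothetical molecules.csv contents for two archives (illustrative rows only).
ARCHIVE_A = """CID,name
7410,acetophenone
31276,isoamyl acetate
1183,vanillin
"""
ARCHIVE_B = """CID,name
7410,acetophenone
1183,vanillin
240,benzaldehyde
"""


def read_molecules(text):
    """Return {CID: name} from the text of a molecules.csv-style table."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["CID"]): row["name"] for row in reader}


def usage_by_cid(archives):
    """Map each CID to the sorted list of archive names that contain it."""
    usage = {}
    for archive_name, text in archives.items():
        for cid in read_molecules(text):
            usage.setdefault(cid, []).append(archive_name)
    return {cid: sorted(names) for cid, names in usage.items()}


archives = {"archive_a": ARCHIVE_A, "archive_b": ARCHIVE_B}
usage = usage_by_cid(archives)

# CIDs present in every archive are candidates for cross-dataset comparison.
shared = sorted(cid for cid, names in usage.items() if len(names) == len(archives))
print(shared)
```

Because every archive keys its molecules on CID, the same lookup extends to any number of archives without per-dataset name matching.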

Along with these files, each data archive includes a simple, standardized, machine-readable file named manifest.toml, which describes the contents of the archive. The manifest outlines relevant citations and credits, lists both raw and processed data files, and provides essential context or metadata for interpreting the data. It also includes a list of the code used to process the raw data. Additionally, every archive contains a Python script, main.py, that documents the data processing workflow. This script allows Pyrfume users to view how raw data files have been processed and reproduce or modify the processing pipeline. Whenever possible, raw data files under 5 MB in size will also be included directly in the archive, ensuring availability and accessibility.
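To make the role of the manifest concrete, the sketch below renders a small TOML document from a Python dictionary. The section names (source, raw, processed, code), the keys, and all values are assumptions chosen to mirror the description above; the actual manifest.toml files in the archive may be organized differently.

```python
import json


def render_manifest(sections):
    """Render {section: {key: value}} as TOML text with quoted keys and string values."""
    lines = []
    for section, entries in sections.items():
        lines.append(f"[{section}]")
        for key, value in entries.items():
            # For plain strings like these, json.dumps produces double-quoted,
            # escaped text that is also a valid TOML basic string. Quoting the
            # keys keeps the dots in file names from being read as nested tables.
            lines.append(f"{json.dumps(key)} = {json.dumps(value)}")
        lines.append("")
    return "\n".join(lines)


# Hypothetical manifest contents for an imaginary archive.
manifest = render_manifest({
    "source": {"title": "Example olfaction study (placeholder)"},
    "raw": {"raw_data.xlsx": "Ratings as originally published"},
    "processed": {
        "molecules.csv": "Molecules used, indexed on CID",
        "stimuli.csv": "Stimulus IDs mapped to molecules and concentrations",
        "behavior.csv": "Long-format measurements indexed on stimulus ID",
    },
    "code": {"main.py": "Processing workflow"},
})
print(manifest)
```

Keeping this description in a single machine-readable file means a tool can list an archive's contents without opening any of the data files.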

Data Records

All datasets are available to download at [10.5281/zenodo.13820408]18, and through the companion Python library and the Python, R, and REST APIs. They can also be accessed directly on GitHub at http://github.com/pyrfume/pyrfume-data. Pyrfume is intended for non-commercial purposes under FAIR use principles, except where licenses permit commercial use. More information, including full documentation and links to source code, can be found at http://pyrfume.org.
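For readers who prefer to fetch individual files directly from GitHub, the sketch below builds a raw-file address for one file in one archive. The branch name (main), the one-directory-per-archive layout, and the archive name example_archive are assumptions made for illustration; only the repository name comes from the text above.

```python
from urllib.parse import quote

REPO = "pyrfume/pyrfume-data"  # repository named in the text
BRANCH = "main"                # assumption: name of the default branch


def raw_file_url(archive, filename, branch=BRANCH):
    """Build a raw.githubusercontent.com address for one file in one archive."""
    parts = (quote(branch), quote(archive), quote(filename))
    return f"https://raw.githubusercontent.com/{REPO}/" + "/".join(parts)


# "example_archive" is a hypothetical archive directory name.
url = raw_file_url("example_archive", "behavior.csv")
print(url)

# Downloading requires network access, for example:
#   import urllib.request
#   text = urllib.request.urlopen(url).read().decode("utf-8")
```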

File structure within Pyrfume is designed to align multiple curated datasets and optimize them for data parsing. Each data source is structurally organized into separate files, which may include detailed subject, stimuli, and experimental behavior information. In cases where stimuli are administered to distinct subjects, the subjects.csv files provide information specific to each subject, including unique identifiers. For studies involving human participants, these files also include demographic details such as race, ethnicity, and age. Testing stimuli are arranged into two files. The first file, referred to as stimuli.csv, contains experiment-specific stimulus identifiers and lists, where applicable, PubChem compound identifiers (CIDs), concentrations, ratios, and solvents. This file can include single molecules, mixtures, or unique substances. The second file, known as molecules.csv, provides detailed information about all tested molecules, which may include CIDs, odor names, Chemical Abstracts Service (CAS) numbers, SMILES, and molecular weights. The behavior.csv file contains all observed experimental measures. It is organized by stimulus, subject, experimental measure, and the value of that measure. The measures column specifies the variable being recorded, while its corresponding value is found in the experimental values column. An example of this layout can be seen in Fig. 2.
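The relationship between these files can be sketched by joining a long-format behavior table to its stimuli table and pivoting the result into an odorant x feature matrix. The column names (Stimulus, CID, Concentration, Subject, Measure, Value), the identifiers, and the numbers below are assumptions for a hypothetical glomerular imaging experiment, not a real archive.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical stimuli.csv: stimulus ID -> molecule and concentration.
STIMULI = """Stimulus,CID,Concentration
s1,7410,0.01
s2,240,0.01
"""

# Hypothetical behavior.csv: one measurement per row (long format).
BEHAVIOR = """Stimulus,Subject,Measure,Value
s1,mouse_1,glomerulus_1,0.75
s1,mouse_1,glomerulus_2,0.125
s1,mouse_2,glomerulus_1,0.25
s2,mouse_1,glomerulus_1,0.5
s2,mouse_2,glomerulus_2,1.0
"""


def rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# Join key: each behavior row points at a stimulus, which points at a CID.
stimulus_to_cid = {r["Stimulus"]: int(r["CID"]) for r in rows(STIMULI)}

# Collect every value recorded for each (CID, measure) pair across subjects.
grouped = defaultdict(list)
for r in rows(BEHAVIOR):
    cid = stimulus_to_cid[r["Stimulus"]]
    grouped[(cid, r["Measure"])].append(float(r["Value"]))

# Pivot to an odorant x feature matrix, averaging across subjects.
matrix = defaultdict(dict)
for (cid, measure), values in grouped.items():
    matrix[cid][measure] = mean(values)
matrix = dict(matrix)
print(matrix)
```

Averaging across subjects is only one possible aggregation. Because the long format keeps every measurement in its own row, per-subject matrices can be built from the same table by adding the subject to the grouping key.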

Fig. 2

Schematic showing important data-formatting standards of Pyrfume. In a hypothetical and simplified experiment, two odorants were tested on two animals, and data were collected across a total of five glomeruli. The Pyrfume archive is organized around a subject-object distinction. The molecules.csv file (lower left) contains chemical descriptors, which can be used to index experimental measurements. The stimuli.csv file (lower right) provides a mapping between stimulus conditions and chemical information. The subjects.csv file (upper left) catalogs all experimental levels for all subjects. Lastly, behavior.csv (upper right) is indexed on the stimuli, and contains the actual experimental measurements, with one measurement per row.

Technical Validation

All curated datasets were derived from a wide body of published and commercial olfactory data. Each was evaluated by the corresponding data collectors and/or distributors prior to compilation in the Pyrfume repository, and was then reviewed and cross-checked by the authors of this manuscript.

Usage Notes

In addition to the repository files18, Pyrfume can be accessed via [http://github.com/pyrfume]. Pyrfume is a live repository on GitHub and will be updated as new datasets are added.

Sample use cases

The basic Pyrfume ecosystem is described in Figs. 1 and 2, and is defined by two ways of interacting with data, which we call bottling and unbottling. Bottling involves investigators readying their experimental data for machine learning applications, and the goal of this process is essentially to make life easy for downstream users (data scientists, computational neuroscientists, machine learning engineers, etc.) through standardization. In a typical workflow, an investigator would compile an inventory of all odorants used in their experiments, and then use the Pyrfume functions get_cids() and from_cids() to programmatically create the molecules.csv file. Examples of these functions, and others, can be found in Table 2. Creation of the behavior.csv file is more idiosyncratic to the particular experiment under consideration, but is rarely more than an hour’s work. The critical step, as alluded to above, involves defining the measurements that will comprise individual cell values of the dataframe (e.g. “each cell = peak deltaF/F measured in one glomerulus, for one odorant”, or “each cell = an EC50 value reported for one receptor responding to one odorant”, or “each cell = a perceptual rating applied for one descriptor, for one odorant”, etc.). As of the writing of this manuscript, there are >40 bottled experiments, comprising data for >20,000 unique odorants. In addition to data assembled from supplemental materials of published research, Pyrfume contains cleaned and digitized versions of several large databases, which are shown in Table 1.
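The cell-definition step of bottling can be sketched as a reshaping from the wide table an experimenter might keep into long-format rows with one measurement each. This example uses only the Python standard library and does not call Pyrfume; the column names, the use of the CID as the stimulus ID, and the ratings are assumptions for a hypothetical perceptual rating experiment ("each cell = a perceptual rating applied for one descriptor, for one odorant").

```python
import csv
import io

# Hypothetical wide table: one row per odorant, one column per descriptor.
WIDE = """CID,fruity,sweet
31276,80,65
1183,10,90
"""

FIELDS = ["Stimulus", "Measure", "Value"]


def bottle_behavior(wide_text, id_column="CID"):
    """Melt a wide table into long-format rows: one measurement per row."""
    long_rows = []
    for row in csv.DictReader(io.StringIO(wide_text)):
        stimulus = row[id_column]
        for measure, value in row.items():
            if measure != id_column:
                long_rows.append(
                    {"Stimulus": stimulus, "Measure": measure, "Value": float(value)}
                )
    return long_rows


def to_csv(long_rows):
    """Serialize long-format rows as the text of a behavior.csv-style file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(long_rows)
    return buffer.getvalue()


behavior_rows = bottle_behavior(WIDE)
behavior_csv = to_csv(behavior_rows)
print(behavior_csv)
```

Experiments with several subjects or concentrations would add further index columns, but each row would still hold one measurement.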

Table 2 The most commonly used Pyrfume functions for creating and working with archives.

Pyrfume offers scientists, engineers, and trainees the opportunity to discover, test, and explore the world of olfactory experience through access to an unprecedented volume and diversity of data linked through a standardized format. Standardization allows for cross-modal analyses, meta-analyses, and benchmark construction for the next generation of predictive models, an example of which can be found in Supplementary Figure S2. The next step is up to the broader research community, whom we invite to use this resource and, where applicable, to contribute their own datasets to increase their visibility and utility. In the past, olfactory research has been significantly constrained by the scarcity and inadequacy of available data, making it challenging to draw robust conclusions or develop sophisticated theories. We therefore hope that this resource will provide benchmark datasets for a variety of models and raise the bar for theoretical efforts.