Supplementary Figure 1: The structure of the human–mouse gene expression training compendium.
From: Found In Translation: a machine learning model for mouse-to-human inference

(a) Expression datasets were assembled from the Gene Expression Omnibus (GEO) and manually curated. Each dataset contains at least 3 control samples and at least 3 disease samples. Within each species, GEO datasets were divided into datasets of matched disease and control samples from parallel conditions. (b) The compendium contains 170 Cross-Species Pairings (CSPs) covering 28 different diseases and spanning 3,033 human samples and 1,181 mouse samples. The CSPs are divided into two types: Standard (ST), in which the human and mouse datasets were generated in separate experiments; and Reference (RF), in which the human and mouse datasets were directly contrasted in a publication authored by researchers who had generated at least one of the datasets. (c) Number of CSPs per disease. (d) Composition of the training compendium by technology and dataset type. (e) Summary of platform types included in the compendium in human (top) and mouse (bottom). (f) Summary of tissue types included in the training compendium in human (top) and mouse (bottom).
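
As an illustration only, the inclusion criterion in (a) and the two CSP types in (b) could be represented roughly as follows. This is a minimal sketch and not the authors' curation code: the class names, field names, the `make_csp` helper, and the placeholder accessions are all hypothetical, and the requirement that both datasets in a pairing share one disease label is an assumption drawn from panel (c), which counts CSPs per disease.

```python
from dataclasses import dataclass
from enum import Enum

# From the caption: each dataset has at least 3 control and at least 3 disease samples.
MIN_SAMPLES_PER_GROUP = 3


class CSPType(Enum):
    STANDARD = "ST"   # human and mouse datasets generated in separate experiments
    REFERENCE = "RF"  # datasets directly contrasted in a publication by their authors


@dataclass(frozen=True)
class Dataset:
    # All field names are hypothetical; the caption does not define a schema.
    accession: str
    species: str      # "human" or "mouse"
    disease: str
    n_control: int
    n_disease: int

    def meets_inclusion_criterion(self) -> bool:
        """Check the sample-count threshold stated in panel (a)."""
        return (self.n_control >= MIN_SAMPLES_PER_GROUP
                and self.n_disease >= MIN_SAMPLES_PER_GROUP)


@dataclass(frozen=True)
class CrossSpeciesPairing:
    human: Dataset
    mouse: Dataset
    csp_type: CSPType


def make_csp(human: Dataset, mouse: Dataset,
             contrasted_by_dataset_authors: bool) -> CrossSpeciesPairing:
    """Build a CSP, labelling it RF or ST (hypothetical helper)."""
    if human.species != "human" or mouse.species != "mouse":
        raise ValueError("a CSP pairs one human dataset with one mouse dataset")
    if human.disease != mouse.disease:
        # Assumption: both datasets in a pairing carry the same disease label.
        raise ValueError("datasets in a CSP must share a disease")
    if not (human.meets_inclusion_criterion() and mouse.meets_inclusion_criterion()):
        raise ValueError("each dataset needs >= 3 control and >= 3 disease samples")
    csp_type = CSPType.REFERENCE if contrasted_by_dataset_authors else CSPType.STANDARD
    return CrossSpeciesPairing(human, mouse, csp_type)


# Placeholder accessions and counts, not real compendium entries.
human_ds = Dataset("HUMAN_DS_1", "human", "example disease", n_control=5, n_disease=6)
mouse_ds = Dataset("MOUSE_DS_1", "mouse", "example disease", n_control=3, n_disease=4)
csp = make_csp(human_ds, mouse_ds, contrasted_by_dataset_authors=False)
```

In this sketch the RF/ST label is supplied by the curator as a boolean flag, reflecting that the caption describes the distinction as the outcome of manual curation and not something computed from the expression data.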