Figure 1

Data processing paradigm flowchart. Data curation was performed to identify gene expression datasets in which the given biological process is perturbed (e.g., the process is increased in CMP 1, with direction \(+1\), and decreased in CMP 2 or CMP m, with direction \(-1\)). DEG analysis was performed on the curated datasets, where \(+1/-1/0\) represent up-regulated, down-regulated, or unchanged genes, respectively. To prioritize genes important to the biological process, an affinity score of \(+1/-1/0\) was first calculated for each gene in each curated dataset by comparing the gene expression change with the direction of regulation of the biological process: \(+1\) indicates that the gene (e.g., Gene 1 in CMP 1 and CMP 2) is positively related to the biological process, \(-1\) indicates that the gene (e.g., Gene 2 in CMP m and Gene n in CMP 1) is negatively related to the biological process, and \(0\) indicates no relation of the gene to the biological process. A missing measurement (denoted NA, e.g., Gene 3 in CMP 1) indicates an unknown affinity of the gene in that dataset. A consensus score was then calculated for each gene by summing its affinity scores across the perturbed datasets. Genes with higher consensus scores were identified as more related to the biological process.
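
The scoring described in this caption can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the caption: the affinity score is taken to be the product of the DEG call and the perturbation direction, and NA entries are skipped when summing. The function names (`affinity`, `consensus_scores`) and the example values are hypothetical. Only the entries named in the caption (Gene 1 in CMP 1 and CMP 2, Gene 2 in CMP m, Gene n and Gene 3 in CMP 1) are chosen to match it; the rest are filled in for illustration.

```python
from typing import Dict, Optional

def affinity(deg_call: Optional[int], direction: int) -> Optional[int]:
    """Affinity of one gene in one dataset.

    Assumption: affinity = DEG call (+1/-1/0) * process direction (+1/-1).
    A missing measurement (None, i.e. NA) yields an unknown affinity (None).
    """
    if deg_call is None:
        return None
    return deg_call * direction

def consensus_scores(
    deg_calls: Dict[str, Dict[str, Optional[int]]],
    directions: Dict[str, int],
) -> Dict[str, int]:
    """Sum affinity scores per gene across datasets, skipping NA entries."""
    scores = {}
    for gene, calls in deg_calls.items():
        total = 0
        for dataset, call in calls.items():
            score = affinity(call, directions[dataset])
            if score is not None:  # assumption: NA does not contribute to the sum
                total += score
        scores[gene] = total
    return scores

# Direction of the biological process in each curated dataset (from the caption).
directions = {"CMP1": +1, "CMP2": -1, "CMPm": -1}

# DEG calls per gene and dataset; values not named in the caption are illustrative.
deg_calls = {
    "Gene1": {"CMP1": +1, "CMP2": -1, "CMPm": 0},
    "Gene2": {"CMP1": 0, "CMP2": 0, "CMPm": +1},
    "Gene3": {"CMP1": None, "CMP2": -1, "CMPm": 0},
    "GeneN": {"CMP1": -1, "CMP2": 0, "CMPm": +1},
}

consensus = consensus_scores(deg_calls, directions)
# Rank genes from most positively to most negatively related to the process.
ranked = sorted(consensus, key=consensus.get, reverse=True)
```

With these illustrative inputs, Gene 1 changes in the same direction as the process in both CMP 1 and CMP 2, so it receives the highest consensus score, while Gene n opposes the process in two datasets and ranks last.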