Fig. 1: Workflow diagram of regression model development for predicting neutrophil percentage from gene expression data.

A total of 1254 passing samples with CBC test results were used to create machine learning regression models to predict neutrophil percentage. a, b, d Train-test splits for regression model development were created by randomly splitting the 600 unique participants into an 80% train set and a 20% test set, then assigning each participant's samples to the corresponding set. Three linear models were created to compare different methods of feature selection: a biology-based, selecting only blood cell-enriched genes; b data-driven, using mutual information feature selection across all genes; and d combined, including genes from both biology-based and data-driven selection. c Additionally, an XGBoost regression model was developed using all 58,780 transformed gene counts. We used the best-performing model to predict neutrophil percentage for 2932 PDBP samples and 2711 PPMI samples with no known neutrophil percentage.
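The participant-level split described in the caption can be illustrated with a minimal sketch. This is not the authors' code: the function name `split_by_participant`, the sample and participant identifiers, and the random seed are all hypothetical, and the toy data stand in for the real cohort. The point it shows is that participants, not samples, are shuffled and divided, so that all samples from one participant land on the same side of the split.

```python
import random


def split_by_participant(sample_to_participant, test_fraction=0.2, seed=0):
    """Split samples into train/test sets at the participant level.

    sample_to_participant: dict mapping sample ID -> participant ID
    (hypothetical identifiers, used here for illustration only).
    """
    # Shuffle the unique participants reproducibly.
    participants = sorted(set(sample_to_participant.values()))
    rng = random.Random(seed)
    rng.shuffle(participants)

    # Reserve the requested fraction of participants for testing.
    n_test = round(len(participants) * test_fraction)
    test_participants = set(participants[:n_test])

    # Assign every sample to the set its participant belongs to.
    train_samples, test_samples = [], []
    for sample, participant in sample_to_participant.items():
        if participant in test_participants:
            test_samples.append(sample)
        else:
            train_samples.append(sample)
    return train_samples, test_samples


# Toy data: 10 participants with 1 to 3 samples each (19 samples in total).
toy = {f"S{p}_{v}": f"P{p}" for p in range(10) for v in range(1 + p % 3)}
train_ids, test_ids = split_by_participant(toy, test_fraction=0.2, seed=0)
```

Because the split is made over participants, the share of samples in the test set will generally differ somewhat from 20% when participants contribute unequal numbers of samples.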