Fig. 1: A complex multi-target tree-structured approach enables MuTATE for automated prognostic biomarker and subtype discovery.

a MuTATE enables explainable multi-endpoint ML by evaluating features across clinical endpoints45. Partitions are selected by information gain (IG) using one of seven criteria: highest average multi-target IG (avgIG), highest IG in any target (maxIG), meaningful IG in the most targets (mostIG), lowest average p-value of statistically significant IG (avgPVal), lowest p-value weighted by the number of targets with significant IG (minPVal), significant IG in the most targets (mostPVal), or subtree lookahead (splitError). Trees predict endpoints and identify biomarkers and subtypes. b Synthetic multi-target data were generated using a positive definite covariance matrix of targets with a correlation structure (mean \(\mu\) = 1, SD \(\sigma\) = 1). Features were generated and sampled with replacement to define the ground truth (GT); targets were divided into leaf quantiles and randomly assigned, yielding multi-target tree-structured data with a known GT. Clinical cohorts with established expert trees were obtained from TCGA via the NCI Genomic Data Portal. In total, 682 biopsies from three cohorts of 711 patients were included. c In simulations, synthetic data and GT were divided into train/test sets (60/40 split), and a grid search assessed model parameters for test error, TDR, and FDR across 18,400 synthetic datasets. Clinical cohorts were likewise divided into train/test sets (60/40 split); training sets underwent parameter tuning, and model performance was recorded. Tuned parameters from the trained models were then applied to the full cohorts. Final trees were assessed for the prognostic significance of partitions, biomarkers, and subtypes. See Figs. S1-6.
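
The three IG-based partition criteria named in panel a (avgIG, maxIG, mostIG) can be illustrated with a minimal sketch. This is not the MuTATE implementation: it assumes continuous targets, uses variance reduction as a stand-in for IG, and treats "meaningful" IG as exceeding a hypothetical threshold `min_gain`. The function names (`info_gain`, `score_split`, `best_split`) are illustrative only, and the p-value-based criteria and splitError lookahead are omitted.

```python
from statistics import pvariance


def info_gain(target, mask):
    """Variance reduction of one target under a binary split (assumed IG proxy)."""
    left = [y for y, m in zip(target, mask) if m]
    right = [y for y, m in zip(target, mask) if not m]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no gain
    n = len(target)
    child = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
    return pvariance(target) - child


def score_split(targets, mask, criterion="avgIG", min_gain=0.05):
    """Aggregate per-target gains into one score; min_gain is a hypothetical cutoff."""
    gains = [info_gain(t, mask) for t in targets]
    if criterion == "avgIG":
        return sum(gains) / len(gains)  # highest average multi-target IG
    if criterion == "maxIG":
        return max(gains)  # highest IG in any single target
    if criterion == "mostIG":
        return sum(g >= min_gain for g in gains)  # number of targets with meaningful IG
    raise ValueError(f"unsupported criterion: {criterion}")


def best_split(features, targets, criterion="avgIG"):
    """Search every feature and observed threshold; return (score, feature, threshold)."""
    best = None
    for name, values in features.items():
        for thr in sorted(set(values))[:-1]:  # the largest value cannot split the data
            mask = [v <= thr for v in values]
            score = score_split(targets, mask, criterion)
            if best is None or score > best[0]:
                best = (score, name, thr)
    return best


# Toy data: feature x1 separates both targets, feature x2 separates neither.
features = {"x1": [0, 0, 1, 1], "x2": [0, 1, 0, 1]}
targets = [[1.0, 1.0, 5.0, 5.0], [2.0, 2.0, 8.0, 8.0]]
print(best_split(features, targets, "avgIG"))
```

In this toy case all three criteria select the same feature; they diverge when a feature is strongly informative for one endpoint but uninformative for the others, which is the situation in which the choice of criterion matters.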