Fig. 1: CHORD is a random forest Classifier of HOmologous Recombination Deficiency able to distinguish between BRCA1- and BRCA2-type HRD phenotypes in a pan-cancer context.
From: Pan-cancer landscape of homologous recombination deficiency

a The features used for training CHORD are relative counts of different mutation contexts, which fall into one of three groups based on mutation type. (i) Single nucleotide variants (SNV): six possible base substitutions (C>A, C>G, C>T, T>A, T>C, T>G). (ii) Indels: indels with flanking microhomology (del.mh, ins.mh), within repeat regions (del.rep, ins.rep), or not falling into either of these two categories (del.none, ins.none). (iii) Structural variants (SV): SVs stratified by type and length. Relative counts were calculated separately for each of the three mutation types. b Training and application of CHORD. From a total of 3,824 metastatic tumor samples, 2,026 samples were selected for training CHORD. The model outputs the probability of BRCA1-type HRD and of BRCA2-type HRD, with the probability of HRD being the sum of these two probabilities. The performance of CHORD was assessed via a 10-fold nested cross-validation (CV) procedure on the training samples, as well as by applying the model to the BRCA-EU dataset (543 primary breast tumors) and the PCAWG dataset (1,854 primary tumors). Lastly, CHORD was applied to all samples in the HMF and PCAWG datasets in order to gain insights into the pan-cancer landscape of HRD. c The features used by CHORD to predict HRD as well as BRCA1-type HRD and BRCA2-type HRD, with their importance indicated by mean decrease in accuracy. Deletions with 2 to ≥5 bp (i.e. ≥2 bp) of flanking microhomology (del.mh.bimh.2.5) were the most important feature for predicting HRD as a whole, with 1–100 kb structural duplications (DUP_1e03_1e04_bp, DUP_1e04_1e05_bp) differentiating BRCA1-type HRD from BRCA2-type HRD. Boxplots and dots (n = 10) show the feature importance over the 10 folds of nested CV on the training set, with the red line showing the feature importance in the final CHORD model. Boxes show the interquartile range (IQR) and whiskers show the largest/smallest values within 1.5 times the IQR.
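
The two calculations described in panels a and b can be illustrated with a minimal Python sketch. This is not the CHORD implementation: the function names (`relative_counts`, `hrd_probability`), the `GROUPS` mapping, and all counts and probabilities below are hypothetical and chosen only for illustration. The sketch assumes that "relative count" means a feature's raw count divided by the total count of its own mutation type (SNV, indel, or SV).

```python
from collections import defaultdict

# Hypothetical mapping from feature name to mutation type.
# Only a handful of features named in the caption are listed here.
GROUPS = {
    "C>A": "snv", "C>T": "snv",
    "del.mh": "indel", "del.rep": "indel",
    "DUP_1e03_1e04_bp": "sv", "DUP_1e04_1e05_bp": "sv",
}

def relative_counts(counts, groups=GROUPS):
    """Divide each raw count by the total of its own mutation type."""
    totals = defaultdict(float)
    for feature, n in counts.items():
        totals[groups[feature]] += n
    return {
        feature: (n / totals[groups[feature]] if totals[groups[feature]] else 0.0)
        for feature, n in counts.items()
    }

def hrd_probability(p_brca1_type, p_brca2_type):
    """Probability of HRD as the sum of the two subtype probabilities."""
    return p_brca1_type + p_brca2_type

# Made-up raw counts for one sample (illustrative values only).
raw = {
    "C>A": 10, "C>T": 30,
    "del.mh": 5, "del.rep": 15,
    "DUP_1e03_1e04_bp": 2, "DUP_1e04_1e05_bp": 2,
}
rel = relative_counts(raw)
p_hrd = hrd_probability(0.5, 0.25)  # made-up subtype probabilities
```

Because normalization is done per mutation type, the relative counts within each of the three groups sum to 1, so a sample with a very high SNV load does not shrink its indel or SV features.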