Extended Data Fig. 8: Heterogeneous clonal landscape and power calculation.
From: Sex and smoking bias in the selection of somatic mutations in human bladder

a) From simulated datasets reflecting the same distributional features and data dependencies found in the study cohort, we computed the statistical power as the proportion of simulations in which the variable of interest (sex) was significant in the univariate linear mixed-effects regression against truncating dN/dS. In this analysis the female group was used as the baseline group. We simulated data with different ground-truth female baselines (expected truncating dN/dS among females) and between-group differences (effect sizes). Each baseline-effect combination yields a power value; these values are represented collectively as power profiles (see Supplementary Note 10). For the two exemplary genes RBM10 and ARID1A we highlight the profile curves corresponding to their observed baselines in the cohort and the projected power given the inferred effect in the cohort. b) Table summarizing the five associations found between dN/dS and sex in the study. Each column is briefly defined here; see Supplementary Note 10 for a more in-depth account of the methodology. CSQN: Either missense or truncating; the specific dN/dS used as the response variable in the association analysis. ESTIMATE: Coefficient of the binary variable of interest (“is_male”) inferred via linear mixed-effects regression against dN/dS using the donor as a random intercept. CI_LOW, CI_HIGH: Lower and upper bounds of the 95% CI of ESTIMATE. PVAL: p-value associated with the variable of interest in the regression analysis. INTERCEPT: Inferred intercept in the regression analysis. BASELINE: Average CSQN-specific dN/dS value in the baseline group of samples (female). INTERCEPT and BASELINE are expected to follow one another closely. COVARIATE: The (binary) explanatory variable representing sex. POWER: Statistical power corresponding to the BASELINE and ESTIMATE in the power profile.
EFFECT_PVAL: The “effect p-value” is an ad hoc metric that we defined as the proportion of times the sex coefficient attains a value at least as high as ESTIMATE upon regression with a dataset corresponding to BASELINE and zero ground-truth effect. It can be thought of as an effect-aware false positive rate. c) Frequency of tumor samples with missense or truncating mutations in 6 genes in males and females across 2,965 bladder carcinomas from the GENIE cohort. d) Multivariate logistic regression (including age) of sex on mutations in the 6 genes. Circles represent the point estimates of the effect size from the logistic regression, and horizontal lines the 95% confidence intervals. Circles with a dark outline denote significant associations (FDR threshold of 0.2). e) Distribution of the expected number of mutations in the two TERT promoter mutational hotspots (chr5:1295113 and chr5:1295135) across donors younger than 55 years or never smokers, assuming a mutation rate equal to that observed across ever smokers older than 55 years. The red dashed vertical line represents the observed number of mutations in the two hotspots across donors younger than 55 years or never smokers. The p-value was calculated empirically based on 10,000 randomizations, as described in Supplementary Note 10. f) Maximum variant allele frequency detected for activating TERT promoter mutations in a sample versus the number of activating TERT promoter mutations (i.e. mutations observed in tumors, see main text) identified in that sample. The observation of different activating TERT promoter mutations in the same sample indicates convergent evolution of TERT promoter mutations. This, in turn, suggests that mutations with large variant allele frequency may also represent multiple mutated clones carrying the same mutation (convergent evolution) rather than very large clones.
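
The empirical p-value in panel e and the EFFECT_PVAL column in panel b both amount to asking how often a null simulation reaches a value as extreme as the observed one. The following is a minimal sketch of that idea only, not the procedure of Supplementary Note 10. The function names, the Bernoulli null model, the choice of tail, and all numbers in the example (120 units, a rate of 0.25, an observed count of 15) are hypothetical placeholders rather than values from the study.

```python
import random


def simulate_null_counts(n_units, rate, n_randomizations=10_000, seed=0):
    """Simulate total mutation counts under a reference mutation rate.

    Assumption (not from the source): each of `n_units` independent units
    carries a hotspot mutation with probability `rate`, the rate taken
    from the reference group.
    """
    rng = random.Random(seed)
    return [
        sum(1 for _ in range(n_units) if rng.random() < rate)
        for _ in range(n_randomizations)
    ]


def empirical_pvalue(null_values, observed, tail):
    """Proportion of null values at least as extreme as `observed`.

    tail="upper" counts null values >= observed (the direction used in
    the EFFECT_PVAL definition); tail="lower" counts values <= observed.
    """
    if tail == "upper":
        hits = sum(1 for v in null_values if v >= observed)
    elif tail == "lower":
        hits = sum(1 for v in null_values if v <= observed)
    else:
        raise ValueError("tail must be 'upper' or 'lower'")
    return hits / len(null_values)


# Hypothetical example: 120 units, reference rate 0.25, 15 observed mutations.
null_counts = simulate_null_counts(n_units=120, rate=0.25)
p_value = empirical_pvalue(null_counts, observed=15, tail="lower")
print(f"empirical p-value: {p_value:.4f}")
```

For EFFECT_PVAL, `null_values` would instead hold the sex coefficients obtained by refitting the regression on datasets simulated with zero ground-truth effect, with `tail="upper"` and `observed` set to ESTIMATE.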