Extended Data Fig. 4: Predictive models overview and optimization. | Nature Genetics

From: Genome-scale quantification and prediction of pathogenic stop codon readthrough by small molecules

a, Cross-validated predictive performance of the drug-specific models for the FUr, gentamicin and untreated conditions. b, Downsampling the number of readthrough variants decreases model performance for the high-performing drug models (Supplementary Note 6). The x-axis shows the number of variants with readthrough >1% that were retained and used to rerun the models (gray). To control for the effect of smaller training sets on model performance, control models were rerun after randomly removing the same number of variants (black). Models retaining 20, 50 and 150 variants are intended to represent scenarios similar to the untreated, gentamicin and FUr datasets. The r² values shown are averages over 10 cross-validation rounds. c, Comparison of the r² values of the drug-specific models when using stop type, the three nucleotides downstream and upstream of the PTC, and the interaction between stop type and the three downstream nucleotides, versus when using ElasticNet regularization on 47 sequence features (Extended Data Fig. 1f and Supplementary Table 4). d, Performance of three model formulations across drugs (Supplementary Note 6): using only stop type and the three nucleotides downstream and upstream of the PTC; adding the interaction between stop type and the three upstream nucleotides; or adding the interaction between stop type and the three downstream nucleotides. Only the last of these consistently improves model performance across drugs. The r² values shown are averages over 10 cross-validation rounds. e, Performance of three model formulations across drugs: encoding the three nucleotides downstream and upstream of the PTC as nucleotide triplets (m1); encoding the upstream sequence as a nucleotide triplet and the three downstream nucleotides as three separate terms (one per position, with no interaction among them; m2); and encoding the downstream sequence as a nucleotide triplet and the three upstream nucleotides as three separate terms (one per position, with no interaction among them; m3). m1 consistently yields a higher r² across drugs. The r² values shown are averages over 10 cross-validation rounds. f, Contribution of each sequence feature to the pan-drug model. The y-axis shows the relative drop in r² when each term is removed from the model, normalized to the full model (1 − (r² upon term removal / r² of the full model)). g–j, Model coefficients of the following predictive models: CC90009 and clitocine (g), DAP and SJ6986 (h), G418 and SRI (i), and the down_123 nt (top) and up_123 nt (bottom) coefficients of the pan-drug model (j). The mean, 95% confidence intervals and significance (two-sided Student's t-test) of the coefficient estimates across 10 cross-validation rounds are shown. Asterisks denote an adjusted p-value < 0.01.
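As an illustration of two ideas in this legend, the following minimal Python sketch contrasts the triplet encoding with the per-position encoding compared in panel e, and computes the relative drop in r² defined for panel f. The function names, the feature-key format and the example values are hypothetical and chosen only for illustration; this is not the authors' implementation, and the sketch does not fit any model.

```python
def encode_triplet(seq, prefix):
    """Encode three nucleotides as ONE categorical term (as for m1).

    Each distinct triplet gets its own indicator, so dependencies among
    the three positions are captured implicitly.
    The key format "prefix=SEQ" is a hypothetical convention.
    """
    if len(seq) != 3:
        raise ValueError("expected exactly three nucleotides")
    return {f"{prefix}={seq}": 1}


def encode_positions(seq, prefix):
    """Encode three nucleotides as THREE separate terms, one per position.

    No interaction among positions is represented (as for the
    per-position side of m2 and m3).
    """
    if len(seq) != 3:
        raise ValueError("expected exactly three nucleotides")
    return {f"{prefix}{i}={nt}": 1 for i, nt in enumerate(seq, start=1)}


def relative_r2_drop(r2_full, r2_without_term):
    """Relative drop in r2 on removing a term: 1 - (r2 removed / r2 full)."""
    if r2_full <= 0:
        raise ValueError("r2_full must be positive for the ratio to be meaningful")
    return 1.0 - r2_without_term / r2_full


# Hypothetical example: the same downstream context under both encodings.
triplet_features = encode_triplet("CUA", "down_123")
position_features = encode_positions("CUA", "down")

# Hypothetical r2 values, not taken from the paper.
drop = relative_r2_drop(r2_full=0.8, r2_without_term=0.6)
```

Under the triplet encoding a model can assign a distinct coefficient to each of the 64 possible triplets, whereas the per-position encoding uses only 12 indicators (4 nucleotides at each of 3 positions) and assumes that the positions contribute independently.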
