Fig. 3: The DNA sequence content is a major predictor of DNA replication IS.
From: A predictable conserved DNA base composition signature defines human core DNA replication origins

a Graph showing the percentage of origins in each quantile that overlap with G4 structures defined by G4Hunter (ref. 29; in silico) or by mismatches (ref. 28; in vitro G4). Dotted lines (CTL) represent overlap with control regions. b Base content of the regions flanking human DNA replication origins and control genomic regions. Frequency plots are centred at the origin summits. The base frequency represents the proportion of each base (0–1). The human genome is composed of 30% each of A and T and 20% each of G and C, as indicated by the genomic average. Origins are oriented with the highest G-content upstream. c Density plot representing the frequency of the distances measured between the initiation site (IS) summit (dotted line) and the centre/summit of the nearest ORC1 (red), ORC2 (dark red) and MCM7 (blue) bound regions. Origins are oriented with the highest G-content upstream. d As in c, but for stochastic origins. e Schematic representation of a core origin. The vertical line represents the IS summit. The nearest ORC1, ORC2 and MCM7 peak centres are shown, together with their average distance from the core IS summit. The average size of the ORC1, ORC2 and MCM7 binding sites is indicated on the left. f Bar plot showing the percentage of origins that can be predicted by the genome-scanning (GS) algorithm. Dotted bars represent the expected overlap with control regions. The pie chart shows the percentage of false-positive results (grey). P-values were obtained by chi-square goodness-of-fit test using observed and expected values for overlap. g Percentage of origins in each quantile predictable by the GS algorithm, as in f. h Percentage of Mus musculus origins predicted by the GS algorithm, as in f. i Bar plots representing the percentage of core origins that can be predicted using a combination of the GS algorithm and two different machine-learning algorithms (support vector machine (SVM) and logistic regression (LR) with greedy feature selection). P-values were obtained by chi-square goodness-of-fit test using observed and expected values for overlap. j Schematic showing the properties of the regions predicted to be origins. G-richness in the immediate (0.5 kb) and distal (2 kb) regions upstream of the initiation site are predictive parameters.
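The chi-square goodness-of-fit test named in panels f and i can be illustrated with a minimal sketch. It assumes the test compares two categories (origins that overlap a prediction versus origins that do not) against the counts expected from control regions, which gives one degree of freedom. The function name `overlap_gof` and the counts in the usage line are hypothetical and are not taken from the paper.

```python
import math


def overlap_gof(n_total, n_overlap_observed, n_overlap_expected):
    """Chi-square goodness-of-fit on overlap vs. non-overlap counts.

    Assumption (not from the source): two categories, so the statistic
    has 1 degree of freedom. All names here are illustrative.
    """
    observed = [n_overlap_observed, n_total - n_overlap_observed]
    expected = [n_overlap_expected, n_total - n_overlap_expected]
    if min(expected) <= 0:
        raise ValueError("expected counts must be positive in both categories")

    # Pearson statistic: sum over categories of (O - E)^2 / E.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # For 1 degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 100 origins, 60 overlapping, 50 expected by chance.
chi2, p = overlap_gof(100, 60, 50)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

With more than two categories the closed-form `erfc` shortcut no longer applies, and a general chi-square survival function would be needed instead.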