Figure 27: Substitution patterns in interspersed repeats differ as a function of GC content.

We collected all copies of five DNA transposons (Tigger1, Tigger2, Charlie3, MER1 and HSMAR2), chosen for their high copy number and well-defined consensus sequences. DNA transposons are optimal for the study of neutral substitutions: they do not segregate into subfamilies with diagnostic differences, presumably because they are short-lived and new active families do not evolve in a genome (see text). Duplicates and close paralogues resulting from duplication after transposition were eliminated. The copies were grouped on the basis of the GC content of the flanking 1,000 bp on both sides and aligned to the consensus sequence (representing the state of the copy at integration). Recursive efforts using parameters arising from this study did not change the alignments significantly. Alignments were inspected by hand, and obvious misalignments caused by insertions and duplications were eliminated. Substitutions (n = 80,000) were counted for each position in the consensus, excluding those in CpG dinucleotides, and a substitution frequency matrix was defined. From the matrices for each repeat (which corresponded to different ages), a single rate matrix was calculated for each of three bins of GC content (< 40% GC, 40–47% GC and > 47% GC). Data are shown for a repeat with an average divergence (in non-CpG sites) of 18% in 43% GC content (the repeat has slightly higher divergence in AT-rich DNA and lower divergence in GC-rich DNA). From the rate matrix, we calculated log-likelihood matrices with different entropies (divergence levels), which are theoretically optimal for alignments of neutrally diverged copies to their common ancestral state (A. Kas and A. F. A. Smit, unpublished). These matrices are in use by the RepeatMasker program.
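The binning and counting steps described above can be illustrated with a minimal Python sketch. It is not the authors' code. The function names (`gc_bin`, `cpg_positions`, `count_substitutions`, `frequency_matrix`) are hypothetical, and the sketch assumes that each copy is already aligned to a gap-free consensus of the same length, with `-` marking deletions in the copy. It also assumes that CpG sites are identified in the consensus and that the bin edges are applied as < 40%, 40 to 47% inclusive, and > 47%. Deriving the rate matrix and the log-likelihood matrices is not attempted here.

```python
from collections import Counter

BASES = "ACGT"


def gc_bin(flank: str) -> str:
    """Assign a copy to a GC bin from its flanking sequence (assumed bin edges)."""
    seq = flank.upper()
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
    if gc_percent < 40:
        return "<40%"
    if gc_percent <= 47:
        return "40-47%"
    return ">47%"


def cpg_positions(consensus: str) -> set:
    """Return indices of consensus bases that belong to a CpG dinucleotide."""
    skip = set()
    for i in range(len(consensus) - 1):
        if consensus[i:i + 2] == "CG":
            skip.update((i, i + 1))
    return skip


def count_substitutions(consensus: str, copy: str) -> Counter:
    """Count (consensus base, copy base) pairs at non-CpG, ungapped columns."""
    if len(consensus) != len(copy):
        raise ValueError("consensus and copy must be aligned to equal length")
    skip = cpg_positions(consensus)
    counts = Counter()
    for i, (a, b) in enumerate(zip(consensus, copy)):
        # Skip CpG sites, gaps and ambiguous bases.
        if i in skip or a not in BASES or b not in BASES:
            continue
        counts[(a, b)] += 1
    return counts


def frequency_matrix(counts: Counter) -> dict:
    """Row-normalise counts: matrix[a][b] is the fraction of consensus a seen as b."""
    matrix = {}
    for a in BASES:
        total = sum(counts[(a, b)] for b in BASES)
        matrix[a] = {b: (counts[(a, b)] / total if total else 0.0) for b in BASES}
    return matrix
```

In use, the counts from all copies of one repeat that fall into the same GC bin would be summed (for example with `Counter.update`) before `frequency_matrix` is called, giving one matrix per repeat per bin.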