
Genome sequencing tells us the order of the As, Cs, Gs and Ts in a genome, but annotation programs make sense of it all by telling us where genes are and what they look like. Gene-finding programs are reasonably good at identifying protein-coding regions, but are less proficient at finding other potentially important sequences — such as cis-regulatory regions and non-coding exons — that lie upstream of the translational start site. Now, Davuluri and colleagues have filled this technical gap by developing a program that accurately recognizes promoters and first exons. Although the program was developed to annotate the human genome, the authors believe it will also prove useful for the genomes of other species.
The starting point in constructing any sequence prediction program involves 'training' the algorithm to recognize the type of sequence you want. Because most sequence annotations do not contain information about 5′ untranslated regions, the authors constructed their own data set of more than 2,000 genes for which first exons and promoters had been experimentally validated. Using these sequences, the algorithm 'learned' to recognize features ∼500 bp either side of the first exon, defined as the region between a promoter and the first splice-donor site. The program, called first-exon finder or FirstEF, operates by finding every potential promoter and splice-donor site and then calculating the probability that the intervening sequence is a first exon. The power of FirstEF lies in its ability to identify first exons that are associated with either CpG-rich or CpG-poor promoters, and to predict both coding and non-coding first exons. Two tests confirm the accuracy of FirstEF. When the algorithm was trained on 90% of the gene data set and then tested on the remaining 10%, it correctly predicted 84% of first exons. Its performance on the annotated genomic sequences of human chromosomes 21 and 22 (from the public consortium) was also quite impressive, whether it was asked to confirm experimentally validated first exons or to localize promoters upstream of annotated genes. FirstEF is the first and the only computational tool available at present that can predict first exons, especially non-coding ones.
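The pairing step described above (enumerate every candidate promoter and splice-donor site, then score the sequence between them) can be illustrated with a minimal Python sketch. This is not the FirstEF implementation: the function names, the use of a bare 'GT' dinucleotide as a stand-in for splice-donor prediction, the CpG-fraction score and the `max_len` cutoff are all assumptions made for illustration. A real predictor would replace `cpg_fraction` with a trained probability model.

```python
from typing import Callable, List, NamedTuple


class Candidate(NamedTuple):
    promoter_pos: int  # hypothetical 0-based start of the candidate first exon
    donor_pos: int     # 0-based position of the candidate splice-donor site
    score: float       # placeholder score, NOT a FirstEF probability


def find_donor_sites(seq: str) -> List[int]:
    """Return positions of 'GT' dinucleotides (crude stand-in for donor-site prediction)."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]


def cpg_fraction(segment: str) -> float:
    """Toy score: fraction of dinucleotide positions in the segment that are 'CG'."""
    if len(segment) < 2:
        return 0.0
    return segment.count("CG") / (len(segment) - 1)


def enumerate_first_exons(
    seq: str,
    promoter_positions: List[int],
    donor_positions: List[int],
    score_fn: Callable[[str], float] = cpg_fraction,
    max_len: int = 1000,      # assumed upper bound on first-exon length
    threshold: float = 0.0,   # assumed minimum score to report a candidate
) -> List[Candidate]:
    """Pair every promoter with every downstream donor site and score the interval."""
    candidates = []
    for p in promoter_positions:
        for d in donor_positions:
            # Only consider donors downstream of the promoter and within max_len.
            if p < d <= p + max_len:
                score = score_fn(seq[p:d])
                if score >= threshold:
                    candidates.append(Candidate(p, d, score))
    # Highest-scoring candidates first.
    return sorted(candidates, key=lambda c: c.score, reverse=True)
```

Calling `enumerate_first_exons(seq, promoters, find_donor_sites(seq))` returns the candidate intervals ranked by the placeholder score. Raising `threshold` keeps only the higher-scoring pairs.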
The effort of annotating the human genome is likely to continue for many more years, but FirstEF has brought bioinformatics one step closer to its goal of defining the 5′ boundaries and non-coding regions of genes. Notably, FirstEF has estimated the percentage of CpG-related first exons to be 70%, and not 50% as was previously believed. And, if you like a challenge, the authors have made FirstEF's predictions — all 68,645 of them — from the working draft of the human genome available for scrutiny.
ORIGINAL RESEARCH PAPER
Davuluri, R. V. et al. Computational identification of promoters and first exons in the human genome. Nature Genet. 29, 412–417 (2001)
Cite this article
Casci, T. Filling the gap in gene prediction. Nat Rev Genet 3, 7 (2002). https://doi.org/10.1038/nrg715