Although gene prediction programs have advanced significantly in recent years, the best evidence for the actual structure of a gene remains the sequence of a full-length cDNA clone. The most concerted effort to generate such a resource has been carried out in the mouse by Yoshihide Hayashizaki, Yasushi Okazaki, Piero Carninci, and their collaborators and colleagues at the Riken Genomic Sciences Center in Tsukuba, Japan. Their efforts to generate a “Mouse Genome Encyclopedia” have involved the development of an innovative cap-trapper method for the construction of full-length libraries, the creation of a new kind of 384-capillary DNA sequencer, and the development of automated systems for DNA template preparation and sequencing reactions. After sequencing more than 1,000,000 expressed sequence tags (ESTs) from these libraries, they chose to completely sequence a set of 21,076 clones that they hoped would prove to be a non-redundant set of full-length clones. They chose to concentrate on the shorter clones in their libraries to develop their techniques, with the goal of selecting and sequencing additional sets of about 20,000 clones. But as they neared completion of the sequencing phase, they realized that they required additional expertise to obtain a thorough functional annotation of the set. Drawing on the example of the “Drosophila jamboree”, Hayashizaki and colleagues organized the Functional Annotation of Mouse (FANTOM) Meeting to gather experts in bioinformatics and biology to collaboratively annotate the clone set, with the promise of shared authorship as enticement.
From Monday 28 August through Friday 8 September, a diverse group of nearly 50 people from around the world gathered at Riken in Tsukuba, an hour's bus ride from Tokyo, to analyse, characterize and annotate the first installment of the Mouse Encyclopedia. From the first day, we were faced with significant challenges in organizing the data, integrating the results of the various searches that participants had carried out before arriving, and developing a common language (a controlled vocabulary) so that we could later make sense of our annotation. But this was something we all expected; no one has ever annotated such a large collection of completely sequenced cDNAs. While the hours were too long and jet lag rolled across the room in predictable waves, everyone was committed to the task. As we began to wade through the data, we realized that we would have to develop a strategy to prioritize the sequences for annotation so that we could make the best use of our time; after all, each of us had to annotate an average of more than 400 cDNAs (21,076 sequences/50 annotators >400!). After eliminating redundancy and passing cDNAs that were clearly derived from known genes to a team from the Mouse Genome Database, we classified the remaining sequences according to the evidence we had for annotation; we began with the easy-to-annotate sequences with large open reading frames, progressed to the more challenging clones, and finished with the ESTs that had no clear open reading frame, sequence homology, or previously classified domains. This strategy allowed us to add searches and analyses that were necessary for the difficult-to-annotate clones while Hidemasa Bono and his team laboured days and nights to ensure that the FANTOM annotation software kept pace with our requests.