Our broad research goal is to help provide the computational methods necessary to achieve a complete, quantitative understanding of how cells function at the molecular level. Such an understanding will require three things: a "parts list", or catalogue of all cellular molecules; a "wiring diagram" that specifies the interactions that occur between those molecules; and, finally, quantitative models of systems of interacting molecules. The advent of large-scale genome sequencing is bringing completion of the parts list within view, although substantial work remains to be done. Most current research in molecular biology is directed at filling in the wiring diagram (which may be taken as specifying molecular "function"). The modeling of molecular systems, still in its infancy, will become increasingly important as the wiring diagram approaches completion and as our ability to accurately quantitate cellular molecules improves.

Most of our research has been directed at constructing computational tools to support the acquisition of the parts list, in the form of a gene-annotated genome sequence. Common themes in this work include the development of appropriate probabilistic models for the type of data to be analyzed, the construction of efficient algorithms to carry out the probabilistic calculations, and the implementation of the algorithms in software that is then made widely available to the scientific user community. Probabilistic methods in particular have proven crucial in all of these areas, a reflection of the inherently probabilistic nature of biological processes such as meiotic recombination and sequence evolution, as well as of laboratory data.
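As a schematic illustration of this pairing of probabilistic models with efficient algorithms (the two states and all transition and emission probabilities below are invented for illustration, not taken from our actual models), consider a toy hidden Markov model that labels each base of a DNA sequence as coding or noncoding, decoded with the Viterbi dynamic-programming algorithm:

```python
import math

# Hypothetical two-state HMM: "coding" regions are modeled as slightly
# GC-rich, "noncoding" as slightly AT-rich. All numbers are illustrative.
states = ["coding", "noncoding"]
start = {"coding": 0.5, "noncoding": 0.5}
trans = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
emit = {"coding":    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def viterbi(seq):
    """Most probable state path, computed in log space for stability."""
    V = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    back = []
    for x in seq[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for s at this position.
            prev, score = max(
                ((p, V[-1][p] + math.log(trans[p][s])) for p in states),
                key=lambda t: t[1])
            row[s] = score + math.log(emit[s][x])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

labels = viterbi("GGCCGGAATTAT")  # one label per base
```

The dynamic program runs in time linear in the sequence length (for a fixed number of states), which is what makes such probabilistic calculations practical at genome scale.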

Reliable identification of the protein parts list from a genome sequence remains an unsolved problem, which we are attacking on several fronts, including improved probabilistic modeling of the genomic sequence, comparison to evolutionarily related sequences, and more effective utilization of available experimental data. Some of our work draws on our experience with sequence data processing in order to assemble expressed sequence tags (ESTs), partial gene sequences that have been generated in a number of sequencing laboratories and submitted in unassembled form to the public databases. We recently used our EST assemblies to conclude that the number of genes in the human genome is substantially lower (about 35,000) than had been previously thought; we are applying the assemblies to make more reliable inferences of protein coding sequences, to catalogue alternative splicing (which can result in multiple proteins being encoded by the same gene), and to discover polymorphisms (differences in sequence between individuals). We are also beginning to undertake systematic experimental tests of our gene predictions.
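To sketch the flavor of EST assembly (a deliberately simplified toy, not our actual assembler, which must also handle sequencing errors, reverse complements, and base-quality values), overlapping fragments can be merged greedily by exact suffix-prefix overlap:

```python
def overlap(a: str, b: str, min_len: int = 5) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`
    (at least `min_len` bases, else 0)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_assemble(ests: list[str], min_len: int = 5) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    seqs = list(ests)
    while True:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if i is None:  # no remaining overlaps: done
            return seqs
        merged = seqs[i] + seqs[j][k:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (i, j)]
        seqs.append(merged)

# Three hypothetical overlapping ESTs from the same transcript.
ests = ["ACGTACGTGG", "CGTGGTTCAA", "TTCAACGGA"]
contigs = greedy_assemble(ests, min_len=4)  # → ["ACGTACGTGGTTCAACGGA"]
```

The merged contigs are longer reconstructions of the underlying transcripts, which is what makes assembled ESTs more informative than the raw fragments for inferring coding sequences and splicing variants.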

The availability of sequence data from evolutionarily related organisms provides a powerful tool for identifying genes and illuminating their function. Through comparisons of yeast, human and nematode sequences we observed a number of years ago that a substantial fraction of genes (approaching 50%) appeared to be unique to an organism and its close relatives, an observation that has been repeatedly borne out with each new genome sequence that has been obtained. Most likely many of the "unique" genes do in fact have evolutionary homologues in more distant organisms but are simply evolving too quickly for the relationship to be readily detected, and we have developed methods for more sensitive detection of evolutionarily conserved sequence features. Evolutionary data should in principle help to elucidate the "wiring diagram" of molecular interactions, since it is primarily molecular interactions (including the self-interactions which determine tertiary structure) that constrain the allowed residue substitutions. We are currently working on improved probabilistic models of sequence evolution in the hope that these will allow such functional inferences.
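As a minimal example of a probabilistic model of sequence evolution, the classical Jukes-Cantor model (far simpler than the models under development here, but illustrative of the general approach) corrects the observed fraction of differing sites for multiple substitutions having occurred at the same site:

```python
import math

def jukes_cantor_distance(seq1: str, seq2: str) -> float:
    """Estimated substitutions per site under the Jukes-Cantor model.

    p is the observed fraction of differing sites; the correction
    -(3/4) ln(1 - 4p/3) accounts for repeated substitutions at a site,
    which make the raw difference count an underestimate.
    """
    assert len(seq1) == len(seq2), "sequences must be aligned"
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    p = diffs / len(seq1)
    if p >= 0.75:
        raise ValueError("sequences too diverged for a JC estimate")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)

# 1 difference in 10 aligned sites: p = 0.1, distance ≈ 0.107
d = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAT")
```

Richer models in this spirit allow substitution rates to vary by site and by residue context, which is where their potential for functional inference lies: sites constrained by molecular interactions evolve measurably differently from unconstrained ones.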