Our broad research goal is to help
provide the computational methods necessary to achieve a complete, quantitative
understanding of how cells function at the molecular level. Such an understanding
will require three things: a "parts list", or catalogue of all cellular
molecules; a "wiring diagram" that specifies the interactions that
occur between those molecules; and, finally, quantitative models of systems
of interacting molecules. The advent of large-scale genome sequencing is bringing
the possibility of completing the parts list within view, although substantial
work remains to be done. Most current research in molecular biology is directed
at filling in the wiring diagram (which may be taken as specifying molecular
"function"). The modeling of molecular systems, still in its infancy,
will become increasingly important as the wiring diagram approaches completion
and our ability to quantitate cellular molecules accurately improves.
Most of our research has been directed at constructing computational tools
to support the acquisition of the parts list, in the form of a gene-annotated
genome sequence. Common themes in this work include the development of appropriate
probabilistic models for the type of data to be analyzed, the construction
of efficient algorithms to carry out the probabilistic calculations, and the
implementation of the algorithms in software which is then made widely available
to the scientific user community. Probabilistic methods in particular have
proven to be crucial in all of these areas, a reflection of the inherently
probabilistic nature of such biological processes as meiotic recombination
and sequence evolution, as well as of laboratory data.
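To give a concrete flavor of the probabilistic calculations involved (this is an illustration only, not a description of our software), the sketch below implements the standard forward algorithm for a toy two-state hidden Markov model over DNA; the state labels, parameter values and test sequence are invented for the example.

    # Toy forward algorithm for a two-state HMM over DNA.
    # States, transition/emission probabilities and the test sequence are
    # illustrative only and are not taken from any published model.
    from math import log, exp

    STATES = ("coding", "noncoding")                      # hypothetical labels
    INIT   = {"coding": 0.5, "noncoding": 0.5}
    TRANS  = {"coding":    {"coding": 0.9, "noncoding": 0.1},
              "noncoding": {"coding": 0.1, "noncoding": 0.9}}
    EMIT   = {"coding":    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
              "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

    def log_likelihood(seq):
        """Return log P(seq | model) via the forward recursion, in log space."""
        fwd = {s: log(INIT[s]) + log(EMIT[s][seq[0]]) for s in STATES}
        for sym in seq[1:]:
            new = {}
            for s in STATES:
                # log-sum-exp over predecessor states
                terms = [fwd[p] + log(TRANS[p][s]) for p in STATES]
                m = max(terms)
                new[s] = m + log(sum(exp(t - m) for t in terms)) + log(EMIT[s][sym])
            fwd = new
        m = max(fwd.values())
        return m + log(sum(exp(v - m) for v in fwd.values()))

    print(log_likelihood("ACGTGGGCCCATATAT"))

Efficient recursions of this general kind (dynamic programming over hidden states) are what make the probability calculations tractable on genome-scale data.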
Reliable identification of the protein parts list from a genome sequence remains
an unsolved problem, which we are attacking on several fronts, including improved
probabilistic modeling of the genomic sequence, comparison to evolutionarily
related sequences, and more effective utilization of available experimental
data. Some of our work draws on our experience with sequence data processing
in order to assemble expressed sequence tags (ESTs), partial gene sequences
that have been generated in a number of sequencing laboratories and submitted
in unassembled form to the public databases. We recently used our EST assemblies
to conclude that the number of genes in the human genome is substantially
lower (about 35,000) than had been previously thought; we are applying the
assemblies to make more reliable inferences of protein coding sequences, to
catalogue alternative splicing (which can result in multiple proteins being
encoded by the same gene), and to discover polymorphisms (differences in sequence
between individuals). We are also beginning to undertake systematic experimental
tests of our gene predictions.
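As a purely illustrative sketch of the grouping step that underlies EST assembly, the following code clusters reads that share k-mers; the reads, k-mer length and sharing threshold are invented, and real assembly requires far more careful overlap detection, base-quality handling and consensus construction.

    # Crude single-linkage clustering of ESTs by shared k-mers (illustration only).
    from collections import defaultdict

    def kmers(seq, k=8):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def cluster_ests(reads, k=8, min_shared=3):
        """Group reads sharing at least `min_shared` k-mers (single linkage)."""
        parent = list(range(len(reads)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        def union(i, j):
            parent[find(i)] = find(j)

        # Index reads by k-mer so only reads sharing a k-mer are compared.
        kmer_sets = [kmers(r, k) for r in reads]
        index = defaultdict(list)
        for i, ks in enumerate(kmer_sets):
            for km in ks:
                index[km].append(i)

        for ids in index.values():
            for a in ids:
                for b in ids:
                    if a < b and len(kmer_sets[a] & kmer_sets[b]) >= min_shared:
                        union(a, b)

        clusters = defaultdict(list)
        for i, r in enumerate(reads):
            clusters[find(i)].append(r)
        return list(clusters.values())

    reads = ["ACGTACGTACGTTTGA", "CGTACGTACGTTTGAC", "GGGCCCAAATTTGGGC"]
    print(cluster_ests(reads))   # overlapping first two reads form one cluster

Counting the resulting clusters, after suitable corrections, is the kind of reasoning that underlies estimates of gene number from EST data.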
The availability of sequence data from evolutionarily related organisms provides
a powerful tool for identifying genes and illuminating their function. Through
comparisons of yeast, human and nematode sequences we observed a number of
years ago that a substantial fraction of genes (approaching 50%) appeared
unique to an organism and its close relatives, an observation that has been
repeatedly borne out with each new genome sequence that has been obtained.
Most likely, many of the "unique" genes do in fact have evolutionary
homologues in more distant organisms but are simply evolving too quickly for
the relationship to be readily detected, and we have developed methods for
more sensitive detection of evolutionarily conserved sequence features. Evolutionary
data should in principle help to understand the "wiring diagram"
of molecular interactions, since it is primarily molecular interactions (including
the self-interactions which determine tertiary structure) that constrain the
allowed residue substitutions. We are currently working on improved probabilistic
models of sequence evolution in the hope that these will allow such functional
inferences.
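As a minimal worked example of what a probabilistic model of sequence evolution looks like, the sketch below evaluates the likelihood of a pair of aligned sequences under the classical Jukes-Cantor substitution model; the choice of model, the sequences and the distances are illustrative only and do not represent the improved models under development.

    # Likelihood of two aligned DNA sequences under the Jukes-Cantor (JC69)
    # model, used here purely to illustrate the idea; sequences and branch
    # lengths are invented.
    from math import exp, log

    def jc69_prob(a, b, d):
        """P(observe base b given base a at distance d substitutions/site)."""
        e = exp(-4.0 * d / 3.0)
        return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

    def pairwise_log_likelihood(seq1, seq2, d):
        """Log-likelihood of an ungapped alignment, assuming equal base
        frequencies (0.25) and independence across sites."""
        assert len(seq1) == len(seq2)
        return sum(log(0.25) + log(jc69_prob(a, b, d))
                   for a, b in zip(seq1, seq2))

    s1 = "ACGTACGTAC"
    s2 = "ACGTACCTAT"
    for d in (0.05, 0.1, 0.2, 0.5):
        print(d, round(pairwise_log_likelihood(s1, s2, d), 3))

Richer models of this type, which allow substitution rates to vary across sites and residue types, are what would let evolutionary constraint be read back as information about molecular interactions.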