general.doc

/*****************************************************************************
#   Copyright (C) 1993-1996 by Phil Green.                          
#   All rights reserved.                           
#                                                                           
#   This software is part of a beta-test version of the swat/cross_match/phrap 
#   package.  It should not be redistributed or
#   used for any commercial purpose, including commercially funded
#   sequencing, without written permission from the author and the
#   University of Washington.
#   
#   This software is provided ``AS IS'' and any express or implied
#   warranties, including, but not limited to, the implied warranties of
#   merchantability and fitness for a particular purpose, are disclaimed.
#   In no event shall the authors or the University of Washington be
#   liable for any direct, indirect, incidental, special, exemplary, or
#   consequential damages (including, but not limited to, procurement of
#   substitute goods or services; loss of use, data, or profits; or
#   business interruption) however caused and on any theory of liability,
#   whether in contract, strict liability, or tort (including negligence
#   or otherwise) arising in any way out of the use of this software, even
#   if advised of the possibility of such damage.
#                                                                         
#*****************************************************************************/

SWAT, CROSS_MATCH
 Swat and cross_match are programs for rapid protein and nucleic acid
sequence comparison and database search, based on an efficient
implementation of the Smith-Waterman-Gotoh algorithm (Smith & Waterman
1981, Gotoh 1982).  By revising the recursion relations slightly and
making efficient use of word-packing, we have been able to
significantly reduce the number of machine instructions executed per
Smith-Waterman matrix cell, resulting in a speed enhancement of 2-3
fold over the previous best implementation (Pearson and Miller, 1992).
Swat is about 1 / 10 as fast as BLAST, making it quite adequate for
performing database searches on a workstation.  Statistical
significance of the hits is evaluated based on the empirical score
distribution for the search, taking into account dependence of the
score variance and mean on the length of the database sequence.

 Cross_match uses the same algorithm as swat, but in addition allows
the comparison of a pair of sequences to be constrained to bands of
the Smith-Waterman matrix that surround one or more matching words in
the sequences. This substantially increases speed for large-scale
nucleotide sequence comparisons without significantly compromising
sensitivity.  We have found cross_match useful for the following
tasks: comparing a set of reads to a set of vector sequences to
produce vector-masked versions of the reads; comparing contig
sequences found by two alternative assembly procedures (e.g.  phrap
and xbap) to each other; and comparing assembled contigs to the final
edited cosmid sequence, in retrospective studies of the accuracy of
fully automated assembly.  We also now routinely use cross_match to
find repeats in finished cosmid sequences and mask them out prior to
database searches.

 Swat and cross_match can also be used to search a profile against a
database.

PHRAP
Phrap is a program for shotgun sequence assembly.  Key features
include:
--Use of data quality information, both direct (from phred trace
analysis) and indirect (from pairwise read comparisons), to delineate
the likely accurate base calls in each read; this helps discriminate
repeats, permits use of the full reads in assembly, and allows a
highly accurate consensus sequence to be generated.  A probability of
error is computed for each consensus sequence position, which can be
used to focus human editing on particular regions, to automate
decision-making about where additional data are needed, and to provide
users of the final sequence with information about local variations in
quality.

--The full length of each read (or more precisely, as much as is
accurate enough to align against other reads) is used in the assembly.
Other programs typically require that reads be trimmed fairly
conservatively to remove all low-quality data, since the algorithms
are such that inaccurate data would tend to cause correct overlaps to
be rejected; this necessitates excluding a significant amount of
useful (if slightly less accurate) sequence from the initial assembly,
which then must be brought in manually later to make additional contig
joins and help resolve ambiguities. Our approach minimizes or
eliminates the need to make manual joins, and perhaps more
importantly, by using all available data it increases the chance of
correctly distinguishing repeats during assembly.  Use of full reads
does complicate analysis somewhat due to the high error rate towards
the ends of the reads and the fact that read chimerism is more likely
to occur there; thus it is essential to use methods that allow for
this. In particular, we use local (Smith-Waterman) alignments for the
pairwise read comparisons, rather than global (Needleman-Wunsch).

-- There is no masking of repeats from known families, or other use of 
biological assumptions about the sequence.

-- Aberrant data, including likely deletion and chimeric reads, and
unremoved sequencing vector, are automatically identified, and treated
specially to avoid causing problems with assembly.

-- The contig sequence is constructed as a mosaic of the highest
quality parts of the reads, rather than as a ``consensus''.  This
avoids both the complex algorithmic issues associated with multiple
alignment methods, and problems that sometimes occur with these methods
causing the consensus to be less accurate than individual reads
at some positions. The sequence produced by phrap is in fact quite
accurate, with less than 1 error per 10 kb in typical datasets.

-- Information to provide indications of accuracy and uniqueness of
assembly is generated. Possible sites where misassembly may have
occurred, or where additional data may be required are flagged.

-- Large datasets can be assembled efficiently. On a Dec Alpha
3000/900 workstation (comparable in speed to a 200 Mhz Pentium PC),
cosmid datasets typically require an average of 2 minutes, and human
BACs 10 min. to several hours.  A 36,000 read bacterial genome
(assembled at Genome Therapeutics, Inc.) required < 2 hrs.  The main
factor limiting speed of assembly is the number of repeats in the
genomic target sequence, rather than the total number of reads, since
repeats disproportionately increase the number of Smith-Waterman
comparisons that need to be performed.