/***************************************************************************** # Copyright (C) 1993-1996 by Phil Green. # All rights reserved. # # This software is part of a beta-test version of the swat/cross_match/phrap # package. It should not be redistributed or # used for any commercial purpose, including commercially funded # sequencing, without written permission from the author and the # University of Washington. # # This software is provided ``AS IS'' and any express or implied # warranties, including, but not limited to, the implied warranties of # merchantability and fitness for a particular purpose, are disclaimed. # In no event shall the authors or the University of Washington be # liable for any direct, indirect, incidental, special, exemplary, or # consequential damages (including, but not limited to, procurement of # substitute goods or services; loss of use, data, or profits; or # business interruption) however caused and on any theory of liability, # whether in contract, strict liability, or tort (including negligence # or otherwise) arising in any way out of the use of this software, even # if advised of the possibility of such damage. # #*****************************************************************************/ SWAT, CROSS_MATCH Swat and cross_match are programs for rapid protein and nucleic acid sequence comparison and database search, based on an efficient implementation of the Smith-Waterman-Gotoh algorithm (Smith & Waterman 1981, Gotoh 1982). By revising the recursion relations slightly and making efficient use of word-packing, we have been able to significantly reduce the number of machine instructions executed per Smith-Waterman matrix cell, resulting in a speed enhancement of 2-3 fold over the previous best implementation (Pearson and Miller, 1992). Swat is about 1 / 10 as fast as BLAST, making it quite adequate for performing database searches on a workstation. Statistical significance of the hits is evaluated based on the empirical score distribution for the search, taking into account dependence of the score variance and mean on the length of the database sequence. Cross_match uses the same algorithm as swat, but in addition allows the comparison of a pair of sequences to be constrained to bands of the Smith-Waterman matrix that surround one or more matching words in the sequences. This substantially increases speed for large-scale nucleotide sequence comparisons without significantly compromising sensitivity. We have found cross_match useful for the following tasks: comparing a set of reads to a set of vector sequences to produce vector-masked versions of the reads; comparing contig sequences found by two alternative assembly procedures (e.g. phrap and xbap) to each other; and comparing assembled contigs to the final edited cosmid sequence, in retrospective studies of the accuracy of fully automated assembly. We also now routinely use cross_match to find repeats in finished cosmid sequences and mask them out prior to database searches. Swat and cross_match can also be used to search a profile against a database. PHRAP Phrap is a program for shotgun sequence assembly. Key features include: --Use of data quality information, both direct (from phred trace analysis) and indirect (from pairwise read comparisons), to delineate the likely accurate base calls in each read; this helps discriminate repeats, permits use of the full reads in assembly, and allows a highly accurate consensus sequence to be generated. A probability of error is computed for each consensus sequence position, which can be used to focus human editing on particular regions, to automate decision-making about where additional data are needed, and to provide users of the final sequence with information about local variations in quality. --The full length of each read (or more precisely, as much as is accurate enough to align against other reads) is used in the assembly. Other programs typically require that reads be trimmed fairly conservatively to remove all low-quality data, since the algorithms are such that inaccurate data would tend to cause correct overlaps to be rejected; this necessitates excluding a significant amount of useful (if slightly less accurate) sequence from the initial assembly, which then must be brought in manually later to make additional contig joins and help resolve ambiguities. Our approach minimizes or eliminates the need to make manual joins, and perhaps more importantly, by using all available data it increases the chance of correctly distinguishing repeats during assembly. Use of full reads does complicate analysis somewhat due to the high error rate towards the ends of the reads and the fact that read chimerism is more likely to occur there; thus it is essential to use methods that allow for this. In particular, we use local (Smith-Waterman) alignments for the pairwise read comparisons, rather than global (Needleman-Wunsch). -- There is no masking of repeats from known families, or other use of biological assumptions about the sequence. -- Aberrant data, including likely deletion and chimeric reads, and unremoved sequencing vector, are automatically identified, and treated specially to avoid causing problems with assembly. -- The contig sequence is constructed as a mosaic of the highest quality parts of the reads, rather than as a ``consensus''. This avoids both the complex algorithmic issues associated with multiple alignment methods, and problems that sometimes occur with these methods causing the consensus to be less accurate than individual reads at some positions. The sequence produced by phrap is in fact quite accurate, with less than 1 error per 10 kb in typical datasets. -- Information to provide indications of accuracy and uniqueness of assembly is generated. Possible sites where misassembly may have occurred, or where additional data may be required are flagged. -- Large datasets can be assembled efficiently. On a Dec Alpha 3000/900 workstation (comparable in speed to a 200 Mhz Pentium PC), cosmid datasets typically require an average of 2 minutes, and human BACs 10 min. to several hours. A 36,000 read bacterial genome (assembled at Genome Therapeutics, Inc.) required < 2 hrs. The main factor limiting speed of assembly is the number of repeats in the genomic target sequence, rather than the total number of reads, since repeats disproportionately increase the number of Smith-Waterman comparisons that need to be performed.