EST data and assembly. We obtained chromatograms for 1,043,599 human ESTs (essentially all those available prior to Aug. 5, 1999), from 168 cDNA libraries, which are available from the Washington University Genome Sequencing Center's Merck and CGAP archives. Chromatograms lacking a corresponding dbEST1 entry were discarded. Phred 2 , 3 was used to derive base calls and quality values (log-transformed error probabilities).
Cloning vector sequences were obtained from ref. 4; from the IMAGE, dbEST, NCI/CGAP WWW sites; and by mini-assembly of the first 125 bases of a sample of 500 reads from each of the EST libraries. The portions of each EST which derive from the cloning vector were then identified by sequence comparison using cross_match (see below), and masked.
The high-quality segment of each EST was determined as follows: The i-th base of the EST is assigned a score of si = .05 - ri, where ri is the phred error probability of the base. The score of the read segment spanning bases m to n is then defined as the sum of si from i = m to n; the segment having the highest score is easily determined via a simple dynamic programming algorithm. Roughly speaking, this corresponds to the portion of the read for which the estimated base-calling accuracy is 95% or higher. Reads having high-quality segments shorter than 30 bases were eliminated from further analysis. The average length of the high-quality segments in the remaining reads is 350 bases.
Following elimination of low-quality ESTs and those not in dbEST there remained 992,353 ESTs. Due to their large numbers and uneven distribution among genes we carried out the assembly in a stepwise fashion. A random subset of 300,000 of the ESTs was first selected, and assembled using phrap. (For assembly, the full vector-masked sequence of each read was used, and information regarding EST orientation was ignored). The entire set of 992,353 EST reads was then compared to the contigs using cross_match (see below). Contigs with 20 or more matching reads were identified, and all reads not matching one of these contigs were assembled, in a second pass.
The entire set of reads was then compared to the combined set of contigs from the two assembly passes. To be accepted, matches were required to have no more than five discrepancies in the high-quality portion of a read (not counting discrepancies unconfirmed by any other read), and to have no more than ten bases at each end of the read's high-quality segment which failed to align. The set of all reads not matching any contig by these criteria was then assembled, in a third pass. The entire set of reads was then compared to the combined set of contigs from all three passes, and acceptable matches identified.
We then extracted the portion of each contig sequence that was "confirmed" in the sense that it matched (by the above criteria) the high-quality regions of reads from two or more distinct clones. Because of small but non-trivial rates of lane-tracking errors which occasionally assign the wrong clone identifier to a read 5 , 6, and well-to-well cross-contamination within and between microtiter plates, we considered two reads to be from different clones only if they came from distinct microtiter plates (indicated by the first four characters of the read name) and had names differing by at least two characters.
Approximately 5 to 10% of EST reads are chimeric, due to errors in cloning or lane-tracking 5 , 6. Inclusion of these in the assembly may create chimeric contigs which spuriously join segments from two different genes. After each assembly pass we identified contigs which were potentially chimeric by virtue of their having internal regions which matched only one read (or several reads all derived from the same microtiter plate), flanked by regions matching multiple reads. Such contigs were split into their confirmed pieces, and reads spanning the internal region were flagged as likely chimeras and discarded from further consideration. Additional likely chimeric reads were identified by virtue of being the unique read matching two different contigs. In all, approximately 35,000 reads were identified as probable chimeras and eliminated.
Following assembly, 27 contigs found to match E. coli or lambda, sequence, and 144 contigs consisting primarily (>= 80%) of reads from 103 plates known to include contaminating mouse ESTs, were removed as likely contaminants.
We further eliminated contigs whose confirmed length was less than 150 bases. There remained 62,064 contig sequences, which were used in the comparisons below. Approximately 795,000 of the ESTs have alignments to these contigs meeting the criteria above; the remaining 197,000 ESTs consist primarily of unconfirmed singletons (with no matches to other reads), chimeras, contaminants, unconfirmed splice variants, or other artifacts.
43,278 of the contigs were 3' contigs (80% or more of the reads supported a consistent orientation to the contig, and at least 25% of the reads, and a minimum of two, were derived from the putative 3' end of the cDNA clones).
mRNA clustering Human entries of locus type mRNA, RNA, and DNA, and containing annotated coding sequence in a single contiguous base range were extracted from Genbank Release 115.0 (Dec. 15, 1999). Entries with the `pseudo' qualifier or with definitions containing the words `mutant', `mutation', `variant', and `variation', or for which start or stop codon placement was inconsistent with annotation, were excluded. A total of 21,031 sequences meeting these criteria were identified. These were compared to each other using cross_match (see below), and then clustered first into "reference sequence groups" and then genes.
To be assigned to the same reference sequence group sequences were required to be 98% identical over at least 90% of their length. There were 13,212 reference sequence groups. To be assigned to the same gene, allowing for different alternatively spliced forms, sequences were required to be at least 98% identical over a 150 base or longer region that overlapped the coding sequence of the gene and included at least 20% of the untranslated region of the gene. Manual inspection of a number of assignments suggest that these criteria have non-zero, but small (< 5%) false positive and false negative rates.
There were 9,465 gene clusters produced, many of which however contained only partial sequence. For the sequence comparisons we only considered genes which included at least one mRNA with an apparent full-length coding sequence at least 100 bases long, and at least 20 bases of 3' untranslated sequence. There were 7,662 genes meeting these criteria.
Chromosome 22 sequence. The chromosome 22 sequence was taken from the file Chr_22_analysis_version_22-10-1999.fa, downloaded from the Sanger Centre. The estimated number of genes on chromosome 22 was based on numbers given in ref. 7, as follows: 545 genes were identified through intensive annotation efforts. (We do not include pseudogenes, which are irrelevant for the present purpose). An additional 325 putative genes were predicted by the program Genscan 8, of which 100 are estimated to be real and the remainder false positives. The false negative rate is reported to be < 5%. We therefore take 545 + 100 + .05 × 700 = 680 to be the estimated number of genes on this chromosome. The authors suggest that the number of CpG islands they detect may point to a higher gene number, however a higher number would conflict with the reported false positive and false negative rates of Genscan. We consider it more likely that the excess CpG island count is attributable to the fact that such islands are often associated with repetitive elements, pseudogenes, and RNA genes 9 and/or may reflect the relatively high G+C content of this chromosome as compared to the genome average.
Sequence comparisons. Comparisons between sequences were performed using our program cross_match, which implements a banded Smith-Waterman algorithm10. Cross_match finds perfectly matching sequence segments of a specified minimum length (15 for the comparisons to repbase, and 30 for the comparisons between the main sequence sets) between pairs of sequences, and then searches a "band" of width 15 centered on the diagonal of the Smith-Waterman matrix defined by the word match to find the highest scoring gapped alignment. We used relatively sensitive alignment scoring parameters of +1 for a match, -2 for a substitution mismatch, -4 for gap opening, and -3 for gap extension, but imposed stringent additional requirements (given below) regarding % identity and the proportion of the query sequence represented in the alignment to filter out alignments not likely to be true matches.
All sequences were initially compared to the database repbase of mammalian repeats (Smit, A., Jurka J., Kapitonov, V., Niak, A.), using relatively sensitive parameter settings. Repeats were tagged, rather than masked. Maximal sensitivity was not required, since the stringent criteria applied below would eliminate matches between all but the most evolutionarily recent repeats. In all subsequent comparisons of sequences, matches lying essentially entirely within a tagged repeat (allowing at most 20 bases past either end of the tagged region) were discarded.
For comparisons of the contig sequences to chromosome 22 and to the mRNAs, we discarded all matches having a greater than 2% discrepancy rate (i.e. the aligned regions were required to be at least 98% identical). For the chromosome 22 comparison, we additionally required matches to include at least 90% of the contig length. For the mRNA comparison this was relaxed to a minimum of 100 aligned bases since some of the mRNAs are incomplete, and/or represent different alternatively spliced forms for the same gene.
There are likely to be some false positives and false negatives in these comparisons. False positive matches would occur with closely paralogous genes, less than 2% different at the DNA level, created by evolutionarily recent duplications (within the last 10 million years). While there are certainly examples of these 7, they are not currently believed to involve a significant percentage of genes. A likely more substantial source of error is false negatives (missed matches) arising from incomplete or inaccurate sequences which cause the alignment to fail to meet our stringency criteria. The effect of these would be to inflate the gene estimate above its true value.
Technical comments on previous gene number estimates. Using a method similar to ours in which EST clusters were compared to mRNAs, Fields et al. 11 arrived at an estimate of 60,00070,000 genes. A major difference with our calculation was that they included clusters consisting of a single unconfirmed EST. We believe this may spuriously inflate the estimate by a significant amount due to the presence of contaminant, artifactual, or inaccurate ESTs. The cDNA library normalization process tends to enrich for contaminants and aberrant clones, and even if they still represent only a tiny percentage of the total EST set they can constitute a large fraction of the clusters. For that reason we used conservative criteria for deriving contig sequences. Although many of the unconfirmed ESTs are undoubtedly real, eliminating them does not affect the validity of our calculation, because the set of EST contigs is not assumed or required to be a random sample of all genes and part of it can therefore be removed without biassing the estimate.
We believe that even some of what we consider "confirmed" EST contigs are cases in which a single contaminant or artifactual clone spuriously confirms itself because it has been duplicated due to library amplification or well-to-well or plate-to-plate cross-contamination. Such cases may account in part for recent unpublished estimates of a very high gene number based on counting EST contigs assembled from large private databases 12. A low rate of such artifacts still produces large absolute numbers when the data set is very large.
Antequera and Bird13 estimated that there are 45,000 unmethylated CpG islands in the human genome, which, under the assumptions that 56% of genes have an island, and that all islands are associated with genes, extrapolates to an estimate of 80,000 genes. Several aspects of their calculation are open to question and may have inflated this estimate: (i) In converting from tritiated counts to genome percentage (Table 1 of ref. 13), it was assumed that DNA fractions 1 and 2 have the same overall G+C content as the CpG island fragments in these fractions. However these fractions were known to contain a significant fraction (28.6%) of non-island DNA, which may reasonably be assumed to have a G+C content equal to the genome average (40%) rather than that of island fragments (67.1%). Making this correction reduces the estimated number of islands from 45,000 to 34,200. (ii) Our own analyses (unpublished) of CpG islands suggest that the estimated length of an island is somewhat sensitive to the precise definition that is applied. Using a definition more sensitive to CpG dinucleotide frequency (which should be the best indicator of methylation status) than to C+G content, we obtain an average island length of about 1.3 kb rather than 1 kb. Use of this value would reduce the island count further, to 26,300, and the estimated number of genes to about 47,000. (iii) There may be islands not associated with genes. For example, many transposable elements and pseudogenes are known to have CpG islands 9, some of which may be unmethylated; and as some tissue-specific genes have one or more islands associated with cryptic internal promoters that do not produce translatable transcripts 9, it seems possible that other islands are not associated with any gene.(iv) A further possibility is that the size of the gene-bearing, euchromatic portion of the genome, which must be assumed in the CpG island calculation but is irrelevant in ours, is substantially smaller than has been assumed, as in fact was the case with chromosome 22 ( ref. 7).
1. Boguski, M.S., Lowe, T.M. & Tolstoshev, C.M. dbEST database for "expressed sequence tags". Nature Genet. 4, 332-333 (1993).
2. Ewing, B., Hillier, L., Wendl, M. & Green, P. Basecalling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175-185 (1998).
3. Ewing, B. & Green, P. Basecalling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186-194 (1998).
4. Bonaldo, M., Lennon, G. & Soares, M.B. Normalization and subtraction: two approaches to facilitate gene discovery. Genome Res. 6, 791-806 (1996).
5. Hillier, L. et al. Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 6, 807-828 (1996).
6. Wolfsberg, T.G. & Landsman, D. A comparison of expressed sequence tags (ESTs) to human genomic sequences. Nucleic Acids Res. 25, 1626-1632 (1997).
7. Dunham, I. et al. The DNA sequence of human chromosome 22. Nature 402, 489-495 (1999).
8. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94 (1997).
9. Larsen, F., Gundersen, G., Lopez, R. & Prydz, H. CpG islands as gene markers in the human genome. Genomics 13, 1095-1107 (1992).
10. Smith, T.F. & Waterman, M.S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197 (1981).
11. Fields, C., Adams, M.D., White, O. & Venter, J.C. How many genes in the h uman genome? Nature Genet. 7, 345-346 (1994).
12. Dickson, D. Gene estimate rises as US and UK discuss freedom of access. Nature 401, 311 (1999).
13. Antequera, F. & Bird, A. Number of CpG islands and genes in human and mouse. Proc. Natl. Acad. Sci. USA 90, 11995-11999 (1993).