Sequence Similarity Searching: Why and How
Peter Hraber, 31 September 2000
Motives
- Who here has run a BLAST search (or used any sequence similarity algorithm)?
- Who here has run a BLAST search using non-default parameters?
- Who here has published a BLAST search result, citing Altschul et al. (1990) or (1997)?
- Who here has read and understood either paper, or how the algorithm works?
- I intend to demystify the black box of sequence similarity searching, so you can use it as a tool to answer research questions in an informed manner.
Why?
- Given a newly-obtained sequence, what is its cellular role or function?
- How might we go about answering this question?
- One way is to fish in the ever-growing pool of sequences whose function
is known, using the new sequence as bait (the "query").
- Some other uses:
- To identify and remove vector contaminants
(e.g., VecScreen at NCBI)
- For applications in comparative genomics (Braun et al., 2000)
- To identify syntenic regions among genomes (sensitivity drops at large evolutionary distances)
- Microcolinearity and its exceptions between sorghum and maize. (Tikhonov et al. 1999; Bennetzen 2000)
- Comparative genomics of the eukaryotes. (Rubin et al., 2000)
How?
- Sequence similarity searching algorithms have resulted from a
dialectic process of iterative improvement and refinement.
- Global versus local alignments
- Dot plots
- Algorithms
- Smith & Waterman (1981)
- FASTA (Pearson & Lipman, 1988)
- The BLAST (Altschul et al. 1990 & 1997) family of algorithms
(from NCBI)
| program | query | subject | note |
| blastn | nt | nt | -- |
| blastp | aa | aa | -- |
| blastx | nt | aa | query is translated |
| tblastn | aa | nt | subject is translated |
| tblastx | nt | nt | query and subject are both translated; compared as aa |
| blastpgp | aa | aa | position-specific scoring matrix, rebuilt each iteration (PSI-BLAST) |
- d2: based on tuple frequencies, not pairwise alignments (Hide et al., 1994).
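To make the local-alignment idea concrete, here is a minimal sketch of Smith-Waterman-style dynamic programming. It assumes a simple match/mismatch score and a linear gap penalty; the numeric values and the function name `smith_waterman` are illustrative choices, not taken from the 1981 paper, which allows more general gap costs. Resetting negative cells to zero is what makes the alignment local rather than global.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scores are illustrative assumptions: +2 match, -1 mismatch, -2 per gap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score falls to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTTACGTTT", "GGACGTGG")
print(score, top, bottom)
```

The table costs time and memory proportional to the product of the two sequence lengths, which is why heuristic programs such as FASTA and BLAST are used for database-scale searches.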
- Scoring matrices
- Evolutionary model schemes
- Chemical similarity models
- Observed substitution schemes (log-odds scores)
- PAM matrices (Dayhoff): "Point Accepted Mutation"; the number gives accepted point mutations per 100 amino acids; derived from relatively few global alignments of closely related proteins.
- BLOSUM matrices (Henikoff & Henikoff): "BLOcks SUbstitution Matrix"; the number gives the percent-identity threshold used to cluster sequences; derived from many local alignments.
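A log-odds score compares how often two residues are observed aligned with how often they would pair by chance. The sketch below assumes `q_ij` is the observed frequency of the ordered pair (i, j) and `p_i`, `p_j` are background frequencies; the frequencies in the usage lines are made-up numbers for illustration only, and the scale factor of 2 (half-bit units) is an assumption about the reporting convention.

```python
import math


def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """Integer log-odds score: scale * log2(observed / expected by chance).

    q_ij: observed frequency of residue i aligned with residue j (assumed ordered pair)
    p_i, p_j: background frequencies of the two residues
    scale: 2.0 reports the score in half-bit units (an assumed convention)
    """
    expected = p_i * p_j
    return round(scale * math.log2(q_ij / expected))


# Hypothetical frequencies, chosen only to show the sign of the score.
print(log_odds_score(0.02, 0.1, 0.1))   # pair seen more often than chance: positive
print(log_odds_score(0.005, 0.1, 0.1))  # pair seen less often than chance: negative
```

Positive scores mark substitutions seen more often than chance, negative scores mark those seen less often, and a score of zero means the pair occurs at about the chance rate.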
- What constitutes a significant result?
- Sequence similarity versus sequence homology
- False positive and false negative rates
- Karlin-Altschul statistics
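As a rough illustration of Karlin-Altschul statistics, the expected number of chance hits scoring at least S is E = K m n exp(-lambda S), where m and n are the query and database lengths. The sketch below is a simplification: it ignores edge-effect corrections, and the `lam` and `K` values in the usage lines are hypothetical placeholders, since the real values depend on the scoring matrix and gap penalties.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalize a raw score to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_value(raw_score, m, n, lam, K):
    """Expected number of chance hits with score >= raw_score.

    m: query length, n: total database length (effective lengths in practice).
    """
    return K * m * n * math.exp(-lam * raw_score)


# Hypothetical parameters, for illustration only.
lam, K = 0.3, 0.1
m, n = 300, 1_000_000
for raw in (40, 60, 80):
    print(raw, bit_score(raw, lam, K), expect_value(raw, m, n, lam, K))
```

Two things follow from the formula: E shrinks exponentially as the score rises, and it grows in proportion to database size, so the same alignment looks less significant against a larger database.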
Caveats
- Similarity does not imply homology! Homology is not directly observable.
- Database annotation errors are easily propagated by inferring
homology from similarity and assigning function based on the FASTA
description line of a matching sequence. (Boguski, 1999)
- Distinguish putative from experimentally supported evidence of function.
- Choose parameters wisely: default settings do not automatically
yield the best results.
- Choose scoring matrices in an informed manner, trying alternatives.
Exercises
- NCBI BLAST Tutorial
- BLASTX versus BLASTN searches: which is more sensitive?
- BLASTP parameter sensitivity: what happens to the results when you change the scoring matrix?
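For the BLASTX versus BLASTN exercise, it may help to see what "query is translated" involves. The sketch below translates a nucleotide sequence in all six reading frames using the standard genetic code. The function names `translate` and `six_frames` are illustrative, and this is not how BLAST itself is implemented; codons containing characters other than A, C, G, or T are written as X.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from the first base; a trailing partial codon is dropped."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations for frames +1..+3 and -1..-3 (reverse complement)."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for name, protein in sorted(six_frames("ATGGCCATTGTAATGGGCCGC").items()):
    print(name, protein)
```

Because several codons encode the same amino acid, two nucleotide sequences can differ at many positions and still translate to similar proteins, which is worth keeping in mind when comparing the two searches.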
Readings
References
- Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. 1990.
Basic local alignment search tool. J. Mol. Biol. 215:403-410.
- Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z.,
Miller, W. & Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a
new generation of protein database search programs.
Nucleic Acids Res. 25:3389-3402.
- Bennetzen, J.L. 2000. Comparative sequence analysis of plant nuclear genomes: microcolinearity and its many exceptions. The Plant Cell 12:1021-1029.
- Boguski, M.S. 1999. Biosequence exegesis. Science 286:453-455.
- Braun, E.L., Halpern, A.L., Nelson, M.A. & Natvig, D.O. 2000.
Large-scale comparison of fungal sequence information: mechanisms of
innovation in Neurospora crassa and gene loss in Saccharomyces
cerevisiae. Genome Res. 10(4):416-430.
- Hide, W., Burke, J. & Davison, D.B. 1994. Biological evaluation of d2,
an algorithm for high-performance sequence comparison.
J. Comput. Biol. 1(3):199-215.
- Pearson, W.R. & Lipman, D.J. 1988. Improved tools for biological
sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448.
- Rubin, G.M., et al. 2000. Comparative genomics of the eukaryotes. Science 287:2204-2215.
- Sansom, C. 2000.
Database searching with DNA and protein sequences: an introduction.
Briefings in Bioinformatics 1(1):22-32.
- Smith, T.F. & Waterman, M.S. 1981.
Identification of common molecular subsequences.
J. Mol. Biol. 147:195-197.
- Tikhonov, A.P., et al. 1999. Colinearity and its exceptions in orthologous adh regions of maize and sorghum. Proc. Natl. Acad. Sci. USA 96:7409-7414.