

Appendix B
Thresholding and Quasispecies Diversity

Should classification aim to emphasize relationships? If so, then one tends to be a lumper. Or should classification reflect the power of evolutionary processes to produce differences? Then one will tend to be a splitter.

Most taxonomists try to strike a balance, but in cases when it is not clear where the balance is, sometimes it is necessary to make a philosophical decision, and not everyone will agree on which philosophy to use.
--P. Regal

Sequence similarity searches compare sequences pairwise, yielding optimal global [81], optimal local [111], or near-optimal local [4,5] alignments. This section compares plant and fungal sequences using the BLAST family of similarity search algorithms [4,5]. The purpose is to evaluate alternative parameter sets and to justify the parameters used to compute quasispecies diversity in Chapters 3 and 4.

Table B.1 summarizes the sequences on which we will focus. These sequences were analyzed in detail in Chapter 4, and represent several gene family members from a plant, Medicago truncatula, and from two species of arbuscular mycorrhizal fungi from the genus Glomus.

Following the procedures described in Chapter 3, I performed two searches with blastn and one search with tblastx [5]. The two blastn searches used different expect value cutoffs, $E<10^{-10}$ and $E<10^{-2}$. The tblastx search, which compares nucleotide query and subject sequences as amino acid translations in all six reading frames, was run with a threshold of $E<10^{-2}$. All other parameters were left at their default values, as provided by the pre-compiled binary executable obtained from ftp.ncbi.nlm.nih.gov/blast/executables.
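The three searches can be written down as command lines. The following is a minimal sketch, assuming the legacy NCBI `formatdb` and `blastall` executables that were distributed at the time; the FASTA file name, the output file names, and the helper `blast_commands` are hypothetical, and the exact invocations used for this appendix are not recorded here. The function only builds the argument lists; it does not run BLAST.

```python
def blast_commands(fasta="sequences.fasta"):
    """Build argument lists for the three all-against-all searches.

    Assumes legacy NCBI tools: formatdb (-i input, -p F for nucleotide)
    and blastall (-p program, -i query, -d database, -e expect, -o output).
    The file names are placeholders, not the ones used in the thesis.
    """
    # (program, expect value cutoff) for each of the three searches
    searches = [("blastn", "1e-10"), ("blastn", "1e-2"), ("tblastx", "1e-2")]

    # Index the sequence set once as a nucleotide database.
    commands = [["formatdb", "-i", fasta, "-p", "F"]]

    for program, expect in searches:
        commands.append([
            "blastall",
            "-p", program,   # search program
            "-i", fasta,     # query set: the same sequences ...
            "-d", fasta,     # ... are also the subject set
            "-e", expect,    # expect value cutoff
            "-o", f"{program}_{expect}.out",
        ])
    return commands


if __name__ == "__main__":
    for cmd in blast_commands():
        print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` on a system where those executables are installed. All other search parameters are left unset, so that the program defaults apply.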

In these searches, both the query and subject set consisted only of those sequences listed in Table B.1. For each match having an expect value smaller (closer to zero) than the threshold, I computed percent identity scores as the ratio of the raw score for matching a subject sequence to the query sequence's self-score, obtained from matching the query sequence with itself. This yields percent identities from zero through 100%. Each sequence always matches itself perfectly, resulting in 100% identity. Partial matches for overlapping fragments have intermediate identity values.
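The percent identity calculation just described can be sketched in a few lines. This is an illustration under stated assumptions, not the code used for the analyses: the tuple layout of a hit, the function name `percent_identities`, and the choice to keep only the best-scoring match per query and subject pair are all assumptions. The self-scores in the usage example are invented; only the raw score of 30 and $E=0.008$ correspond to a match reported later in this appendix.

```python
def percent_identities(hits, e_threshold):
    """Convert BLAST hits to percent identities relative to the query self-score.

    hits: iterable of (query, subject, raw_score, expect) tuples.
    Returns {(query, subject): percent identity} for hits with
    expect < e_threshold. If a pair has several matches, the best is kept
    (an assumption; the text does not say how multiple matches were handled).
    """
    hits = list(hits)

    # Self-score: raw score from matching each query sequence with itself.
    self_score = {}
    for query, subject, score, expect in hits:
        if query == subject:
            self_score[query] = max(score, self_score.get(query, 0))

    identity = {}
    for query, subject, score, expect in hits:
        if expect >= e_threshold or query not in self_score:
            continue  # too weak, or no self-match to normalize by
        pct = 100.0 * score / self_score[query]
        key = (query, subject)
        identity[key] = max(pct, identity.get(key, 0.0))
    return identity


if __name__ == "__main__":
    # Illustrative values only: the self-scores are made up.
    hits = [
        ("m", "m", 1200, 0.0),
        ("r", "r", 2500, 0.0),
        ("m", "r", 30, 0.008),
    ]
    print(percent_identities(hits, 1e-2))
    print(percent_identities(hits, 1e-10))
```

With the looser cutoff the weak match is retained at a low percent identity; with the stricter cutoff it is dropped, and only the self-matches remain.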

Identity matrices summarize identity values for any pair of query and subject sequences. Figures B.1, B.3, and B.5 illustrate identity matrices that resulted from the three searches described above.

In each case, we are interested in knowing how the number of distinct transcripts varies with the degree of stringency required to consider two transcripts the same quasispecies. Percent identity varies continuously, so a percent identity threshold is used as a criterion for lumping two individual transcripts into a common quasispecies.
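One way to make the lumping criterion concrete is to count groups of transcripts that are connected by matches at or above the threshold. The sketch below assumes single-linkage lumping, in which a pair is joined if the match in either direction meets the threshold; this may differ from the linkage used to build the dendrograms in this appendix, so the counts it produces need not agree with those reported here. The function name `count_quasispecies` and the example values are hypothetical.

```python
def count_quasispecies(names, identity, threshold):
    """Count clusters after lumping pairs whose percent identity >= threshold.

    names: iterable of sequence identifiers.
    identity: {(query, subject): percent identity}.
    Uses union-find, which amounts to single-linkage clustering
    (an assumption, not necessarily the method behind the dendrograms).
    """
    names = list(names)
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (query, subject), pct in identity.items():
        if query != subject and pct >= threshold:
            parent[find(query)] = find(subject)

    return len({find(name) for name in names})


if __name__ == "__main__":
    # Made-up identities for four transcripts.
    names = ["a", "b", "c", "d"]
    identity = {("a", "b"): 95.0, ("b", "c"): 60.0, ("c", "d"): 20.0}
    for threshold in (90, 70, 50, 30):
        print(threshold, count_quasispecies(names, identity, threshold))
```

Lowering the threshold can only merge clusters, so under this criterion the count never increases as the threshold is relaxed.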

Dendrograms constructed from each identity matrix appear in Figures B.2, B.4, and B.6. These were made using the hclust method in R [62]. How does quasispecies diversity vary with the identity threshold, using varied search algorithms and parameters?

In the first case, using BLASTN with $E<10^{-10}$ (Figures B.1 and B.2), observed quasispecies diversity at 90%, 70%, 50%, and 30% threshold identity values is 22, 21, 21, and 20, respectively.

This is the case for which results are summarized in Chapters 3 and 4. Let us now consider two simple alternatives.

In the second case, using BLASTN with $E<10^{-2}$ (Figures B.3 and B.4), quasispecies diversity is the same as in the previous case. However, one sequence (GvCHS1) has joined a cluster of other chitin synthases from Glomus, on the basis of a very weak similarity that was excluded under the more stringent expect value threshold. A weak match between this sequence and a plant chitinase (raw score = 30, $E=0.008$) is also observed in the identity matrix, though it does not appear in the dendrogram. The two sequences match perfectly over a span of 14 nt.

Here, the two BLASTN searches resulted in the same observed diversity. It is easy to imagine that other weak matches might occur in large-scale comparisons, and it is unclear whether they would affect the observed diversity. I chose the more conservative of the two parameter choices, $E<10^{-10}$, to minimize spurious clustering based on weak matches.

In the third case, using TBLASTX with $E<10^{-2}$ (Figures B.5 and B.6), we observe many more matches. This indicates that similarity is more readily detected between amino acid translations than between nucleotide sequences. In the dendrogram, clusters have more constituents, and fewer singleton clusters are apparent. All but one of the chitin synthases from G. intraradices and G. versiforme cluster together. The fungal phosphate transporter clusters with the two plant transporters, though the degree of identity is too low for them to be counted as a single quasispecies. Quasispecies diversity ranges from 24 at 90% identity to 16 at 30% identity.

Comparing nucleotide sequences as amino acid translations is more sensitive than comparing them as nucleotides. However, this approach tends to join transcripts that originated from the genomes of different, reproductively isolated species, and treating such transcripts as constituents of the same transcript quasispecies is a dubious procedure.

Because of observations such as those described here, I chose to use the first set of parameters in BLAST comparisons to compute observed diversity of transcript quasispecies. Different proteins evolve at different rates [48,69,70], so I thought it appropriate to report results at varied percent identity thresholds.


Table B.1: Fungal and plant sequences subjected to various thresholding criteria. The ID column gives the symbol by which sequences are identified in identity matrix figures; the LOCUS column gives the symbol by which sequences are identified in dendrograms.
ID $L$ LOCUS ACCESSION GENE NAME
Glomus intraradices
a 1532 GiHB1 AF110198 homeobox protein HB1
b 858 GiMYC2 AF110197 MYC2
c 1453 GiMYC1 AF110196 MYC1
d 610 GiCHS1 L77908 chitin synthase
e 617 GiBCHS1 AF260996 chitin synthase, isolate GiBCHS1
f 614 GiCHS3 AF260993 chitin synthase, isolate GiCHS3
g 617 GiCHS2 AF260986 chitin synthase, isolate GiCHS2
h 617 GiBCHS2 AF260985 chitin synthase, isolate GiBCHS2
i 617 GiVCHS2 AF260983 chitin synthase, isolate GiVCHS2
j 617 GiWCHS2 AF260982 chitin synthase, isolate GiWCHS2
         
Glomus versiforme
k 4116 GvCHS3 AJ009630 chitin synthase, clone Gvchs3
l 481 GvCHS2 AJ009629 chitin synthase, clone Gvchs2
m 638 GvCHS1 AJ009628 chitin synthase, clone Gvchs1
n 1833 GvPT U38650 phosphate transporter
         
Medicago truncatula
o 1867 MtPT2 AF000355 phosphate transporter MtPT2
p 1920 MtPT1 AF000354 phosphate transporter MtPT1
q 954 Mt4 AF055921 Mt4
r 1305 MtCHI1 Y10373 chitinase
s 181 MtCHI08g AF167329 chitinase, clone T130008g
t 265 MtCHI07g AF167328 chitinase, clone T130007g
u 188 MtCHI06g AF167327 chitinase, clone T130006g
v 188 MtCHI05g AF167326 chitinase, clone T130005g
w 191 MtCHI04g AF167325 chitinase, clone T130004g
x 197 MtCHI03g AF167324 chitinase, clone T130003g
y 260 MtCHI02g AF167323 chitinase, clone T130002g
z 245 MtCHI01g AF167322 chitinase, clone T130001g

Figure B.1: Identity matrix using BLASTN and $E<10^{-10}$. Area of a box is proportional to the percent identity.

Figure B.2: Dendrogram using BLASTN and $E<10^{-10}$. Quasispecies diversity at 90%, 70%, 50%, and 30% identity thresholds is 22, 21, 21, and 20, respectively.

Figure B.3: Identity matrix using BLASTN and $E<10^{-2}$. Area of a box is proportional to the percent identity.

Figure B.4: Dendrogram using BLASTN and $E<10^{-2}$. Quasispecies diversity at 90%, 70%, 50%, and 30% identity thresholds is 22, 21, 21, and 20, respectively.

Figure B.5: Identity matrix using TBLASTX and $E<10^{-2}$. Area of a box is proportional to the percent identity.

Figure B.6: Dendrogram using TBLASTX and $E<10^{-2}$. Quasispecies diversity at 90%, 70%, 50%, and 30% identity thresholds is 24, 22, 22, and 16, respectively.


Peter T. Hraber 2001-06-13