B. Giraud, Alan Lapedes, Long Liu

Paper #: 98-10-092

A criterion based on conditional probabilities, related to the concept of algorithmic distance, is used to detect correlated mutations at noncontiguous sites on sequences. We apply this criterion to the problem of analyzing correlations between sites in protein sequences, however, the analysis applies generally to networks of interacting sites with discrete states at each site. Elementary models, where explicit results can be derived easily, are introduced. The number of states per site considered ranges from two, illustrating the relation to familiar classical spin systems, to twenty states, suitable for representing amino acids. Numerical simulations show that the criterion remains valid even when the genetic history of the data samples (e.g., protein sequences), as represented by a phylogenetic tree, introduces nonindependence between samples. Statistical fluctuations due to finite sampling are also investigated and do not invalidate the criterion. A subsidiary result is found: the more homogeneous a population, the more easily its average properties can drift from the properties of its ancestor.