Alan Lapedes, Paul Stolorz, Yuan Xia

Paper #: 95-02-014

A comparison of neural network methods and Bayesian statistical methods is presented for predicting the secondary structure of proteins from their primary sequence. The Bayesian method makes the unphysical assumption that the probability of an amino acid occurring at each position in the protein is independent of the amino acids occurring elsewhere. Nevertheless, we find the predictive accuracy of the Bayesian method to be only minimally lower than that of the most sophisticated methods used to date. We present the relationship of neural network methods to Bayesian statistical methods and show that, in principle, neural methods offer considerable power, although apparently this power is not particularly useful for this problem. In the process, we derive a neural formalism in which the output neurons directly represent the conditional probabilities of structure class. The probabilistic formalism allows introduction of a new objective function, the mutual information, which translates the notion of correlation as a measure of predictive accuracy into a useful training measure. Although this new measure achieves an accuracy similar to that of other approaches (which use a mean square error), the accuracy on the training set is significantly, and tantalizingly, higher, even though the number of adjustable parameters remains the same. The mutual information measure predicts a greater fraction of helix and sheet structures correctly than the mean square error measure, at the expense of coil accuracy, precisely as it was designed to do. By combining the two objective functions, we obtain a marginally improved accuracy of 64.4%, with Matthews coefficients $C_\alpha$, $C_\beta$ and $C_{\mathrm{coil}}$ of 0.40, 0.32 and 0.42, respectively.
However, since all methods to date perform only slightly better than the Bayes algorithm, which entails the drastic assumption of independence of amino acids, one is forced to conclude that little progress has been made on this problem despite the application of a variety of sophisticated algorithms such as neural networks, and that further advances will require a better understanding of the relevant biophysics.
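
The per-class correlation coefficients quoted in the abstract can be illustrated with a minimal sketch. It assumes the standard one-versus-rest definition of the Matthews coefficient, $C = (TP \cdot TN - FP \cdot FN) / \sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}$; the function name `matthews_coefficient` and the toy label strings are hypothetical and are not taken from the paper.

```python
import math


def matthews_coefficient(true_labels, predicted_labels, target):
    """One-versus-rest Matthews coefficient for a single structure class.

    `target` is the class of interest (e.g. "H" for helix); every other
    label is treated as "not target". Returns 0.0 when the denominator
    vanishes, which is a common convention and an assumption here.
    """
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == target and truth == target:
            tp += 1  # correctly predicted as target
        elif pred != target and truth != target:
            tn += 1  # correctly predicted as not target
        elif pred == target:
            fp += 1  # predicted target, truly something else
        else:
            fn += 1  # missed a true target residue
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy per-residue labels (H = helix, E = sheet, C = coil); illustrative only.
true_structure = "HHHHEECCCC"
predicted_structure = "HHHCEECCCH"
for cls in ("H", "E", "C"):
    score = matthews_coefficient(true_structure, predicted_structure, cls)
    print(cls, round(score, 3))
```

Unlike the overall percentage of correctly predicted residues, this coefficient penalizes both over-prediction and under-prediction of a class, which is presumably why it is reported separately for helix, sheet and coil.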
