C. Barnes, C. Burks, Robert Farber, Alan Lapedes, Karl Sirotkin

Paper #: 95-02-011

In this article we report initial, quantitative results on the application of simple neural networks and simple machine learning methods to two problems in DNA sequence analysis. The two problems we consider are: (1) Determination of whether procaryotic and eucaryotic DNA sequence segments are translated to protein. An accuracy of 99.4% is reported for procaryotic DNA (E. coli) and 98.4% for eucaryotic DNA (H. sapiens genes known to be expressed in liver). (2) Determination of whether eucaryotic DNA sequence segments containing the dinucleotides “AG” or “GT” are transcribed to RNA splice junctions. An accuracy of 91.2% was achieved on intron/exon splice junctions (acceptor sites) and 94.5% on exon/intron splice junctions (donor sites). The solution of these two problems, by use of information-processing algorithms operating on unannotated base sequences and without recourse to biological laboratory work, is relevant to the Human Genome Project. A variety of neural network, machine learning, and information-theoretic algorithms are used. (For the purposes of this article, we view neural networks solely as an information-processing procedure and do not consider the possible relation of these formal models to biological networks of neurons.) The accuracies obtained exceed those of previous investigations for which quantitative results are available in the literature. They result from an ongoing program of research that applies machine learning algorithms to the problem of determining the biological function of DNA sequences. Some predictions of possible new genes using these methods are listed, although a complete survey of the H. sapiens and E. coli sections of GenBank using these methods will be given elsewhere.
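As a rough illustration of how problem (2) can be posed to a classifier, the sketch below collects fixed-width windows around every “AG” or “GT” in a sequence and turns each window into a numeric vector. The abstract does not specify the window size or the input representation, so the `flank` parameter, the one-hot encoding, and the function names `candidate_windows` and `one_hot` are assumptions made for illustration only, not the authors' method.

```python
def candidate_windows(seq, dinucleotide, flank=10):
    """Yield (position, window) for each occurrence of `dinucleotide`
    that has `flank` bases available on both sides.

    `flank` is a hypothetical parameter; the abstract gives no window size.
    """
    seq = seq.upper()
    start = seq.find(dinucleotide)
    while start != -1:
        left = start - flank
        right = start + len(dinucleotide) + flank
        # Skip occurrences too close to either end to have full flanks.
        if left >= 0 and right <= len(seq):
            yield start, seq[left:right]
        # Advance by one so overlapping occurrences are also found.
        start = seq.find(dinucleotide, start + 1)


def one_hot(window):
    """Encode a base string as a flat list with 4 values per base (A, C, G, T).

    One-hot encoding is an assumed representation; unknown bases map to zeros.
    """
    table = {
        "A": (1, 0, 0, 0),
        "C": (0, 1, 0, 0),
        "G": (0, 0, 1, 0),
        "T": (0, 0, 0, 1),
    }
    return [bit for base in window for bit in table.get(base, (0, 0, 0, 0))]


if __name__ == "__main__":
    example = "CCCAGGTAAGTCC"  # made-up sequence, not real data
    for pos, window in candidate_windows(example, "GT", flank=2):
        print(pos, window, one_hot(window))
```

Each encoded window would then be labelled as a true or false splice site and passed to whatever classifier is being evaluated; candidate acceptor sites use “AG” and candidate donor sites use “GT”.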
