Walter Fontana, Danielle Konings, Peter Schuster, Peter Stadler

Paper #: 92-02-007

Large ensembles of RNA sequences are folded into secondary structures with minimum free energies. Four nucleotide alphabets are used: the two binary alphabets AU and GC, the biophysical alphabet AUGC, and the synthetic alphabet GCXK. The alphabets define the base-pairing rules and, by their physical nature, also the strengths of the base-pair interactions. All quantities presented here depend strongly on the particular alphabet chosen. RNA secondary structures are partitioned into structural elements, such as stacks, loops, joints, and free ends. Statistical properties of these elements are computed for different chain lengths up to $\nu = 100$. The results obtained from the statistics of random ensembles are compared with data derived from natural RNA molecules with similar base frequencies. Secondary structures are represented as trees. A quantitative measure of the distance between two structures, the “tree distance” $d_t$, is obtained by means of tree editing. Two different but formally equivalent tree representations are introduced and compared in actual computations of RNA structures. We introduce a structure density surface as the conditional probability $P(t|h)$ that two structures have tree distance $d_t = t$, given that the sequences that fold into them have Hamming distance $d_h = h$. Structure density surfaces provide insight into the “shape space” of RNA secondary structures. Nearly the entire range of tree distances is covered with considerable probability already at small Hamming distances from a typical sequence. This suggests that the vast majority of possible structures occur within a fairly small neighborhood of any random sequence. Correlation lengths for secondary structures in their tree representations are computed from probability densities. They are appropriate measures of the complexity or “ruggedness” of structure landscapes.
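
As an illustration of the conditional probability $P(t|h)$ defined above, the following minimal Python sketch estimates it from a sample of sequence pairs by simple counting. It is not the authors' implementation: the function names `hamming` and `density_surface` are hypothetical, and the tree distances $d_t$ are assumed to be supplied by a separate tree-editing routine that is not shown here.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Hamming distance d_h between two sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def density_surface(pairs):
    """Estimate P(t|h) from an iterable of (h, t) observations.

    h is the Hamming distance between two sequences, t the tree distance
    between their structures (assumed to be computed elsewhere).
    Returns a nested dict: surface[h][t] = estimated P(t|h).
    """
    counts = defaultdict(Counter)
    for h, t in pairs:
        counts[h][t] += 1
    surface = {}
    for h, row in counts.items():
        total = sum(row.values())  # number of observations with d_h = h
        surface[h] = {t: n / total for t, n in row.items()}
    return surface
```

For each observed Hamming distance $h$, the estimated probabilities over $t$ sum to one, so every row of the result is a normalized histogram of tree distances.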