Bailin Hao

Paper #: 06-10-037

All the symbols and symbolic sequences we use in science arise from a coarse-grained description of Nature. According to a theorem in C. Shannon's seminal 1948 paper, the set of all symbolic sequences of length N over a finite alphabet may be roughly divided into two subsets: a huge typical set and a tiny set of atypical sequences. Biological sequences, as the result of billions of years of evolution, must belong to this tiny set. An effective way of studying the atypical set is to look at real data. We will report some observations on real DNA and protein data, including the "avoidance signature" of bacterial genomes, taxon-specific repeats in these genomes, fine structure in the number distribution of K-strings in randomized genomes, and the near-uniqueness of the reconstruction of protein sequences from their constituent K-peptides. These observations may sometimes lead to interesting pieces of biology-inspired mathematics related to combinatorics, graph theory and formal language theory.
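
As a rough illustration of the K-string counting that the abstract refers to, the Python sketch below tallies every overlapping substring of length K in a sequence and lists the K-strings that never occur. It assumes that "avoidance" can be read simply as the absence of a K-string; the paper's own definition may differ. The function names and the toy sequence are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product


def count_k_strings(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def missing_k_strings(seq, k, alphabet="ACGT"):
    """Return, in sorted order, the K-strings over alphabet absent from seq."""
    counts = count_k_strings(seq, k)
    candidates = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(s for s in candidates if s not in counts)


# Toy sequence (hypothetical, not real genome data).
toy = "ACGTACGTTTGACA"
counts = count_k_strings(toy, 2)
absent = missing_k_strings(toy, 2)
```

On a real genome one would read the sequence from a file and choose K large enough that absent strings are informative; for short K every string typically occurs, and the interest shifts to the shape of the count distribution.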
