James Crutchfield, Cosma Shalizi, Kristina Shalizi

Paper #: 02-10-060

We present a new algorithm for discovering patterns in time series and other sequential data. We exhibit a reliable procedure for building the minimal set of hidden, Markovian states that is statistically capable of producing the behavior exhibited in the data: the underlying process's causal states. Unlike conventional methods for fitting hidden Markov models (HMMs) to data, our algorithm makes no assumptions about the process's causal architecture (the number of hidden states and their transition structure), but rather infers it from the data. It starts with assumptions of minimal structure and introduces complexity only when the data demand it. Moreover, the causal states it infers have important predictive optimality properties that conventional HMM states lack. Here, in Part I, we introduce the algorithm, review the theory behind it, prove its asymptotic reliability, and use large deviation theory to estimate its rate of convergence. In the sequel, Part II, we outline the algorithm's implementation, illustrate its ability to discover even “difficult” patterns, and compare it to various alternative schemes.
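The core idea sketched above, grouping observed histories into hidden states and creating a new state only when the data demand it, can be illustrated with a toy example. This is not the paper's algorithm (which uses statistical significance tests and recursive refinement, as detailed in Part II); it is a minimal sketch under simplified assumptions, using a fixed total-variation threshold to decide when two histories' next-symbol distributions are "the same". All function names here are illustrative.

```python
from collections import Counter, defaultdict

def next_symbol_dists(seq, max_len):
    """Empirical next-symbol distribution for every history (suffix)
    of length 0..max_len observed in the sequence."""
    counts = defaultdict(Counter)
    for L in range(max_len + 1):
        for i in range(L, len(seq)):
            counts[seq[i - L:i]][seq[i]] += 1
    return {h: {s: c / sum(cnt.values()) for s, c in cnt.items()}
            for h, cnt in counts.items()}

def split_states(dists, tol=0.05):
    """Greedy grouping in the spirit of causal-state splitting:
    a history joins an existing state if its next-symbol distribution
    is within `tol` (total-variation distance) of that state's
    representative; otherwise a new state is introduced."""
    states = []  # each state: (representative distribution, member histories)
    for hist, d in sorted(dists.items()):
        for rep, members in states:
            alphabet = set(rep) | set(d)
            tv = 0.5 * sum(abs(rep.get(a, 0.0) - d.get(a, 0.0))
                           for a in alphabet)
            if tv <= tol:
                members.append(hist)
                break
        else:  # no existing state fits: complexity added only on demand
            states.append((d, [hist]))
    return states
```

On a constant sequence every history predicts the same next symbol, so a single state suffices; on a period-2 sequence the histories ending in "a" and in "b" predict differently and are forced into separate states (the length-0 history forms a third, transient group, which the full algorithm would refine away).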