Rainer Machne, Douglas B. Murray, and Peter F. Stadler
Paper #: 2017-04-013
Motivation: The segmentation of time series and genomic data is a common problem in computational biology. With increasingly complex measurement procedures, individual data points are often not just numbers or simple vectors in which all components are of the same kind. Analysis methods that capitalize on slopes in a single real-valued data track, or that make explicit use of the vectorial nature of the data, are not applicable in such scenarios.
Results: We develop a framework for segmentation in arbitrary data domains that requires only a minimal notion of similarity. Unsupervised clustering of (a sample of) the input yields an approximate segmentation algorithm that is efficient enough for genome-wide applications. As a showcase application, we segment a time series of transcriptome sequencing data from budding yeast, using a similarity measure that focuses on the relative expression profile across the metabolic cycle rather than on coverage per time point.
Availability: The software is available as an R package from https://github.com/raim/segmenTier
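Illustration: The idea summarized under Results can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the segmenTier implementation: it assumes that each position has already been assigned a cluster label, that a cluster-to-cluster similarity matrix is given, and that every new segment incurs a fixed penalty. The names segment, labels, sim, and penalty are hypothetical and chosen only for this example.

```python
def segment(labels, sim, penalty):
    """Split a label sequence into segments by dynamic programming.

    labels  : cluster label (0..k-1) of each position, assumed precomputed
    sim     : k x k matrix, sim[a][c] = similarity of cluster a to cluster c
    penalty : cost charged for opening each segment (assumed constant)

    Returns a list of (start, end, cluster) tuples with inclusive bounds.
    """
    n, k = len(labels), len(sim)
    if n == 0:
        return []

    # score[i][c]: best total score for positions 0..i when position i
    # belongs to a segment assigned to cluster c.
    score = [[0.0] * k for _ in range(n)]
    # back[i][c]: cluster of position i-1 on that best path.
    back = [[0] * k for _ in range(n)]

    for c in range(k):
        score[0][c] = sim[labels[0]][c] - penalty
        back[0][c] = c

    for i in range(1, n):
        for c in range(k):
            # Either extend the current segment or open a new one.
            best_prev, best = c, score[i - 1][c]
            for d in range(k):
                if d != c and score[i - 1][d] - penalty > best:
                    best_prev, best = d, score[i - 1][d] - penalty
            score[i][c] = best + sim[labels[i]][c]
            back[i][c] = best_prev

    # Backtrack from the best final cluster to recover the assignment.
    c = max(range(k), key=lambda x: score[n - 1][x])
    path = [0] * n
    for i in range(n - 1, -1, -1):
        path[i] = c
        c = back[i][c]

    # Collapse runs of equal cluster assignments into segments.
    segments, start = [], 0
    for i in range(1, n + 1):
        if i == n or path[i] != path[start]:
            segments.append((start, i - 1, path[start]))
            start = i
    return segments


if __name__ == "__main__":
    toy_labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
    toy_sim = [[1, -1], [-1, 1]]  # toy similarity between two clusters
    print(segment(toy_labels, toy_sim, penalty=3))
```

In this toy input the isolated label at position 3 is absorbed into the surrounding segment, because opening two extra segments would cost more than a single mismatch. Only the similarity matrix enters the scoring, so the same procedure applies to any data type for which a similarity between clusters can be defined.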