Big data gets a lot of attention. Fields ranging from cybersecurity to cancer biology to social networks increasingly use behemoth datasets, which can be seen as vast networks. Researchers search those networks for patterns and connections that could help solve problems: stopping hackers, lengthening survival, improving communication.
But there’s a challenge. The noise in high-dimensional datasets can obscure real correlations — and give rise to illusory patterns that don’t mean anything.
In biology, for example, a researcher may sequence the genomes of 100 mice and analyze tens of thousands of genes. That’s a lot of data, but the amount of information per gene (the number of mice) is relatively small. When researchers analyze that dataset, they may find spurious correlations: apparent connections between genes and disease risk that arise purely by chance.
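A small simulation can make the point concrete. The sketch below is illustrative only and is not drawn from Moore’s work: it assumes 100 mice, 10,000 genes, a disease-risk score, and a cutoff of |r| > 0.3, with every value drawn as independent random noise, so any correlation it reports is spurious by construction. The function names and parameters are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def count_spurious(n_mice=100, n_genes=10_000, threshold=0.3, seed=0):
    """Count genes whose pure-noise values correlate with a pure-noise risk score."""
    rng = random.Random(seed)
    # Assumed setup: disease risk is random and unrelated to any gene.
    risk = [rng.gauss(0, 1) for _ in range(n_mice)]
    hits = 0
    strongest = 0.0
    for _ in range(n_genes):
        gene = [rng.gauss(0, 1) for _ in range(n_mice)]
        r = abs(pearson(gene, risk))
        strongest = max(strongest, r)
        if r > threshold:
            hits += 1
    return hits, strongest

hits, strongest = count_spurious()
print(f"{hits} of 10000 unrelated genes exceed |r| > 0.3; strongest |r| = {strongest:.2f}")
```

With only 100 mice per gene, a correlation that strong is rare for any single gene, yet across thousands of genes some are expected to clear the bar by luck alone. Raising `n_mice` in the sketch should shrink the count, which is the sense in which the information per gene, not the total volume of data, sets the limit.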
“Humans are very good at seeing patterns, even when they’re not there,” says Cristopher Moore, a professor at SFI. “We have a strong tendency toward false positives. Our algorithms do, too.” To better understand the limits of finding meaningful patterns in big data, Moore has organized a working group that will meet at SFI April 2-5. He has invited an interdisciplinary group of mathematicians, physicists, and theoretical computer scientists to address the problem and to devise new algorithms that succeed all the way up to the limits imposed by not having enough data, or by not knowing whether the data is accurate.
Moore suggests that networks can undergo a phase transition of sorts, shifting from order to disorder, similar to how ice melts or iron demagnetizes. At low temperatures, the magnetic fields of the atoms in a block of iron mostly align in the same direction. Raise the temperature enough, and the iron’s magnetic strength abruptly drops to zero.
That analogy extends to networks. With enough information about each node (for instance, when a node has links to similar nodes), a network can readily be classified into groups of similar members. But if you add noise, in the form of nodes with incomplete information or unexpected connections, eventually the noise overwhelms the signal. It becomes impossible or infeasible to find meaningful patterns.
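The two ends of that spectrum can be sketched with a toy network. The example below is an assumption-laden illustration, not the working group’s method: it plants two groups of 100 nodes each, links pairs with probability `p_in` inside a group and `p_out` across groups, and guesses the groups from the sign of the dominant eigenvector of the adjacency matrix minus its average density, found by simple power iteration. All names and parameter values are hypothetical.

```python
import random

def planted_graph(n, p_in, p_out, rng):
    """Adjacency matrix with two planted groups: the first n//2 nodes and the rest."""
    labels = [0 if i < n // 2 else 1 for i in range(n)]
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1
    return adj, labels

def spectral_split(adj, rng, iterations=60):
    """Guess two groups from the signs of the dominant eigenvector of
    (adjacency - average density), computed by power iteration."""
    n = len(adj)
    density = sum(map(sum, adj)) / (n * n)
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iterations):
        total = sum(v)
        # Multiply by (A - density * all-ones matrix) without building it.
        w = [sum(a * x for a, x in zip(row, v)) - density * total for row in adj]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [0 if x > 0 else 1 for x in v]

def accuracy(guess, labels):
    """Fraction of nodes labeled correctly, ignoring which group is called 0 or 1."""
    agree = sum(g == t for g, t in zip(guess, labels)) / len(labels)
    return max(agree, 1 - agree)

rng = random.Random(0)
# Clear signal: nodes link mostly to similar nodes.
clear_adj, labels = planted_graph(200, 0.30, 0.02, rng)
# Pure noise: links ignore group membership entirely.
noisy_adj, _ = planted_graph(200, 0.16, 0.16, rng)

clear_acc = accuracy(spectral_split(clear_adj, rng), labels)
noisy_acc = accuracy(spectral_split(noisy_adj, rng), labels)
print(f"clear signal: {clear_acc:.2f} correct; pure noise: {noisy_acc:.2f} correct")
```

In the clear case the guess should match the planted groups almost perfectly, while in the noisy case it should hover near a coin flip, because the links carry no information about group membership. The sketch shows only the two extremes. Where exactly the transition between them falls, and how sharp it is, is the kind of question the phase-transition view is meant to address.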
Recognizing the inherent limits of finding meaning, says Moore, can help researchers map out the difference between real patterns and illusory ones.