Finding meaning in big data

A network of political blogs, subdivided into groups of similar members. (Image: Zhang and Moore, 2014)

April 2, 2018

Big data gets a lot of attention. Fields ranging from cybersecurity to cancer biology to social networks increasingly use behemoth datasets, which can be seen as vast networks. Researchers search those networks for patterns and connections that could help solve problems: stop hackers, lengthen survival, improve communication.

But there’s a challenge. The noise in high-dimensional datasets can obscure real correlations — and give rise to illusory patterns that don’t mean anything.

In the case of biology, for example, a researcher may sequence the genomes of 100 mice and analyze tens of thousands of genes. That’s a lot of data, but the amount of information per gene (the number of mice) is relatively small. When researchers analyze that dataset, they may find spurious correlations between genes and disease risk: connections that occur purely by chance.
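To see how easily chance alone produces such correlations, here is a minimal simulation sketch. It is an illustration only, not an analysis from the working group: the gene measurements and the disease-risk trait are pure random noise by construction, and the function name, the figure of 20,000 genes, and the use of Gaussian noise are assumptions made for the example.

```python
import numpy as np


def max_chance_correlation(n_mice=100, n_genes=20_000, seed=0):
    """Largest |Pearson r| between a random trait and purely random genes.

    Everything here is noise (an assumption of this sketch), so any
    correlation that turns up is spurious by construction.
    """
    rng = np.random.default_rng(seed)
    genes = rng.standard_normal((n_mice, n_genes))  # one column per gene
    trait = rng.standard_normal(n_mice)             # hypothetical disease-risk score

    # Standardize each column, then Pearson r is the mean of the products.
    genes_z = (genes - genes.mean(axis=0)) / genes.std(axis=0)
    trait_z = (trait - trait.mean()) / trait.std()
    correlations = genes_z.T @ trait_z / n_mice

    return float(np.abs(correlations).max())


print("strongest correlation found in pure noise:", max_chance_correlation())
```

With only 100 mice, each individual correlation is uncertain by roughly 0.1, so searching across tens of thousands of genes tends to turn up a few that look substantial even though none of them is real.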

“Humans are very good at seeing patterns, even when they’re not there,” says Cristopher Moore, Professor at SFI. “We have a strong tendency toward false positives. Our algorithms do, too.” To better understand the limits of finding meaningful patterns in big data, Moore has organized a working group, to be held at SFI April 2-5. He’s invited an interdisciplinary group of mathematicians, physicists, and theoretical computer scientists to address the problem and devise new algorithms that can succeed all the way up to the limits that arise from not having enough data, or not knowing if the data is accurate.

Moore suggests that networks can undergo a phase transition of sorts, shifting from order to disorder, similar to how ice melts or iron demagnetizes. At low temperatures, the magnetic fields of the atoms in a block of iron mostly align in the same direction. Raise the temperature enough, and the iron’s magnetic strength abruptly drops to zero.
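The magnet half of that analogy can be sketched with a textbook simplification, the mean-field model of a magnet, in which the magnetization m satisfies m = tanh(m / T). This model is an assumption chosen for illustration and is not taken from the article; temperature is measured in units of the critical temperature, and the function name is hypothetical.

```python
import math


def mean_field_magnetization(temperature, iterations=2000):
    """Solve m = tanh(m / T) by repeated substitution.

    Mean-field approximation (an assumption of this sketch), with
    temperature in units of the critical temperature, so T = 1 marks
    the transition. Starting from m = 1 picks the magnetized solution
    whenever one exists.
    """
    m = 1.0
    for _ in range(iterations):
        m = math.tanh(m / temperature)
    return m


for t in (0.5, 0.8, 0.95, 1.05, 1.5):
    print(f"T = {t:4.2f}  magnetization = {mean_field_magnetization(t):.4f}")
```

In this simplified model the magnetization stays close to its maximum at low temperature, falls steeply as the critical temperature approaches, and is zero above it.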

That analogy extends to networks. With enough information about each node — for instance, when a node has links to similar nodes — a network can readily be classified into groups of similar members. But if you add noise by adding nodes with incomplete information or unexpected connections, eventually the noise overwhelms the signal. It becomes impossible or unfeasible to find meaningful patterns.
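A small experiment can illustrate the idea. The sketch below is not the working group's method. It assumes a standard toy model (a stochastic block model with two equal groups) and a simple spectral classifier, and it models noise by raising the probability of links between groups; the function names and parameter values are hypothetical.

```python
import numpy as np


def sample_sbm(n, p_in, p_out, rng):
    """Random network with two equal groups (stochastic block model).

    Nodes in the same group link with probability p_in, nodes in
    different groups with probability p_out.
    """
    labels = np.repeat([0, 1], n // 2)
    same_group = labels[:, None] == labels[None, :]
    link_prob = np.where(same_group, p_in, p_out)
    upper = np.triu(rng.random((n, n)) < link_prob, k=1)
    adjacency = (upper | upper.T).astype(float)  # symmetric, no self-links
    return adjacency, labels


def spectral_accuracy(adjacency, labels):
    """Fraction of nodes grouped correctly by a simple spectral method.

    Nodes are split by the sign of the eigenvector belonging to the
    second-largest eigenvalue. Group names are arbitrary, so the better
    of the two possible labelings is reported; 0.5 means pure guessing.
    """
    _, eigenvectors = np.linalg.eigh(adjacency)  # eigenvalues ascending
    guess = (eigenvectors[:, -2] > 0).astype(int)
    agreement = float(np.mean(guess == labels))
    return max(agreement, 1.0 - agreement)


rng = np.random.default_rng(0)
for p_out in (0.05, 0.10, 0.15, 0.20):
    adjacency, labels = sample_sbm(400, 0.20, p_out, rng)
    print(f"p_out = {p_out:.2f}  accuracy = {spectral_accuracy(adjacency, labels):.2f}")
```

As the between-group link probability approaches the within-group one, the groups become statistically indistinguishable, and no classifier can be expected to do better than chance.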

Recognizing the inherent limits of finding meaning, says Moore, can help researchers map out the difference between real patterns and illusory ones.

Read more about the working group "Limits to Inference in Networks and Noisy Data."
