Information theory holds surprises for machine learning

Examples from the MNIST handwritten digits database (Image: Josef Steppan)

January 24, 2019

New SFI research challenges a popular conception of how machine learning algorithms “think” about certain tasks.

The conception goes something like this: because of their ability to discard useless information, a class of machine learning algorithms called deep neural networks can learn general concepts from raw data — like identifying cats generally after encountering tens of thousands of images of different cats in different situations. This seemingly human ability is said to arise as a byproduct of the networks’ layered architecture. Early layers encode the “cat” label along with all of the raw information needed for prediction. Subsequent layers then compress the information, as if through a bottleneck. Irrelevant data, like the color of the cat’s coat, or the saucer of milk beside it, is forgotten, leaving only general features behind. Information theory provides bounds on just how optimal each layer is, in terms of how well it can balance the competing demands of compression and prediction.

“A lot of times when you have a neural network and it learns to map faces to names, or pictures to numerical digits, or amazing things like French text to English text, it has a lot of intermediate hidden layers that information flows through,” says Artemy Kolchinsky, an SFI Postdoctoral Fellow and the study’s lead author. “So there’s this long-standing idea that as raw inputs get transformed to these intermediate representations, the system is trading prediction for compression, and building higher-level concepts through this information bottleneck.”

However, Kolchinsky and his collaborators Brendan Tracey (SFI, MIT) and Steven Van Kuyk (University of Wellington) uncovered a surprising weakness when they applied this explanation to common classification problems, where each input has one correct output (e.g., each picture shows either a cat or a dog). In such cases, they found that classifiers with many layers generally do not give up some prediction for improved compression. They also found that there are many “trivial” representations of the inputs which are, from the point of view of information theory, optimal in terms of their balance between prediction and compression.

“We found that this information bottleneck measure doesn’t see compression in the same way you or I would. Given the choice, it is just as happy to lump ‘martini glasses’ in with ‘Labradors’ as it is to lump them in with ‘champagne flutes,’” Tracey explains. “This means we should keep searching for compression measures that better match our notions of compression.”
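Tracey's point can be checked on a toy problem. The sketch below is an illustration only, not the authors' code: the eight-image dataset, the class names "poodle", "glassware" and "mixed", and the helper functions are all assumptions made for this example. It builds a deterministic task in which every input has exactly one label, merges two classes in a sensible way and in an odd way, and measures each merged representation T by two quantities: compression, I(X;T), and prediction, I(T;Y).

```python
from collections import defaultdict
from math import log2


def mutual_information(pairs):
    """Mutual information, in bits, between the two coordinates of
    a list of equally likely (a, b) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    joint = defaultdict(float)
    p_a = defaultdict(float)
    p_b = defaultdict(float)
    for a, b in pairs:
        joint[(a, b)] += 1 / n
        p_a[a] += 1 / n
        p_b[b] += 1 / n
    return sum(p * log2(p / (p_a[a] * p_b[b])) for (a, b), p in joint.items())


# Hypothetical toy dataset: eight equally likely images, two per class.
# Each image has exactly one correct label, so the task is deterministic.
labels = [
    "labrador", "labrador",
    "poodle", "poodle",
    "martini glass", "martini glass",
    "champagne flute", "champagne flute",
]


def bottleneck_point(merge):
    """Return (I(X;T), I(T;Y)) for a representation T that renames
    labels according to `merge` and leaves all other labels alone."""
    t = [merge.get(y, y) for y in labels]
    x = range(len(labels))  # each image is its own input symbol
    compression = mutual_information(zip(x, t))
    prediction = mutual_information(zip(t, labels))
    return compression, prediction


# A merge that matches intuition: both kinds of glass become one concept.
sensible = bottleneck_point(
    {"martini glass": "glassware", "champagne flute": "glassware"}
)
# A merge that does not: martini glasses are lumped in with Labradors.
odd = bottleneck_point({"martini glass": "mixed", "labrador": "mixed"})

print("sensible merge:", sensible)
print("odd merge:     ", odd)
```

Under these assumptions both merges land on the same point, 1.5 bits of compression and 1.5 bits of prediction. For each of them I(T;Y) equals I(X;T), and since a representation computed from the input can never carry more information about the label than it carries about the input, neither merge can be improved on at that level of compression. The measure has no way to prefer the intuitive grouping.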

While the idea of compressing inputs may still play a useful role in machine learning, this research suggests it is not sufficient for evaluating the internal representations used by different machine learning algorithms.

At the same time, Kolchinsky says that the concept of a trade-off between compression and prediction will still hold for less deterministic tasks, like predicting the weather from a noisy dataset. “We’re not saying that information bottleneck is useless for supervised [machine] learning,” Kolchinsky stresses. “What we’re showing here is that it behaves counter-intuitively on many common machine learning problems, and that’s something people in the machine learning community should be aware of.”

The paper has been accepted to the 2019 International Conference on Learning Representations (ICLR 2019).
