Complex energy landscapes are ubiquitous in physics and biology: they guide important processes such as protein folding and cell differentiation, offering a natural picture for reasoning about the evolution of complex systems with multiple possible final states (i.e., attractors, or local minima). Understanding from which initial conditions each local minimum can be reached is important for the analysis of these complex landscapes. Basins of attraction naturally emerge as a fundamental concept in this regard, as they map initial conditions to their corresponding attractors. However, owing to the high dimensionality of the state space, the geometry of basins is typically not well understood; even characterizing the sizes of basins has been challenging. Recently, it was found that basins in large oscillator networks are often highly convoluted and octopus-like, with long “tentacles” that reach far and wide throughout the landscape.
The learning process of a neural network can be understood through its loss landscape. Here, the loss quantifies the mistakes a neural network makes on the training data set. As we train the neural network, it moves downhill in the loss landscape by modifying the weights of the connections between neurons. Given that a typical neural network has millions of connections, training happens in an extremely high-dimensional space. In this space, the neural network must navigate a rugged loss landscape with many local minima. Not all local minima have good generalization performance: some perform poorly on the unseen validation data set despite low training errors. What distinguishes minima with good generalization performance from those with bad generalization performance? Previous studies have linked generalization performance to the flatness of local minima: wide, flat minima tend to perform better than narrow, sharp ones. The unreasonable effectiveness of modern neural networks can be partially explained by the fact that training algorithms such as stochastic gradient descent are biased toward finding flat minima. Going one step further than local flatness, we characterize the structure of basins and offer a more global perspective on how neural networks learn. Specifically, we hypothesize that the basins of bad minima are localized and bounded within a small region of the parameter space, making them invisible to a typical training algorithm. In contrast, good minima not only have relatively flat regions in their vicinity; their basins also have tentacles that extend far and wide throughout the parameter space. This makes them easy to detect, and once a tentacle is reached, it can guide the neural network toward the desired minimum.
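The notions of moving downhill via gradient descent and of comparing flat versus sharp minima can be made concrete with a minimal numerical sketch. The code below is purely illustrative and not taken from the work itself: it uses a toy one-dimensional "loss landscape" (a stand-in for a real neural-network loss) with one narrow, high-curvature well and one wide, low-curvature well, runs plain gradient descent from two initial points, and estimates the sharpness of each minimum as the average loss increase under small perturbations. The function names (`loss`, `train`, `sharpness`) and all parameter values are assumptions chosen for the example.

```python
import math

def loss(w):
    # Toy 1-D landscape with two wells (illustrative assumption):
    # a sharp minimum at w = -1 and a flat minimum at w = +2.
    sharp = 50.0 * (w + 1.0) ** 2   # narrow, high-curvature well
    flat = 0.5 * (w - 2.0) ** 2     # wide, low-curvature well
    return min(sharp, flat)

def gradient(w, eps=1e-5):
    # Central finite-difference estimate of the local slope.
    return (loss(w + eps) - loss(w - eps)) / (2.0 * eps)

def train(w0, lr=0.01, steps=2000):
    # Plain gradient descent: repeatedly step downhill in the landscape.
    w = w0
    for _ in range(steps):
        w -= lr * gradient(w)
    return w

def sharpness(w_star, radius=0.5, n=101):
    # Average loss increase under small perturbations around a minimum,
    # a simple proxy for the local flatness discussed in the text.
    deltas = [radius * (2.0 * i / (n - 1) - 1.0) for i in range(n)]
    return sum(loss(w_star + d) - loss(w_star) for d in deltas) / n

# Descending from different initial weights reaches different minima.
w_sharp = train(-1.3)   # converges near the sharp minimum at -1
w_flat = train(3.0)     # converges near the flat minimum at +2

print(f"sharp minimum at {w_sharp:.3f}, sharpness {sharpness(w_sharp):.3f}")
print(f"flat minimum at {w_flat:.3f}, sharpness {sharpness(w_flat):.3f}")
```

The perturbation-averaged sharpness comes out much larger for the narrow well than for the wide one, mirroring the flatness criterion discussed above; in an actual network the same comparison would be made in a parameter space of millions of dimensions.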