Information: A Tool for Understanding Deep Learning

The dramatic increase in computing power, coupled with the exponential growth of training data, is powering Deep Learning (DL), a subset of Artificial Intelligence (AI). As a result, the range of problems DL can solve keeps widening. Both academia and industry are turning to DL to drive research and solve existing business problems. To help you gain a deeper understanding of its inner workings, this post provides an overview of some recent work by Naftali Tishby, a leader in machine learning research and computational neuroscience.

The Current State of Deep Learning

Despite its wide adoption, DL remains poorly understood in theory: much of what we know comes from applying algorithms in practice rather than from theoretical analysis. In fact, model selection and parameter tuning in DL are still experimental tasks rather than problems with provable, justifiable solutions. There are some rules of thumb for model selection and tuning, but they are rarely backed by a provable theoretical result.

Tishby’s Advancements in Deep Learning

In 2017, Naftali Tishby and his students made significant progress toward a theoretical understanding of DL via information theory[i]. They visualized Deep Neural Networks (DNNs) on the Information Plane, that is, the plane of the mutual information values that each layer preserves about the input and output variables. This showed that the training of a DNN can be divided into two phases: the Empirical Error Minimization (ERM) phase and the Representation Compression (RC) phase. They also illustrated how the number of layers in a DNN and the size of its training set affect the results. It’s worth noting that Tishby and his team arrived at these conclusions by observing the behaviour of a DNN during training, rather than through strict theoretical proof. Two further observations are interesting here:

Deep Neural Networks learn like humans. The two phases observed in training are strikingly consistent with the stages of human learning. On one hand, the ERM phase is consistent with how humans learn from concrete examples. For instance, when we were kids and learned the concept of the number “1”, we were taught that “one apple” is “1” and “one banana” is also “1”. Throughout the learning process, we were given multiple examples of these instances and tried to memorize them all. On the other hand, the RC phase is consistent with how humans learn abstract concepts. Based on all the concrete examples that we learn of the number “1”, we are gradually able to grasp the abstract concept of the number.
Information theory will help us understand Machine Learning (ML). The work by Tishby and his team also illustrated that concepts from information theory, such as entropy and mutual information, are useful for understanding ML. It has long been recognized that ML and information theory are closely connected and that the latter can be used to analyze ML models. This is because finding a good ML model closely resembles finding a good encoder, which is a typical problem in information theory. However, it is still very rare to prove a conclusion in ML using information theory. And, before Tishby, ideas and concepts from information theory were used in ML only in a very restricted manner. For example, entropy was used mainly to define loss functions, such as cross-entropy. What Tishby and his team brought to the table is a demonstration that information is a valuable tool for observing the behaviour of DL models.
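
To make entropy, mutual information and the Information Plane more concrete, here is a minimal Python sketch. It is not the code used by Tishby and his team. It is a simple plug-in estimator that counts discretized samples, and the toy dataset, the two "layers", the bin count and all function names are assumptions chosen for illustration. For a layer output T, it estimates the two Information Plane coordinates: I(X;T), how much the layer retains about the input, and I(T;Y), how much it retains about the label.

```python
import math
from collections import Counter


def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of `samples`."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y) from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def discretize(activations, n_bins=4, lo=-1.0, hi=1.0):
    """Map each activation vector to a tuple of bin indices.

    Assumes activations lie in [lo, hi], as they would for tanh units.
    """
    width = (hi - lo) / n_bins

    def bin_index(a):
        return min(n_bins - 1, max(0, int((a - lo) / width)))

    return [tuple(bin_index(a) for a in vec) for vec in activations]


# Toy dataset (hypothetical): four equally likely 2-bit inputs,
# and the label is simply the first bit.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [x[0] for x in inputs]

# Two made-up layer outputs, one activation vector per input:
# layer_a keeps both input bits, layer_b keeps only the label-relevant bit.
layer_a = [(-0.9, -0.9), (-0.9, 0.9), (0.9, -0.9), (0.9, 0.9)]
layer_b = [(-0.9,), (-0.9,), (0.9,), (0.9,)]

for name, layer in [("layer_a", layer_a), ("layer_b", layer_b)]:
    t = discretize(layer)
    i_xt = mutual_information(inputs, t)  # horizontal axis of the plane
    i_ty = mutual_information(t, labels)  # vertical axis of the plane
    print(f"{name}: I(X;T) = {i_xt:.2f} bits, I(T;Y) = {i_ty:.2f} bits")
# layer_a: I(X;T) = 2.00 bits, I(T;Y) = 1.00 bits
# layer_b: I(X;T) = 1.00 bits, I(T;Y) = 1.00 bits
```

In this toy case, layer_b carries less information about the input than layer_a while keeping everything that matters for the label, which is the kind of compressed representation the RC phase is described as producing. On a real network, estimates like these depend heavily on the number of bins and the number of samples, so they should be read with caution.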

What the Future Holds

Though later work has successfully shown that his findings cannot be generalized to all DNNs[ii], I want to acknowledge that his work has helped the community take a large step toward a better understanding of DL. Next, I expect that information theory will be used to analyze DL models and to provide provable results for model selection across different ML problems. I also expect that we will soon be able to not only ‘open the black box of deep neural networks’ but also ‘shine a light’ on them to gain an even deeper understanding.

[i] Shwartz-Ziv R, Tishby N. Opening the black box of deep neural networks via information[J]. arXiv preprint arXiv:1703.00810, 2017.

[ii] Saxe A M, Bansal Y, Dapello J, et al. On the information bottleneck theory of deep learning[J]. 2018.