Skip to content

Opening the black box of deep neural networks via information – part I

November 15, 2017

Opening the black box of deep neural networks via information Schwartz-Viz & Tishby, ICRI-CI 2017

In my view, this paper fully justifies all of the excitement surrounding it. We get three things here: (i) a theory we can use to reason about what happens during deep learning, (ii) a study of DNN learning during training based on that theory, which sheds a lot of light on what is happening inside, and (iii) some hints for how the results can be applied to improve the efficiency of deep learning – which might even end up displacing SGD in the later phases of training.

Despite their great success, there is still no comprehensive understanding of the optimization process or the internal organization of DNNs, and they are often criticized for being used as mysterious “black boxes”.

I was worried that the paper would be full of impenetrable math, but with full credit to the authors it’s actually highly readable. It’s worth taking the time to digest what the paper is telling us, so I’m going to split my coverage into two parts, looking at the theory today, and the DNN learning analysis & visualisations tomorrow.

An information theory of deep learning

Consider the supervised learning problem whereby we are given inputs x \in X and we want to predict labels y \in Y. Inside the network we are learning some representation T(X) of the input patterns, that we hope enables good predictions. We also want good generalisation, not overfitting.

Think of a whole layer T as a single random variable. We can describe this layer by two distributions: the encoder P(T|X) and the decoder P(Y|T).

So long as these transformations preserve information, we don’t really care which individual neurons within the layers encode which features of the input. We can capture this idea by thinking about the mutual information of T with the input X and the desired output Y.

Given two random variables X and Y, their mutual information I(X;Y) is defined based on information theory as
\displaystyle I(X;Y) = H(X) - H(X|Y)

Where H(X) is the entropy of X and H(X|Y) is the conditional entropy of X given Y.

The mutual information quantifies the number of relevant bits that the input X contains about the label Y, on average.

If we put a hidden layer T between X and Y then T is mapped to a point in the information plane with coordinates (I(X;T), I(T;Y)). The Data Processing Inequality (DPI) result tells us that for any 3 variables forming a Markov chain X \rightarrow Y \rightarrow Z we have I(X;Y) \geq I(X;Z).

So far we’ve just been considering a single hidden layer. To make a deep neural network we need lots of layers! We can think of a Markov chain of K-layers, where T^i denotes the i^{th} hidden layer.

In such a network there is a unique information path which satisfies the DPI chains:

\displaystyle I(X;Y) \geq I(T_1;Y) \geq I(T_2;Y) \geq ... \geq I(T_k;Y) \geq I(\hat{Y};Y)

and
\displaystyle H(X) \geq I(X;T_1) \geq I(X;T_2) \geq ... \geq I(X;T_k) \geq I(X;\hat{Y})

Now we bring in another property of mutual information; it is invariant in the face of invertible transformations:

\displaystyle I(X;Y) = I(\psi (X); \phi(Y))

for any invertible functions \psi and \phi.

And this reveals that the same information paths can be realised in many different ways:

Since layers related by invertible re-parameterization appear in the same point, each information path in the plane corresponds to many different DNN’s , with possibly very different architectures.

Information bottlenecks and optimal representations

An optimal encoder of the mutual information I(X;Y) would create a representation of a minimal sufficient statistic of X with respect to Y. If we have a minimal sufficient statistic then we can decode the relevant information with the smallest number of binary questions. (That is, it creates the most compact encoding that still enables us to predict Y as accurately as possible).

The Information Bottleneck (IB) tradeoff Tishby et al. (1999) provides a computational framework for finding approximate minimal sufficient statistics. That is, the optimal tradeoff between the compression of X and the prediction of Y.

The Information Bottleneck tradeoff is formulated by the following optimization problem, carried independently for the distributions, p(t|x), p(t), p(y|t), with the Markov chain: Y \rightarrow X \rightarrow T

\displaystyle \min_{p(t|x),p(y|t),p(t)} \{ I(X;T) - \beta I(T;Y)\}

where the Lagrange multiplier \beta determines the level of relevant information captured by the representation T, I(T;Y).

The solution to this problem defines an information curve: a monotonic concave line of optimal representations that separates the achievable and unachievable regions in the information plane.

Noise makes it all work

Section 2.4 contains a discussion on the crucial role of noise in making the analysis useful (which sounds kind of odd on first reading!). I don’t fully understand this part, but here’s the gist:

The learning complexity is related to the number of relevant bits required from the input patterns X for a good enough prediction of the output label Y, or the minimal I(X; \hat{X}) under a constraint on I(\hat{X}; Y) given by the IB.

Without some noise (introduced for example by the use of sigmoid activation functions) the mutual information is simply the entropy H(Y) independent of the actual function we’re trying to learn, and nothing in the structure of the points p(y|x) gives us any hint as to the learning complexity of the rule. With some noise, the function turns into a stochastic rule, and we can escape this problem. Anyone with a lay-person’s explanation of why this works, please do post in the comments!

Setting the stage for Part II

With all this theory under our belts, we can go on to study the information paths of DNNs in the information plane. This is possible when we know the underlying distribution P(X,Y) and the encoder and decoder distributions P(T|X) and P(Y|T) can be calculated directly.

Our two order parameters, I(T; X) and I(T;Y), allow us to visualize and compare different network architectures in terms of their efficiency in preserving the relevant information in P(X;Y).

We’ll be looking at the following issues:

  1. What is SGD actually doing in the information plane?
  2. What effect does training sample size have on layers?
  3. What is the benefit of hidden layers?
  4. What is the final location of hidden layers?
  5. Do hidden layers form optimal IB representations?
11 Comments leave one →
  1. buriy permalink
    November 15, 2017 2:46 pm

    Have you found how to calculate this I(X,T_i) in practice, say, for MNIST and FC layers?

    • November 15, 2017 3:39 pm

      In section 2.5 of the paper, the authors describe the criteria under which this can be done. They only do so for a strictly controlled environment, leaving the more general case to future work – though they give some hints: “As proposed by Tishby and Zaslavsky (2015), we study the information paths of DNNs in the information plane. This can be done when the underlying distribution, P (X, Y ), is known and the encoder and decoder distributions P (T |X ) and P (Y |T ) can be calculated directly. For ”large real world” problems these distributions and mutual information values should be estimated from samples or by using other modeling assumptions. Doing this is beyond the scope of this work, but we are convinced that our analysis and observations are general, and expect the dynamics phase transitions to become even sharper for larger networks, as they are inherently based on statistical ensemble properties. Good overviews on methods for mutual information estimation can be found in Paninski (2003) and Kraskov et al. (2004).”

  2. Andre permalink
    November 15, 2017 3:33 pm

    Thank you so much for reviewing this topic! You mention “If we put a hidden layer T between X and Y then T is mapped to a point in the information plane with coordinates (I(X;T), I(T;Y))”

    I’m not yet very familiar with Information Theory beyond entropy. What does the information plane mean? And what does it mean to map a point in the information plane with some coordinates? This topic is really fascinating!

    • November 15, 2017 3:37 pm

      By information plane here, the authors are referring (as I understand it!), to the 2D plane created with the mutual information between the input and the hidden layer on the x-axis, and the mutual information between the output and the hidden layer on the y-axis. Another way of thinking about what this phrase means is just “when we think about what’s happening inside of the neural network in terms of information…”. Part II (posted tomorrow) contains a number of example plots etc., that will hopefully help to make all this a little clearer.

      • Andre permalink
        November 15, 2017 3:44 pm

        Looking forward to it! Thanks a lot for this review, reading someone else’s perspective helps. I still need to wrap my head around the whole idea of information theory

  3. unohoo permalink
    November 15, 2017 6:02 pm

    Great post. Do you know of any other blogs/podcasts/youtube channels doing something similar to your morning paper, or 2 minute papers on youtube but in other fields ?
    It’s a great idea, I’d love to be introduced the gist of papers from fields outside of computer science.

  4. November 16, 2017 5:08 pm

    About adding noise to a model, if it’s the same concept that I envision, you’d be adding extra bits to an existential set of data to generalize the data structure for pattern recognition. It might help to take an example, It’s like adding noise to a clear photo in Photoshop, if you use the filter to add noise to a picture you see more blocked edges on the photo, this allows for more generalized manipulations of the image because extra bits are added onto the smooth edges. A graphics artist can add effects to an image because the separations become more distinct between objects.

  5. November 16, 2017 5:29 pm

    Quote: “… Without some noise (introduced for example by the use of sigmoid activation functions) …”

    I do not see how sigmoid functions could introduce noise by them selves. The way I see it is more: the noise is introduced by the random initialization of the parameters, the variance of this noise is THEN amplified by the sigmoid function (or any other non-linearity).

Trackbacks

  1. Opening the black box of deep neural networks via information – Part II | the morning paper
  2. What Happens In Deep Neural Networks? – Curated SQL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: