**Post: #1**

The Information Bottleneck: Theory and Applications

Information Bottleneck:.pdf (Size: 1.34 MB / Downloads: 57)

Abstract

This thesis introduces the first comprehensive review of the Information Bottleneck (IB) method along

with its recent extension, the multivariate IB. The IB method was originally suggested in [82] as a new

information-theoretic approach for data analysis. The basic idea is surprisingly simple: Given a joint distri-

μ, find a compressed representation of , denoted by , that is as informative as possible about

bution ́

. This idea can be formulated as a variational principle of minimizing the mutual information, ́

μ

(which controls the compactness of the representation ), under some constraint on the minimal level of

μ . Hence, the fundamental trade-off between

mutual information that preserves about , given by ́

the complexity of the model and its precision is expressed here in an entirely symmetric form, where the

exact same concept of information controls both its sides. Indeed, an equivalent posing of the IB principle

μ

would be to maximize the information maintains about , where the (compression) information ́

is constrained to some maximal level.

As further shown in [82], this constrained optimization problem can be considered analogous to rate distor-

tion theory, but with an important distinction: the distortion measure does not need to be defined in advance,

μ. Moreover, it leads to a tractable mathematical

but rather naturally emerges from the joint statistics, ́

analysis which provides a formal characterization of the optimal solution to this problem. As an immedi-

ate implication, the IB method formulates a well defined information-theoretic framework for unsupervised

clustering problems, which is the main focus of this thesis. Nonetheless, it is important to keep in mind

that the same underlying principle of a trade-off between information terms may have further implications

in other related fields, as recently suggested in [37].

Introduction

In this chapter we provide a basic introduction to the remaining chapters. In the first section we present high

level descriptions of the fundamental trade-off between precision and complexity. One important variant of

this trade-off is formulated as the problem of unsupervised clustering, which is the main problem we address

in this thesis. In the next section we present the necessary preliminaries for our analysis. We conclude this

chapter by presenting a simple example in order to elucidate the central ideas that will be discussed later on.

Preliminaries

In this section we introduce the basic concepts required for the next chapters. We start with some nota-

divergence and

tions, and further state the definitions of entropy, mutual and multi information,

divergence. Most of this section is based on Chapter 3⁄4 in [20], Chapter ¿ in [5] (which provides a friendly

introduction to the concept of entropy), and a work in progress by Nemenman and Tishby [55], which

introduces a new axiomatic derivative of mutual and multi-information.

Entropy and related concepts

Consider the following situation. We are given a finite collection of documents, denoted by

A person chooses to read a single document out of this collection, and our task is to guess which document

was chosen. Without any prior knowledge, all guesses are equally likely. We now further assume that we

3⁄4 ,

have access to a definite set of (exhaustive and mutually exclusive) probabilities, denoted by ́ μ

for all the possible choices. For example, let us assume that longer documents are more probable than shorter

ones. More specifically, that the probability of choosing each document is proportional to the (known) num-

ber of words that occur in it. If all the documents consist of exactly the same number of words, ́ μ is

uniform and obviously we are back at the starting point where no guess is preferable. However, if one docu-

ment is much longer than all the others, ́ μ will have a clear peak for this document, hence our chances of

providing the correct answer will improve. How can we quantify the difference between these two scenarios

in a well defined way?

Mutual information and multi-information

Let us reconsider our previous example of trying to guess what document was chosen. However, we now

assume that we have access not only to the prior distribution ́ μ, but rather to a joint distribution of

with some other random variable, . For concreteness, if values correspond to all the possible document

values correspond to all the distinct words occurring in this document

identities, let us assume that

μ

collection. Thus, more formally stated, we assume that we have access to the joint distribution ́

which indicates the probability that a random word position in the corpus is equal to 3⁄4

while the

document identity is 3⁄4 . 3