Saturday, December 13 Morning session

NIPS 2003 Workshop onInformation Theory and Learning:

The Bottleneck and Distortion Approach

Organizers:

Thomas Gedeon Naftali Tishby

http://www.cs.huji.ac.il/~tishby/NIPS-Workshop

Saturday, December 13Morning session

7:30-8:20 N. Tishby Introduction – A new look at Shannon’s Information Theory 8:20-9:00 T. Gedeon Mathematical structure of Information Distortion methods 9:00-9:30 D. Miller Deterministic annealing9:30-10:00 N. Slonim Multivariate Information Bottleneck10:00-10:30 B. Mumey Optimal Mutual Information Quantization is NP-complete

Afternoon session16:00-16:20 G. Elidan The Information Bottleneck EM algorithm16:20-16:40 A. Globerson Sufficient Dimensionality Reduction with Side Information16:40-17:00 A. Parker Phase Transitions in the Information Distortion 17:00-17:20 A. Dimitrov Information Distortion as a model of sensory processing break17:40-18:00 J. Sinkkonen IB-type clustering for continuous finite data18:00-18:20 G. Chechik GIB: Gaussian Information Bottleneck18:20-18:40 Y. Crammer IB and data-representation – Bregman to the rescue 18:40-19:00 B. Westover Achievable rates for Pattern Recognition

http://www.cs.huji.ac.il/labs/learning/Theses/Noam_phd1.ps.gz

http://www.cs.huji.ac.il/~galel/Papers/ibem.ps

http://www.cs.huji.ac.il/labs/learning/Papers/uai03.ps.gz

http://www.cs.huji.ac.il/~tishby/NIPS-Workshop/Sinkkonen.txt

http://www.cs.huji.ac.il/labs/learning/Papers/GIB.ps.gz

http://www.cs.huji.ac.il/~tishby/NIPS-Workshop/abstract3.ps

http://www.cs.huji.ac.il/~tishby/NIPS-Workshop/Westover.txt

Outline:Outline: The fundamental dilemmaThe fundamental dilemma: : Simplicity/Complexity Simplicity/Complexity versusversus Accuracy Accuracy

Lessons from Lessons from Statistical PhysicsStatistical Physics for Machine Learning for Machine Learning Is there a Is there a “right level” “right level” of description?of description?

A Variational A Variational PrinciplePrinciple Shannon’s source and channel Shannon’s source and channel codingcoding – – dual problemsdual problems [Mutual] [Mutual] InformationInformation as the as the Fundamental QuantityFundamental Quantity Information BottleneckInformation Bottleneck (IB) (IB) andand Efficient Efficient RepresentationsRepresentations Finite Data IssuesFinite Data Issues

Algorithms, Applications and ExtensionsAlgorithms, Applications and Extensions Words, documents and meaning…Words, documents and meaning… Understanding neural codes …Understanding neural codes … Quantization of side-information Quantization of side-information Relevant linear dimension reduction: Gaussian IB Relevant linear dimension reduction: Gaussian IB Using “irrelevant” side informationUsing “irrelevant” side information Multivariate-IB and Graphical ModelsMultivariate-IB and Graphical Models Sufficient Dimensionality Reduction (SDR)Sufficient Dimensionality Reduction (SDR)

The importance of being principled…The importance of being principled…

Learning Theory:The fundamental dilemma…

XX

YY

y=f(x)y=f(x)

Good modelsGood models should enable should enable Prediction Prediction of new data…of new data…

Tradeoff between Tradeoff between accuracy and accuracy and simplicitysimplicity

Or in unsupervised learning …

XX

YY

20 25 30 35 40 4510

20

30

40

50

60

70

Cluster Cluster models?models?

Lessons from Statistical Physics Physical systems with many degrees of freedom:Physical systems with many degrees of freedom: What is the right level of description?What is the right level of description?

“ “Microscopic Variables” – Microscopic Variables” – dynamical variablesdynamical variables ( (positions, positions,

momenta,…momenta,…)) “ “Variational Functions” – Variational Functions” – Thermodynamic Thermodynamic potentialspotentials

(Energy, Entropy, Free-(Energy, Entropy, Free-energy,…)energy,…)

Competition Competition between between order order ((energyenergy)) and and disorder disorder (entropy)(entropy) ……

Emergent “Emergent “relevant” relevant” descriptiondescription

Order Order ParametersParameters

(magnetization,…)(magnetization,…)

Statistical Models in Learning and AI

Statistical models of complex systems:Statistical models of complex systems: What is the right level of description?What is the right level of description?

“ “Microscopic Variables” – Microscopic Variables” – probability probability distributionsdistributions ( (over all over all observed and hidden variables)observed and hidden variables) “ “Variational Functions” – Variational Functions” – Log-likelihood, Log-likelihood, Entropy,… ?Entropy,… ? (Multi-(Multi-Information, “Free-energy”,… )Information, “Free-energy”,… ) Competition between Competition between accuracy accuracy and and complexitycomplexity Emerged “Emerged “relevant” relevant”

descriptiondescriptionFeaturesFeatures

(clusters, sufficient statistics,(clusters, sufficient statistics,…)…)

The Fundamental Dilemma (of science):

Model Complexity vs Prediction Accuracy

Complexity

Accu

racy

Possible Models/representations

Efficiency Efficiency

Surv

ivab

ilitSu

rviv

abilit

yy

Limited dataLimited data

Bounded Bounded ComputationComputation

Can we quantify it…?

When there is a (relevant) prediction or distortion measure

Accuracy small average distortion (good predictions)

Complexity short description (high compression)

A general tradeoff between distortion and compression:

Shannon’s Information Theory

Shannon’s Information Theory

SentSentmessagemessagess

sourcsourcee

receivreceiverer

ReceivedReceivedmessagesmessages

channechannell

symbolssymbols

01101010011100110101001110……

encodeencoderr

decodedecoderr

SourceSourcecodingcoding

ChannelChannelcodingcoding

SourceSourcedecodingdecoding

ChannelChanneldecodingdecoding

Error Error CorrectionCorrectionChannel CapacityChannel Capacity

CompressioCompressionnSource EntropySource Entropy

DecompressiDecompressionon

Rate vs DistortionRate vs Distortion Capacity vs EfficiencyCapacity vs Efficiency

We need to index the max number of non-overlapping green blobs inside the blue blob:

(mutual information!)

X X̂)x|x̂(p

)ˆ|(2 XXnH

)(2 XnH

)ˆ,()ˆ|()( 22/2 XXnIXXnHXnH

Compression and Mutual Compression and Mutual InformationInformation

Axioms for “(multi) information”(w I Nemenman)

M1: For two nodes, X and Y, the shared information, I[X;Y], is a continuous function of p(x,y).

M2: when one of the variables defines the other uniquely (1-1 mapping from X to Y) and both p(x)=p(y)=1/K , then I[X;Y] is a monotonic function of K.

M3: If X is partitioned into two subsets, X1 and X2, then the amount of shared information is additive in the following sense: I[X;Y]=I[(X1,X2);Y]+p(x1)I[X;Y|X1]+p(x2)I[X;Y|X2]

M4: Shared information is symmetric: I[X;Y]=I[Y;X]. M5:

Theorem:

1 2 1 1 2 1 2 1[ ; ;...; ] [ ; ;...; ] [( , ,..., ); ]N N N NI X X X I X X X I X X X X

1 2

1 21 2 1 2

, ,..., 1 2

( , ,..., )[ ; ;...; ] ( , ,..., ) log( ) ( ) ( )

N

NN N

x x x N

p x x xI X X X p x x xp x p x p x

The Dual Variational problems of IT:The Dual Variational problems of IT:RateRate vs vs Distortion Distortion and and CapacityCapacity vs vs

EfficiencyEfficiency Q: What is the simplest representation – fewer bits(/sec) (Rate) for a given expected distortion (accuracy)?

A: (RDT, Shannon 1960) solve:

Q: What is the maximum number of bits(/sec) that can be sent reliably (prediction rate) for a given efficiency of the channel (power, cost)?

A: (Capacity-power tradeoff, Shannon 1956) solve:

)ˆ,(min)( XXIDRDd

),ˆ(max)( YXIECEe

Rate-DistortionRate-Distortion theory theory The tradeoff between expected representation size (Rate) and the expected distortion is expressed by the rate-distortion function:

By introducing a Lagrange multiplier, , we have a variational principle:

with:

)ˆ,(min)( XXIDRDd

)ˆ,()ˆ,()ˆ,()]|ˆ([

xxpxxdXXIxxpL

)ˆ,()|ˆ()()ˆ,(ˆ,

)ˆ,(xxdxxpxpxxd

xxxxp

The Rate-Distortion function, R(D), is the optimal rate for a given distortion and is a convex function.

R(D)

D

Achievable region

impossible

DDR )(

-bits/-bits/accuracyaccuracy

00

The Capacity – Efficiency TradeoffThe Capacity – Efficiency Tradeoff

AchievableAchievable RegionRegion

E (Efficiency (power))E (Efficiency (power))

C

(E)

C(E)

Ca

pacit

y at

Ca

pacit

y at

EEbits/ergbits/erg

Double matching of source to channel- eliminates the need for

coding (?!)21( ) log

2R D

D

2

1( ) log 12

EC E

R(D)R(D)

DD

C(E)C(E)

EE

In the case of a Gaussian channel In the case of a Gaussian channel (w. power constraint)(w. power constraint) and and square distortion, double matching means:square distortion, double matching means:

221 ED

And No coding!And No coding!

““There isThere is a curious and provocative duality between the a curious and provocative duality between the properties of a source with a distortion measure and those of a properties of a source with a distortion measure and those of a channel … if we consider channels in which there is a “cost” channel … if we consider channels in which there is a “cost” associated with the different input letters…associated with the different input letters…

The solution to this problem amounts, mathematically, to The solution to this problem amounts, mathematically, to maximizingmaximizing a mutual information under linear inequality a mutual information under linear inequality constraint… which leads to a capacity-cost function C(E) for the constraint… which leads to a capacity-cost function C(E) for the channel…channel…

In a somewhat dual way, evaluating the rate-distortion function In a somewhat dual way, evaluating the rate-distortion function R(D) for source amounts, mathematically, to R(D) for source amounts, mathematically, to minimizingminimizing a a mutual Information … again with a linear inequality constraint.mutual Information … again with a linear inequality constraint.

This duality can be pursued further and is related to the duality This duality can be pursued further and is related to the duality between past and future and the notions of control and between past and future and the notions of control and knowledge. Thus we may have knowledge of the past but knowledge. Thus we may have knowledge of the past but cannot control it; we may control the future but have no cannot control it; we may control the future but have no knowledge of it.”knowledge of it.”

C. E. Shannon (1959)C. E. Shannon (1959)

Bottlenecks and Neural Nets Auto association: forcing compact representations

provide a “relevant code” of w.r.t.

X Y

X̂

Input Output

Sample 1 Sample 2

Past Future

X̂ X Y

Bottlenecks in Nature…(w E Schniedman, Rob de Ruyter, W Bialek, N Brenner)

The idea: find a compressed variable that enables short encoding of ( small

)while preserving as much as possible the

information on the relevant signal ( )

X X̂)x|x̂(p Y)x̂|y(p

)x̂(p)X̂,X(I )Y,X̂(I

)Y,X(I

X̂)X̂,X(I

)Y,X̂(I

X

Y

A A Variational PrincipleVariational PrincipleWe want a short representation of X that keeps the information about another variable, Y, if possible.

X

YX YXI

ˆ

),(

)ˆ,( XXI

),ˆ( YXI

),ˆ()ˆ,()|ˆ( YXIXXIxxpL

Self Consistent EquationsSelf Consistent Equations Marginal:

Markov condition:

Bayes’ rule:

x

xpxxpxp )()|ˆ()ˆ(

x

xxpxypxyp )ˆ|()|()ˆ|(

)|ˆ()ˆ()()ˆ|( xxpxpxpxxp

0)|ˆ()]|ˆ([

xxpxxpL

)ˆ,(exp

),()ˆ()|ˆ( xxD

xZxpxxp KL

The emerged effective distortioneffective distortion measure:

y

KLKL

xypxypxyp

xypxypDxxD

)ˆ|()|(log)|(

)ˆ|(|)|(ˆ,

• Regular if is absolutely continuous w.r.t.• Small if predicts y as well as x:

)ˆ|( xyp )|( xypx̂

yx

yx

xyp

xxp

xyp

)ˆ|(

)|ˆ(

)|(

ˆ

I-Projections on a set of distributions (Csiszar 75,84)

• The I-projection of a distribution q(x) on a convex set of distributions L:

qpDLqIPR KLLp |~minarg),( ~

),( pfL

)(xq

)(~ xp)(* xp

The The Blahut-Arimoto Blahut-Arimoto AlgorithmAlgorithm

)ˆ(xp )|ˆ( xxp

dXXIxZxxpxpxxpxp

)ˆ,(min),(logminmin)|ˆ(),ˆ()|ˆ()ˆ(

)|ˆ()()ˆ(

)ˆ,(exp),()ˆ()|ˆ(1

xxpxpxp

xxdxZxpxxp

tx

t

t

tt

The Information BottleneckInformation Bottleneck Algorithm

)ˆ,()ˆ,(min

),(logminminmin

)|ˆ(),ˆ(),ˆ|(

)|ˆ()ˆ()ˆ|(

xxDXXI

xZ

KLxxpxpxyp

xxpxpxyp

xtt

tx

t

KLt

t

tt

xxpxypxyp

xxpxpxp

xxDxZxpxxp

)ˆ|()|()ˆ|(

)|ˆ()()ˆ(

)ˆ,(exp),()ˆ()|ˆ(1

“free energy”

The IB emergent effective distortion measure:

)ˆ|(|)|(ˆ, xypxypDxxD KLKL

)ˆ(xp )|ˆ( xxp

)ˆ|( xypThe IBalgorithm

Key TheoremsKey TheoremsIB Coding Theorem (Wyner 1975, Bachrach, Navot, Tishby 2003):The IB variational problem provides a tight convex lower bound on the expected size of the achievable representations of X, that maintain at least of mutual information on the variable Y. The bound can be achieved by: .

Formally equivalent to “Channel coding with Side-Information” (introduced in a very different context by Wyner, 75).

The same bound is true for the dual channel coding problem and the optimal representation constitutes a “perfectly matched source-channel” and requires “no further coding”.

IB Algorithm Convergence (Tishby 99, Friedman, Slonim,Tishby 01):

The Generalized Arimoto-Blauht IB algorithm converges to a local minima of the IB functional, (including its generalized multivariate case).

),ˆ( YXI,X̂

2|||ˆ| XX

Source (p(x))Channel: (q(y|x))

Entropy: Capacity:

Rate-Distortion function:Capacity-Expense function:

Constrained Reliability-Rate functionConstrained Reliability-Rate function

Information Bottleneck (source)Information Bottleneck (channel)

( )

( ) max ( , )pe p E

C E I p q

max ( , )pC I p q

ˆ( , )

ˆ( , ( , )) max ( ; )X

Y qI p q I

L I p x y I Y X

min ( , )qH I p q

( ) min ( , )qd D

R D I p q

ˆ

ˆˆ( , )

ˆ( , ) max min ( | )p

q p KLI p q R d D

F R D D p p

ˆˆ( , )

ˆ[ , ( , )] min ( ; )Y

X pI p q I

L I p x y I X X

ˆˆ( , ) ;

ˆ( , ) max min ( | )p

p q KLI p q R e E

F R E D q q

Summary

-The IB method turns this unified coding principle into algorithms for extracting “informative” relevant structures from empirical joint probability distributions P(X1,X2)…

-The Multivariate IB extends this framework to extract “informative” structures from multivariate distributions, P(X1,…,Xn), via Min Max I(X1,…,Xn), and Graphical models- IB can be further extended for Sufficient Dimensionality Reduction a method for finding informative continuous low dimensional representations via Min Max I(X1,X2)

- Shannon’s Information Theory suggests a compelling framework for quantifying the “Complexity-Accuracy” dilemma, by unifying source and channel coding as a Min Max of mutual information.

Many thanks to:

Fernando PereiraWilliam BialekNir Friedman

Noam SlonimGal Chechik

Amir GlobersonRan Bachrach

Amir NavotIlya Nemenman

Documents

Saturday, December 13 Morning session