

Celebrations for three influential scholars in Information Theory and Statistics:

Tom Cover: On the occasion of his 70th birthday
Coverfest.stanford.edu
Elements of Information Theory Workshop
Stanford University, May 16, 2008: TODAY!

Imre Csiszár: On the occasion of his 70th birthday
www.renyi.hu/~infocom
Information and Communication Conference
Renyi Institute, Budapest, August 25-28, 2008

Jorma Rissanen: On the occasion of his 75th birthday
Festschrift at www.cs.tut.fi/~tabus/
presented at the IEEE Information Theory Workshop
Porto, Portugal, May 8, 2008


Principles of Information Theory in Probability and Statistics

Andrew Barron
Department of Statistics
Yale University

May 16, 2008
Elements of Information Theory Workshop

On the Occasion of the Festival for Tom Cover


Outline of Principles

1 Monotonicity of Information Divergence

2 Information-Stability, Error Probability, Info-Projection

3 Shannon Capacity Determines Limits of Statistical Accuracy

4 Simplest is Best

5 Summary



Monotonicity of Information Divergence

Chain Rule

D(P_{X,X'} \| P^*_{X,X'}) = D(P_X \| P^*_X) + E\, D(P_{X'|X} \| P^*_{X'|X})
                          = D(P_{X'} \| P^*_{X'}) + E\, D(P_{X|X'} \| P^*_{X|X'})

Markov Chains

D(P_{X_n} \| P^*) \le D(P_{X_m} \| P^*) for n > m

and \log p_n(X_n)/p^*(X_n) is a Cauchy sequence in L_1(P)
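A numerical illustration of this monotonicity (added; the two-state transition matrix and the starting distribution are arbitrary choices for the demo), with P^* taken to be the stationary distribution:

import numpy as np

# Minimal sketch: divergence to the stationary distribution is non-increasing
# along a Markov chain (data-processing applied at each step).
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])          # hypothetical transition matrix (rows sum to 1)
p = np.array([1.0, 0.0])            # arbitrary initial distribution P_{X_0}

# stationary distribution P*: left eigenvector of T with eigenvalue 1
evals, evecs = np.linalg.eig(T.T)
pstar = np.real(evecs[:, np.argmax(np.real(evals))])
pstar = pstar / pstar.sum()

def kl(p, q):
    """D(p||q) in nats, with the convention 0 log 0 = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

for n in range(10):
    print(n, kl(p, pstar))          # values decrease monotonically in n
    p = p @ T                       # P_{X_{n+1}} = P_{X_n} T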


Monotonicity of Information Divergence

Nonnegative martingales ρ_n equal the density of a measure Q_n and can be examined in the same way by the chain rule: for n > m,

D(Q_n \| P) = D(Q_m \| P) + \int \rho_n \log \frac{\rho_n}{\rho_m}\, dP

Thus D(Q_n \| P) is an increasing sequence. When it is bounded, ρ_n is a Cauchy sequence in L_1(P) with limit ρ defining a measure Q; also, \log ρ_n is a Cauchy sequence in L_1(Q) and

D(Q_n \| P) \nearrow D(Q \| P)


Monotonicity of Information Divergence: CLT

Central Limit Theorem Setting:

\{X_i\} i.i.d., mean zero, finite variance
P_n = P_{Y_n} is the distribution of Y_n = (X_1 + X_2 + \cdots + X_n)/\sqrt{n}

P^* is the corresponding normal distribution

Chain Rule: Action is mysterious in this case

D(P_{Y_m,Y_n} \| P^*_{Y_m,Y_n}) = D(P_{Y_m} \| P^*) + E\, D(P_{Y_n|Y_m} \| P^*_{Y_n|Y_m})
                                = D(P_{Y_n} \| P^*) + E\, D(P_{Y_m|Y_n} \| P^*_{Y_m|Y_n})


Monotonicity of Information Divergence: CLT

Entropy Power Inequality

e^{2H(X+X')} \ge e^{2H(X)} + e^{2H(X')}

yields D(P_{2n} \| P^*) \le D(P_n \| P^*)

Information Theoretic proof of CLT (B. 1986):

D(P_n \| P^*) \to 0 iff finite
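For completeness, here is the one-line calculation behind the EPI step above (added; it assumes the X_i have variance σ² and P^* = N(0, σ²), so that D(P_n \| P^*) = h(P^*) - h(Y_n)). Writing Y_{2n} = (Y_n + Y_n')/\sqrt{2} with Y_n, Y_n' independent copies of the normalized sum:

h(Y_{2n}) = h(Y_n + Y_n') - \tfrac{1}{2}\log 2
          \ge \tfrac{1}{2}\log\!\big(e^{2h(Y_n)} + e^{2h(Y_n')}\big) - \tfrac{1}{2}\log 2
          = h(Y_n),

so the entropy increases along doubling and D(P_{2n} \| P^*) = h(P^*) - h(Y_{2n}) \le h(P^*) - h(Y_n) = D(P_n \| P^*).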


Monotonicity of Information Divergence: CLT

Entropy Power Inequality

e^{2H(X+X')} \ge e^{2H(X)} + e^{2H(X')}

Generalized Entropy Power Inequality (Madiman & B. 2006)

e^{2H(X_1 + \cdots + X_n)} \ge \frac{1}{r(S)} \sum_{s \in S} e^{2H(\sum_{i \in s} X_i)}

where r(S) is the maximum number of subsets in the collection S containing any one index.
Proof: simple L_2 projection properties of the entropy derivative. Consequence: for all n > m,

D(P_n \| P^*) \le D(P_m \| P^*)

B. and Madiman 2006, Tulino and Verdú 2006. Earlier, more elaborate proof by Artstein, Ball, Barthe, Naor 2004.


AEP and behavior of optimal tests

Stability of log-likelihood ratios:

\frac{1}{n} \log \frac{p(Y_1, Y_2, \ldots, Y_n)}{q(Y_1, Y_2, \ldots, Y_n)} \to D(P\|Q) with P-probability 1

where D(P\|Q) is the relative entropy (I-divergence) rate.
Implication: the associated log-likelihood ratio test A_n has asymptotic P-power 1 (at most finitely many mistakes: P(A_n^c \text{ i.o.}) = 0) and has optimal Q-probability of error

Q(A_n) = \exp\{-n[D + o(1)]\}

Most general known form of the Chernoff-Stein Lemma. Intrinsic role for the information divergence rate

D(P\|Q) = \lim \frac{1}{n}\, D(P_{Y^n} \| Q_{Y^n})
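A numerical check of the exponent (added, with hypothetical parameters) in the simplest i.i.d. case, using the fixed-level form of the Chernoff-Stein lemma: P = Bernoulli(0.6), Q = Bernoulli(0.4), and A_n the likelihood-ratio acceptance region holding the P-error below α:

import numpy as np
from scipy.stats import binom
from scipy.special import logsumexp

# Minimal sketch: Chernoff-Stein exponent for testing P = Bernoulli(0.6)
# against Q = Bernoulli(0.4), with the P-error held below alpha.
p, q, alpha = 0.6, 0.4, 0.1
D = p*np.log(p/q) + (1-p)*np.log((1-p)/(1-q))   # D(P||Q), about 0.081 nats

for n in [100, 1000, 10000]:
    k = np.arange(n + 1)
    kstar = binom.ppf(alpha, n, p)              # threshold keeping P-error below alpha
    A = k >= kstar                              # acceptance region A_n for hypothesis P
    logQ = logsumexp(binom.logpmf(k[A], n, q))  # log Q(A_n), computed stably
    print(n, -logQ / n)                         # exponent approaches D(P||Q) from below

print("D(P||Q) =", D)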


Large Deviations for the Empirical Distribution and Conditional Limit

P^*: information projection of Q onto a convex set C
Pythagorean Identity (Csiszár 75, Topsøe 79): for P in C

D(P\|Q) \ge D(C\|Q) + D(P\|P^*)

where D(C\|Q) = \inf_{P \in C} D(P\|Q)

Empirical distribution P_n: if D(\mathrm{interior}\, C \,\|\, Q) = D(C\|Q) then

Q\{P_n \in C\} = \exp\{-n[D(C\|Q) + o(1)]\}

and the conditional distribution P_{Y_1,Y_2,\ldots,Y_n \mid \{P_n \in C\}} converges to P^*_{Y_1,Y_2,\ldots,Y_n} in the I-divergence rate sense (Csiszár 1985)
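A concrete instance (an added sketch, not from the slides): Q uniform on a die and C = {P : E_P[X] ≥ 4.5}. The I-projection is an exponentially tilted Q, and the Pythagorean inequality can be checked numerically:

import numpy as np
from scipy.optimize import brentq

# Sketch: information projection of Q = uniform die onto C = {P : E_P[X] >= 4.5}.
# The projection has the exponential-tilting form P*(x) proportional to Q(x) exp(t x).
x = np.arange(1, 7)
Q = np.full(6, 1/6)

def tilted(t):
    w = Q * np.exp(t * x)
    return w / w.sum()

t_star = brentq(lambda t: tilted(t) @ x - 4.5, 0.0, 10.0)   # tilt making the constraint active
P_star = tilted(t_star)

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print("P* =", np.round(P_star, 4))
print("D(C||Q) =", kl(P_star, Q))
# Pythagorean inequality for one P in C (uniform on {4,5,6}):
P = np.array([0, 0, 0, 1/3, 1/3, 1/3])
print(kl(P, Q), ">=", kl(P_star, Q) + kl(P, P_star))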


Information Capacity

A channel θ → Y is a family of probability distributions

\{P_{Y|\theta} : \theta \in \Theta\}

Information Capacity

C = \max_{P_\theta} I(\theta; Y)
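For a finite channel this maximum is computed by the Blahut-Arimoto alternating optimization; a minimal sketch follows (the channel matrix is a made-up example, not from the talk):

import numpy as np

# Minimal Blahut-Arimoto sketch for C = max_{P_theta} I(theta; Y)
# over a finite channel W[theta, y] = P(Y = y | theta).
W = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.2, 0.7]])
p = np.full(W.shape[0], 1/W.shape[0])     # start from the uniform input distribution

for _ in range(200):
    q = p @ W                             # output distribution Q_Y
    # exponentiated information density per input: exp( D(W_theta || q) )
    d = np.exp(np.sum(W * np.log(W / q), axis=1))
    p = p * d / np.sum(p * d)             # multiplicative update of P_theta

C = np.sum(p * np.sum(W * np.log(W / (p @ W)), axis=1))
print("capacity (nats):", C)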


Communications Capacity

C_{com} = maximum rate of reliable communication, measured in message bits per use of the channel

Shannon Channel Capacity Theorem (Shannon 1948)

C_{com} = C


Data Compression Capacity

Minimax Redundancy

Red = \min_{Q_Y} \max_{\theta \in \Theta} D(P_{Y|\theta} \| Q_Y)

Data Compression Capacity Theorem

Red = C

(Gallager, Davisson & Leon-Garcia, Ryabko)


Statistical Risk Setting

Loss function \ell(\theta, \theta')

Examples:

Kullback loss

\ell(\theta, \theta') = D(P_{Y|\theta} \| P_{Y|\theta'})

Squared metric loss, e.g. squared Hellinger loss:

\ell(\theta, \theta') = d^2(\theta, \theta')


Statistical Capacity

Estimators: θ̂_n

Based on sample Y of size n

Minimax Risk (Wald):

r_n = \min_{\hat\theta_n} \max_\theta E\, \ell(\theta, \hat\theta_n)


Ingredients for determination of Statistical Capacity

Kolmogorov Metric Entropy of S ⊂ Θ:

H(\varepsilon) = \max\{\log \mathrm{Card}(\Theta_\varepsilon) : d(\theta, \theta') > \varepsilon \text{ for } \theta, \theta' \in \Theta_\varepsilon \subset S\}

Loss Assumption, for θ, θ' ∈ S:

\ell(\theta, \theta') \sim D(P_{Y|\theta} \| P_{Y|\theta'}) \sim d^2(\theta, \theta')


Statistical Capacity Theorem

For infinite-dimensional Θ

With metric entropy evaluated at a critical separation ε_n

Statistical Capacity Theorem: Minimax Risk ∼ Info Capacity Rate ∼ Metric Entropy Rate

r_n \sim \frac{C_n}{n} \sim \frac{H(\varepsilon_n)}{n} \sim \varepsilon_n^2

Yang 1997, Yang and B. 1999, Haussler and Opper 1997
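A standard worked instance of this calibration (added for orientation; the exponents are the classical ones, not numbers from the talk): for s-smooth functions or densities on [0,1]^d the metric entropy is H(ε) ≍ ε^{-d/s}, so the critical separation solves

\varepsilon_n^2 \asymp \frac{H(\varepsilon_n)}{n} \asymp \frac{\varepsilon_n^{-d/s}}{n}
\;\Longrightarrow\;
\varepsilon_n \asymp n^{-s/(2s+d)},
\qquad
r_n \asymp \varepsilon_n^2 \asymp n^{-2s/(2s+d)},

recovering the familiar minimax rate for such classes.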


Shannon Codes

Kraft-McMillan characterization: uniquely decodable codelengths

L(x), x \in \mathcal{X}, \qquad \sum_x 2^{-L(x)} \le 1

L(x) = \log 1/p(x), \qquad p(x) = 2^{-L(x)}

Operational meaning of probability:

A probability distribution p is given by a choice of code
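A tiny illustration of this correspondence (with a made-up dyadic distribution): Shannon codelengths L(x) = ceil(log2 1/p(x)) satisfy the Kraft inequality, and any Kraft-summable lengths define an operational (sub)probability 2^{-L(x)}.

import math

# Sketch: Shannon codelengths from a hypothetical distribution, and the Kraft check.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

L = {x: math.ceil(-math.log2(px)) for x, px in p.items()}   # L(x) = ceil(log 1/p(x))
kraft = sum(2.0 ** -l for l in L.values())

print("codelengths:", L)            # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print("Kraft sum:", kraft, "<= 1")  # equals 1.0 here, since the probabilities are dyadic

# Going the other way: lengths satisfying the Kraft inequality define
# an operational (sub)probability q(x) = 2^{-L(x)}.
q = {x: 2.0 ** -l for x, l in L.items()}
print("implied distribution:", q)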


Complexity

Kolmogorov Idealized Compression: K(Y) = length of the shortest computer code for Y on a given universal computer

Shannon Idealized Codelength (expectation optimal):

\log 1/p^*(Y)

But p^* is not generally known.
Hybrid: statistical measure of the complexity of Y

\min_p \big[\log 1/p(Y) + L(p)\big]

Valid for any L(p) satisfying Kraft summability.


Complexity

Minimum Description Length Principle (MDL)
Special cases: Rissanen 1978, 1983, ...
Two-stage codelength formulation (Cover scratch pad 1981, B. 1985, B. and Cover 1991):

L(Y) = \min_p \big[\log 1/p(Y) + L(p)\big] = bits for Y given p + bits for p

The corresponding statistical estimator p̂ achieves the above minimization in a family of distributions for a specified L(p)
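A toy version of the two-stage code (a sketch only; the Bernoulli model class, the k-bit parameter grids, and the crude codelength L(k, p) ≈ k + 2 log2 k + 1 bits are hypothetical choices, not the talk's): the estimate p̂ is whatever minimizes the total description length.

import numpy as np

# Two-stage MDL sketch for binary data: describe a Bernoulli parameter to k bits
# (L(p) ~ k plus a few bits for k), then describe the data with that p.
rng = np.random.default_rng(0)
Y = rng.random(200) < 0.3                    # hypothetical data, true parameter 0.3

def data_bits(Y, p):
    n1 = int(Y.sum()); n0 = len(Y) - n1
    return -(n1*np.log2(p) + n0*np.log2(1-p))

best = None
for k in range(1, 12):                       # precision of the parameter grid
    grid = (np.arange(2**k) + 0.5) / 2**k    # 2^k candidate parameter values
    for p in grid:
        L_p = k + 2*np.log2(k) + 1           # crude Kraft-summable codelength for (k, p)
        total = data_bits(Y, p) + L_p
        if best is None or total < best[0]:
            best = (total, p, k)

print("selected p-hat = %.3f with k = %d bits, total %.1f bits" % (best[1], best[2], best[0]))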


MDL Analysis

Redundancy of Two-stage Code:

Red = \frac{1}{n}\, E\Big\{ \min_p \big[\log \tfrac{1}{p(Y)} + L(p)\big] - \log \tfrac{1}{p^*(Y)} \Big\}

bounded by the Index of Resolvability:

Res_n(p^*) = \min_p \Big\{ D(p^*\|p) + \frac{L(p)}{n} \Big\}

Statistical Risk Analysis in the i.i.d. case:

E\, d^2(p^*, \hat{p}) \le \min_p \Big\{ D(p^*\|p) + \frac{L(p)}{n} \Big\}

B. 1985, B. & Cover 1991, B., Rissanen, Yu 1998, Li 1999, Grünwald 2007


MDL Analysis

Statistical Risk Analysis:

E\, d^2(p^*, \hat{p}) \le \min_p \{ D(p^*\|p) + L(p)/n \}

Special Cases:
Traditional parametric: L(θ) = (dim/2) log n + C
Nonparametric: L(p) = metric entropy (log cardinality of an optimal net)
Idealized: L(p) = Kolmogorov complexity
Adaptation: achieves minimax optimal rates simultaneously in every computable subfamily of distributions
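For orientation (a standard calculation, not copied from the slides), in the parametric case a grid of spacing 1/\sqrt{n} in each of the dim coordinates leaves D(p^*\|p) = O(1/n) at the nearest grid point while costing L(p) ≈ (dim/2) log n bits, so

\min_p \Big\{ D(p^*\|p) + \frac{L(p)}{n} \Big\}
\;=\; O\!\Big(\frac{1}{n}\Big) + \frac{\dim}{2}\,\frac{\log n}{n}
\;=\; O\!\Big(\frac{\dim\,\log n}{n}\Big),

which is the classical MDL redundancy and risk rate.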


MDL Analysis: Key to risk consideration

Discrepancy between training sample and future

Disc(p) = \log \frac{p^*(Y)}{p(Y)} - \log \frac{p^*(Y')}{p(Y')}

The future term may be replaced by its population counterpart.
Discrepancy control: if L(p) satisfies the Kraft sum, then

E\Big[ \inf_p \{ Disc(p) + 2L(p) \} \Big] \ge 0

from which the risk bound follows: Risk ≤ Redundancy ≤ Resolvability

E\, d^2(p^*, \hat{p}) \le Red \le Res_n(p^*)


Statistically valid penalized likelihood

New result (B., Li, Huang, Luo 2008, Festschrift for Jorma Rissanen).
Penalized likelihood:

\hat{p} = \arg\min_p \Big\{ \frac{1}{n} \log \frac{1}{p(Y)} + pen_n(p) \Big\}

Penalty condition:

pen_n(p) \ge \frac{1}{n} \min_{\tilde{p}} \{ 2L(\tilde{p}) + \Delta_n(p, \tilde{p}) \}

where the distortion \Delta_n(p, \tilde{p}) is the difference in discrepancies at p and a representer \tilde{p}.
Risk conclusion:

E\, d^2(p^*, \hat{p}) \le \inf_p \{ D(p^*\|p) + pen_n(p) \}


Example: ℓ1 penalties on coefficients in generalized linear models

G is a dictionary of candidate basis functions: wavelets, splines, polynomials, trigonometric terms, sigmoids, explanatory variables and their interactions

Candidate functions in the linear span

f(x) = f_\theta(x) = \sum_{g \in G} \theta_g\, g(x)

ℓ1 norm of the coefficients

\|\theta\|_1 = \sum_g |\theta_g|


Example Models

p = p_f specified through a family of functions f

Regression:
p_f(y|x) = Normal(f(x), σ^2)

Logistic regression with y ∈ {0, 1}:
p_f(y = 1|x) = Logistic(f(x))

Log-density estimation:
p_f(x) = \frac{p_0(x) \exp\{f(x)\}}{c_f}


ℓ1 Penalty

pen_n(f_\theta) = \lambda_n \|\theta\|_1 where f_\theta(x) = \sum_{g \in G} \theta_g\, g(x)

Popular penalty: Chen & Donoho (96) Basis Pursuit; Tibshirani (96) LASSO; Efron et al. (04) LARS.
Precursors: Jones (92), B. (90, 93, 94) greedy algorithm and analysis of a combined ℓ1 and ℓ0 penalty.

Risk analysis: specify a valid λ_n for which risk ≤ resolvability.
Computation analysis: bound the accuracy of a new ℓ1-penalized greedy pursuit algorithm.


ℓ1 penalty is valid for λ_n of order 1/\sqrt{n}

Example: ℓ1-penalized log-density estimation, i.i.d. case

\hat\theta = \arg\min_\theta \Big\{ \frac{1}{n} \log \frac{1}{p_{f_\theta}(Y)} + \lambda_n \|\theta\|_1 \Big\}

Risk bound:

E\, d(f^*, f_{\hat\theta}) \le \inf_\theta \big\{ D(f^*\|f_\theta) + \lambda_n \|\theta\|_1 \big\}

Valid for \lambda_n \ge B\sqrt{H/n} with H = \log \mathrm{Card}(G) and B = a bound on the range of the functions in G.
For G of infinite cardinality, use metric entropy in place of H.
Results for regression are shown in a companion paper.


In particular, with λ_n = B\sqrt{H/n}, H = \log \mathrm{Card}(G), the risk is of order λ_n when the target has finite ℓ1 norm.
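Specialized to regression with squared error this criterion is the familiar Lasso; the sketch below (scikit-learn's Lasso, made-up data and dictionary size, with its alpha set to a λ_n of the √(log|G|/n) shape as a rough stand-in for the validated level above) shows the form of such a fit.

import numpy as np
from sklearn.linear_model import Lasso

# Sketch of an l1-penalized regression fit over a dictionary G of features.
# The data, dictionary size, and the constant in lambda_n are made-up; the point
# is only the lambda_n ~ sqrt(log|G| / n) scaling discussed above.
rng = np.random.default_rng(1)
n, G_size = 200, 50
X = rng.standard_normal((n, G_size))          # dictionary evaluations g(x_i)
theta_true = np.zeros(G_size); theta_true[:3] = [1.0, -0.5, 0.25]
y = X @ theta_true + 0.1 * rng.standard_normal(n)

lam = np.sqrt(np.log(G_size) / n)             # lambda_n of order sqrt(H/n), H = log|G|
fit = Lasso(alpha=lam).fit(X, y)              # minimizes (1/2n)||y - X theta||^2 + alpha ||theta||_1

print("nonzero coefficients:", np.flatnonzero(fit.coef_))
print("l1 norm of estimate:", np.abs(fit.coef_).sum())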


Comment on proof

Shannon-like demonstration of the existence of the variable-complexity cover property.
Inspiration from a technique originating with Lee Jones (92).
Representer \tilde{f} of f_\theta of the form

\tilde{f}(x) = \frac{v}{m} \sum_{k=1}^{m} g_k(x)

with g_1, \ldots, g_m picked at random from G, independently, where g arises with probability proportional to |\theta_g|.
May pick them in a greedy fashion, as in the demonstration of fast computation properties.
Details are in the paper in the Festschrift for Rissanen, with a summary in the ITW Porto proceedings.
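A sketch of that sampling step (my own illustration: here v is taken to be ‖θ‖1 and the sign of θ_g is folded into the sampled term so that the average is unbiased for f_θ; the paper's exact construction may differ):

import numpy as np

# Sketch of the random representer: draw m dictionary indices with probability
# proportional to |theta_g| and average.  Taking v = ||theta||_1 and handling
# signs as below are assumptions for this illustration.
rng = np.random.default_rng(2)

def representer(theta, dictionary, m, x):
    """Return f_tilde(x) = (v/m) * sum_k s_k * g_k(x) for sampled g_k."""
    v = np.abs(theta).sum()
    probs = np.abs(theta) / v
    idx = rng.choice(len(theta), size=m, p=probs)           # g_k sampled prop. to |theta_g|
    signs = np.sign(theta[idx])
    return (v / m) * sum(s * dictionary[i](x) for s, i in zip(signs, idx))

# Tiny made-up dictionary and coefficients:
dictionary = [np.sin, np.cos, np.tanh]
theta = np.array([0.8, -0.3, 0.4])
x = np.linspace(0, 1, 5)
f_theta = sum(t * g(x) for t, g in zip(theta, dictionary))
print(f_theta)
print(representer(theta, dictionary, m=200, x=x))           # close to f_theta on average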


Summary

Information divergence is monotone

Info Capacity ∼ Minimax Redundancy ∼ Minimax Risk

Adaptivity of MDL: simultaneously minimax optimal

Penalized likelihood risk analysis:
ℓ0 penalty (dim/2) log n classically analyzed
ℓ1 penalty λ_n \|\theta\|_1 valid for λ_n ≥ \sqrt{H/n}.
