Tutorial Part I: Information theory meets machine learning

Emmanuel Abbe (Princeton University) and Martin Wainwright (UC Berkeley)
(UC Berkeley and Princeton) Information theory and machine learning June 2015 1 / 46
Introduction
Era of massive data sets
Fascinating problems at the interfaces between information theory and statistical machine learning.
1 Fundamental issues
  - Concentration of measure: high-dimensional problems are remarkably predictable
  - Curse of dimensionality: without structure, many problems are hopeless
  - Low-dimensional structure is essential
2 Machine learning brings in algorithmic components
  - Computational constraints are central
  - Memory constraints: need for distributed and decentralized procedures
  - Increasing importance of privacy
Historical perspective: info. theory and statistics
Claude Shannon Andrey Kolmogorov
A rich intersection between information theory and statistics
1 Hypothesis testing, large deviations
2 Fisher information, Kullback-Leibler divergence
3 Metric entropy and Fano's inequality
Statistical estimation as channel coding

- Codebook: indexed family of probability distributions $\{Q_\theta \mid \theta \in \Omega\}$
- Codeword: nature chooses some $\theta^* \in \Omega$
- Channel: user observes $n$ i.i.d. draws $X_i \sim Q_{\theta^*}$
- Decoding: estimator $X_1^n \mapsto \widehat{\theta}$ such that $\widehat{\theta} \approx \theta^*$
Perspective dating back to Kolmogorov (1950s), with many variations:

- codebooks/codewords: graphs, vectors, matrices, functions, densities, ...
- channels: random graphs, regression models, elementwise probes of vectors/matrices, random projections
- "closeness": exact/partial graph recovery in Hamming distance, $\ell_p$-distances, $L^p(Q)$-distances, sup-norm, etc.
Machine learning: algorithmic issues come to the forefront!

1 Efficient algorithms are essential
  - only (low-order) polynomial-time methods can ever be implemented
  - trade-offs between computational complexity and performance?
  - fundamental barriers due to computational complexity?
2 Distributed procedures are often needed
  - many modern data sets are too large to be stored on a single machine
  - need methods that operate separately on pieces of data
  - trade-offs between decentralization and performance?
  - fundamental barriers due to communication complexity?
3 Privacy and access issues
  - conflicts between individual privacy and benefits of aggregation?
  - principled information-theoretic formulations of such trade-offs?
Part I of tutorial: Three vignettes
1 Graphical model selection
2 Sparse principal component analysis
3 Structured non-parametric regression and minimax theory
Vignette A: Graphical model selection

Simple motivating example: Epidemic modeling

Disease status of person $s$:
$x_s = \begin{cases} +1 & \text{if individual } s \text{ is infected} \\ -1 & \text{if individual } s \text{ is healthy} \end{cases}$
(1) Independent infection:
$Q(x_1, \ldots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s)$

(2) Cycle-based infection:
$Q(x_1, \ldots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{(s,t) \in C} \exp(\theta_{st} x_s x_t)$

(3) Full clique infection:
$Q(x_1, \ldots, x_5) \propto \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{s \neq t} \exp(\theta_{st} x_s x_t)$
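At this scale these distributions can be handled exactly: with five binary variables, the normalizing constant is a sum over only $2^5 = 32$ configurations. A minimal sketch of the cycle-based model; the coupling values and edge set below are illustrative, not from the tutorial:

```python
import itertools
import math

def ising_pmf(theta, theta_edges):
    """Exact pmf of a small Ising model by brute-force enumeration.

    theta: sequence of node parameters theta_s
    theta_edges: dict mapping edge (s, t) -> coupling theta_st
    """
    p = len(theta)
    configs = list(itertools.product([-1, +1], repeat=p))
    weights = []
    for x in configs:
        score = sum(theta[s] * x[s] for s in range(p))
        score += sum(w * x[s] * x[t] for (s, t), w in theta_edges.items())
        weights.append(math.exp(score))
    Z = sum(weights)  # partition function: feasible since 2^p is tiny here
    return {x: w / Z for x, w in zip(configs, weights)}

# Cycle-based infection model on 5 nodes (parameters chosen for illustration)
cycle_edges = {(s, (s + 1) % 5): 0.5 for s in range(5)}
pmf = ising_pmf([0.1] * 5, cycle_edges)
```

With positive couplings, configurations in which neighbors agree receive higher probability, matching the epidemic intuition that infection spreads along edges.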
Possible epidemic patterns (on a square)
From epidemic patterns to graphs
Underlying graphs
Model selection for graphs

- drawn $n$ i.i.d. samples from
  $Q(x_1, \ldots, x_p; \theta^*) \propto \exp\Big\{ \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big\}$
- graph $G$ and matrix $[\theta^*]_{st} = \theta_{st}$ of edge weights are unknown
- data matrix $X_1^n \in \{-1, +1\}^{n \times p}$
- want estimator $X_1^n \mapsto \widehat{G}$ to minimize the error probability
  $Q^n[\widehat{G}(X_1^n) \neq G]$, the probability that the estimated graph differs from the truth
Channel decoding:
Think of graphs as codewords, and the graph family as a codebook.
Past/on-going work on graph selection

- exact polynomial-time solution on trees (Chow & Liu, 1967)
- testing for local conditional independence relationships (e.g., Spirtes et al., 2000; Kalisch & Bühlmann, 2008)
- pseudolikelihood and BIC criterion (Csiszár & Talata, 2006)
- pseudolikelihood and $\ell_1$-regularized neighborhood regression
  - Gaussian case (Meinshausen & Bühlmann, 2006)
  - binary case (Ravikumar, W. & Lafferty, 2006)
- various greedy and related methods (Bresler et al., 2008; Bresler, 2015; Netrapalli et al., 2010; Anandkumar et al., 2013)
- lower bounds and inachievability results
  - information-theoretic bounds (Santhanam & W., 2012)
  - computational lower bounds (Dagum & Luby, 1993; Bresler et al., 2014)
  - phase transitions and performance of neighborhood regression (Bento & Montanari, 2009)
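The $\ell_1$-regularized neighborhood approach can be sketched in a few lines: regress each node on all the others with an $\ell_1$ penalty, and declare an edge where the coefficient is nonzero. A hedged sketch using scikit-learn; the AND symmetrization rule and the regularization level are illustrative choices, not the tuned settings from the papers above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def neighborhood_select(X, reg_strength=0.1):
    """Estimate graph edges by l1-regularized logistic regression per node.

    X: (n, p) data matrix with entries in {-1, +1}
    Returns a symmetric {0, 1} adjacency matrix (AND rule: an edge is kept
    only if both endpoints select each other).
    """
    n, p = X.shape
    support = np.zeros((p, p), dtype=bool)
    for s in range(p):
        others = np.delete(np.arange(p), s)
        clf = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (reg_strength * n))
        clf.fit(X[:, others], X[:, s])
        support[s, others] = np.abs(clf.coef_[0]) > 1e-6
    return (support & support.T).astype(int)
```

With enough samples and a well-chosen regularization level, this recovers the true edge set under the conditions of the theory discussed below.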
US Senate network (2004-2006 voting)

Experiments for sequences of star graphs: p = 9, d = 3 and p = 18, d = 6
Empirical behavior: Unrescaled plots

[Figure: success probability versus raw sample size n, star graphs with a linear fraction of neighbors; curves for p = 64, 100, 225.]
Empirical behavior: Appropriately rescaled

[Figure: success probability versus the rescaled sample size n/(d^2 log p), star graphs with a linear fraction of neighbors; the curves for p = 64, 100, 225 align.]
Some theory: Scaling law for graph selection

- graphs $\mathcal{G}_{p,d}$ with $p$ nodes and maximum degree $d$
- minimum absolute weight $\theta_{\min}$ on edges
- how many samples $n$ are needed to recover the unknown graph?

Theorem (Ravikumar, W. & Lafferty, 2010; Santhanam & W., 2012)

Achievable result: Under regularity conditions, the graph estimate $\widehat{G}$ produced by $\ell_1$-regularized logistic regression satisfies
$n > c_u \big( d^2 + \theta_{\min}^{-2} \big) \log p$ (lower bound on sample size) $\implies Q[\widehat{G} \neq G] \to 0$ (vanishing probability of error).

Necessary condition: For a graph estimate $\widehat{G}$ produced by any algorithm,
$n < c_\ell \big( d^2 + \theta_{\min}^{-2} \big) \log p$ (upper bound on sample size) $\implies Q[\widehat{G} \neq G] \geq 1/2$ (constant probability of error).
Information-theoretic analysis of graph selection

Question: How to prove lower bounds on graph selection methods?

Answer: Graph selection is an unorthodox channel coding problem.

- codewords/codebook: graph $G$ in some graph class $\mathcal{G}$
- channel use: draw sample $X_i = (X_{i1}, \ldots, X_{ip}) \in \{-1, +1\}^p$ from the graph-structured distribution $Q_G$
- decoding problem: use $n$ samples $\{X_1, \ldots, X_n\}$ to correctly distinguish the codeword:
  $G \longrightarrow X_1, X_2, \ldots, X_n \sim Q(X \mid G)$
Proof sketch: Main ideas for necessary conditions

- based on assessing the difficulty of graph selection over various sub-ensembles $\widetilde{\mathcal{G}} \subseteq \mathcal{G}_{p,d}$
- choose $G \in \widetilde{\mathcal{G}}$ uniformly at random, and consider the multi-way hypothesis testing problem based on the data $X_1^n = \{X_1, \ldots, X_n\}$
- for any graph estimator $\psi: \mathcal{X}^n \to \widetilde{\mathcal{G}}$, Fano's inequality implies that
  $Q[\psi(X_1^n) \neq G] \geq 1 - \frac{I(X_1^n; G)}{\log |\widetilde{\mathcal{G}}|} - o(1)$
  where $I(X_1^n; G)$ is the mutual information between the observations $X_1^n$ and the randomly chosen graph $G$
- remaining steps:
  1 Construct difficult sub-ensembles $\widetilde{\mathcal{G}} \subseteq \mathcal{G}_{p,d}$
  2 Compute or lower bound the log cardinality $\log |\widetilde{\mathcal{G}}|$
  3 Upper bound the mutual information $I(X_1^n; G)$
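To see the mechanics of the Fano argument, here is a small numeric sketch with hypothetical values for the ensemble size and a per-sample mutual-information bound; the numbers are made up for illustration and are not from the tutorial:

```python
def fano_error_lower_bound(n, per_sample_mi, log_ensemble_size):
    """Fano-style lower bound P[error] >= 1 - I(X_1^n; G) / log|G|,
    using the chain-rule bound I(X_1^n; G) <= n * per_sample_mi."""
    return max(0.0, 1.0 - n * per_sample_mi / log_ensemble_size)

# Hypothetical ensemble: log|G| = 1000 nats; each sample reveals at most 0.5 nats.
err_small_n = fano_error_lower_bound(1000, 0.5, 1000.0)  # any estimator errs >= 1/2
err_large_n = fano_error_lower_bound(2000, 0.5, 1000.0)  # bound becomes vacuous
```

The design of a lower bound is then exactly steps 1-3 above: make $\log|\widetilde{\mathcal{G}}|$ large while keeping the per-sample information small.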
A hard d-clique ensemble

[Figure: base graph $\bar{G}$; graph $G_{uv}$; graph $G_{st}$.]

1 Divide the vertex set $V$ into $p/(d+1)$ groups of size $d+1$, and form the base graph $\bar{G}$ by making a $(d+1)$-clique $C$ within each group.
2 Form graph $G_{uv}$ by deleting edge $(u, v)$ from $\bar{G}$.
3 Consider the testing problem over the family of graph-structured distributions $\{Q(\cdot\,; G_{st}), (s,t) \in C\}$.

Why is this ensemble hard?
The Kullback-Leibler divergence between pairs decays exponentially in $d$, unless the minimum edge weight decays as $1/d$.
Vignette B: Sparse principal components analysis

Principal component analysis:
- widely-used method for dimensionality reduction, data compression, etc.; extracts top eigenvectors of the sample covariance matrix
- classical PCA in $p$ dimensions: inconsistent unless $p/n \to 0$
- low-dimensional structure: many applications lead to sparse eigenvectors

Population model: rank-one spiked covariance matrix
$\Sigma = \underbrace{\theta}_{\text{SNR parameter}} \underbrace{\beta^* (\beta^*)^T}_{\text{rank-one spike}} + I_p$

Sampling model: draw $n$ i.i.d. zero-mean vectors with $\mathrm{cov}(X_i) = \Sigma$, and form the sample covariance matrix
$\widehat{\Sigma} := \frac{1}{n} \sum_{i=1}^{n} X_i X_i^T$ ($p$-dimensional matrix, rank $\leq \min\{n, p\}$)
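A quick simulation of this sampling model; the dimension, sparsity, and SNR values below are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, k, snr = 50, 2000, 5, 2.0

# k-sparse, unit-norm spike beta*
beta = np.zeros(p)
beta[:k] = 1.0 / np.sqrt(k)

# Population covariance: Sigma = theta * beta beta^T + I_p
Sigma = snr * np.outer(beta, beta) + np.eye(p)

# Draw n i.i.d. zero-mean vectors and form the sample covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Sigma_hat = X.T @ X / n

# In this easy regime (p/n small), the top sample eigenvector is close to beta*
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
overlap = abs(eigvecs[:, -1] @ beta)
```

Classical PCA works here only because $p/n$ is small; the point of the vignette is the regime where $p$ is comparable to or larger than $n$, where sparsity must be exploited.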
Example: PCA for face compression/recognition
First 25 standard principal components (estimated from data)
Example: PCA for face compression/recognition
First 25 sparse principal components (estimated from data)
Perhaps the simplest estimator: Diagonal thresholding (Johnstone & Lu, 2008)

[Figure: diagonal values of the sample covariance plotted against index; the spiked coordinates stand out above the threshold.]

Given $n$ i.i.d. samples $X_i$ with zero mean, and with spiked covariance $\Sigma = \theta \beta^* (\beta^*)^T + I$:

1 Compute diagonal entries of the sample covariance: $\widehat{\Sigma}_{jj} = \frac{1}{n} \sum_{i=1}^{n} X_{ij}^2$.
2 Apply a threshold to the vector $\{\widehat{\Sigma}_{jj}, j = 1, \ldots, p\}$.
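The two steps above can be sketched directly; the threshold rule below (noise level plus a constant times the standard deviation of a sample variance) is an illustrative choice, not the calibrated threshold of the paper:

```python
import numpy as np

def diagonal_threshold(X, c=4.0):
    """Steps 1-2 above: compute diagonal entries of the sample covariance
    and keep coordinates exceeding a simple noise-level threshold.

    Under the spiked model, noise coordinates have variance ~1 while spiked
    coordinates have variance 1 + theta * beta_j^2. The threshold
    1 + c*sqrt(2/n) is an illustrative choice (c standard deviations of the
    sample variance of N(0,1) data above the noise level).
    """
    n = X.shape[0]
    diag = (X ** 2).mean(axis=0)  # hat{Sigma}_jj = (1/n) sum_i X_ij^2
    return np.flatnonzero(diag > 1.0 + c * np.sqrt(2.0 / n))
```

The method only looks at diagonal entries, which is exactly why (as the next slides show) it pays a quadratic rather than linear price in the sparsity $k$.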
Diagonal thresholding: unrescaled plots

[Figure: success probability versus number of observations n for diagonal thresholding with k = O(log p); curves for p = 100, 200, 300, 600, 1200.]
Diagonal thresholding: rescaled plots

[Figure: success probability versus the control parameter n/(k^2 log p) for diagonal thresholding with k = O(log p); curves for p = 100, 200, 300, 600, 1200 align.]

Scaling is quadratic in sparsity: $n \asymp k^2 \log p$
Diagonal thresholding and fundamental limit

Consider the spiked covariance matrix
$\Sigma = \underbrace{\theta}_{\text{signal-to-noise}} \underbrace{\beta^* (\beta^*)^T}_{\text{rank-one spike}} + I_{p \times p}$, where $\beta^* \in \mathbb{B}_0(k) \cap \mathbb{B}_2(1)$ is $k$-sparse with unit norm.

Theorem (Amini & W., 2009)

(a) There are thresholds $\alpha_\ell^{DT} \leq \alpha_u^{DT}$ such that
$\frac{n}{k^2 \log p} \leq \alpha_\ell^{DT} \implies$ DT fails w.h.p., and $\frac{n}{k^2 \log p} \geq \alpha_u^{DT} \implies$ DT succeeds w.h.p.

(b) For the optimal method (exhaustive search):
$\frac{n}{k \log p} < \alpha^{ES} \implies$ fail w.h.p., and $\frac{n}{k \log p} > \alpha^{ES} \implies$ succeed w.h.p.
One polynomial-time SDP relaxation

Recall the Courant-Fischer variational formulation:
$\beta^* = \arg\max_{\|z\|_2 = 1} \; z^T \underbrace{\big( \theta \beta^* (\beta^*)^T + I_{p \times p} \big)}_{\text{population covariance}} z.$

Equivalently, lifting to matrix variables $Z = z z^T$:
$\beta^* (\beta^*)^T = \arg\max_{\substack{Z \in \mathbb{R}^{p \times p} \\ Z = Z^T, \; Z \succeq 0}} \mathrm{trace}(Z \Sigma) \quad \text{s.t. } \mathrm{trace}(Z) = 1 \text{ and } \mathrm{rank}(Z) = 1.$

Dropping the rank constraint yields a standard SDP relaxation:
$\widehat{Z} = \arg\max_{\substack{Z \in \mathbb{R}^{p \times p} \\ Z = Z^T, \; Z \succeq 0}} \mathrm{trace}(Z \Sigma) \quad \text{s.t. } \mathrm{trace}(Z) = 1.$

In practice (d'Aspremont et al., 2008):
- apply this relaxation using the sample covariance matrix $\widehat{\Sigma}$
- add the $\ell_1$-constraint $\sum_{i,j=1}^{p} |Z_{ij}| \leq k^2$.
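The lifting step can be sanity-checked numerically: for any unit vector $z$ and $Z = z z^T$, the quadratic form and the trace objective coincide, and $Z$ is feasible for the rank-constrained problem. A minimal sketch with an arbitrary PSD matrix standing in for the covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 8
A = rng.standard_normal((p, p))
Sigma = A @ A.T                  # an arbitrary symmetric PSD "covariance"

z = rng.standard_normal(p)
z /= np.linalg.norm(z)           # unit vector, ||z||_2 = 1
Z = np.outer(z, z)               # lifted rank-one matrix Z = z z^T

quad = z @ Sigma @ z             # z^T Sigma z
lifted = np.trace(Z @ Sigma)     # trace(Z Sigma): identical objective value
```

Dropping the rank constraint is what makes the problem convex: the feasible set becomes the intersection of the PSD cone with a trace-one hyperplane, over which a linear objective is maximized.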
Phase transition for SDP: logarithmic sparsity

[Figure: success probability versus the control parameter n/(k log p) for the SDP relaxation with k = O(log p); curves for p = 100, 200, 300 align.]

Scaling is linear in sparsity: $n \asymp k \log p$
A natural question

Question: Can the logarithmic-sparsity or rank-one condition be removed?
Computational lower bound for sparse PCA

Consider the spiked covariance matrix
$\Sigma = \underbrace{\theta}_{\text{signal-to-noise}} \underbrace{\beta^* (\beta^*)^T}_{\text{rank-one spike}} + I_{p \times p}$, where $\beta^* \in \mathbb{B}_0(k) \cap \mathbb{B}_2(1)$ is $k$-sparse with unit norm.

Sparse PCA detection problem:
- $H_0$ (no signal): $X_i \sim \mathbb{D}(0, I_{p \times p})$
- $H_1$ (spiked signal): $X_i \sim \mathbb{D}(0, \Sigma)$

where $\mathbb{D}$ is a distribution with sub-exponential tail behavior.

Theorem (Berthet & Rigollet, 2013)

Under average-case hardness of planted clique, the polynomial-minimax level of detection $\theta_{POLY}$ is given by
$\theta_{POLY} \asymp \sqrt{\frac{k^2 \log p}{n}} \quad \text{for all sparsity } \log p \lesssim k \lesssim \sqrt{p}.$

Classical minimax level: $\theta_{OPT} \asymp \sqrt{\frac{k \log p}{n}}$.
Planted k-clique problem

[Figure: Erdős-Rényi graph versus a graph with a planted k-clique.]

Binary hypothesis test based on observing a random graph $G$ on $p$ vertices:
- $H_0$: Erdős-Rényi, each edge present independently with prob. 1/2
- $H_1$: planted $k$-clique, remaining edges random

Planted k-clique problem (matrix view)

[Figure: random binary matrix versus one with a planted k x k sub-matrix.]

Binary hypothesis test based on observing a random binary matrix:
- $H_0$: random $\{0,1\}$ matrix with Ber(1/2) entries off the diagonal
- $H_1$: planted $k \times k$ sub-matrix of ones, remaining entries random
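Generating an instance of this testing problem is straightforward; a sketch with arbitrary parameter values (a real test would also shuffle the vertex labels so the clique location is hidden):

```python
import numpy as np

def planted_clique_instance(p, k, rng):
    """Adjacency matrix of G(p, 1/2) with a k-clique planted on the
    first k vertices. Symmetric, {0,1}-valued, zero diagonal."""
    A = rng.integers(0, 2, size=(p, p))  # Ber(1/2) entries
    A = np.triu(A, 1)
    A = A + A.T                          # symmetrize; diagonal is zero
    A[:k, :k] = 1                        # plant the clique...
    np.fill_diagonal(A[:k, :k], 0)       # ...with no self-loops
    return A

rng = np.random.default_rng(3)
A = planted_clique_instance(20, 5, rng)
```

The conjectured hardness is that for $k \ll \sqrt{p}$, no polynomial-time test can reliably distinguish $H_0$ from $H_1$, and the theorem above transfers this hardness to sparse PCA detection.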
Vignette C: (Structured) non-parametric regression

Goal: How to predict an output from covariates?

- given covariates $(x_1, x_2, x_3, \ldots, x_p)$
- output variable $y$
- want to predict $y$ based on $(x_1, \ldots, x_p)$

Examples: medical imaging, geostatistics, astronomy, computational biology, ...
[Figure: function value versus design value x. (a) Second-order polynomial fit; (b) Lipschitz (first-order Sobolev) function fit.]
High dimensions and sample complexity

Possible models:
- ordinary linear regression: $y = \sum_{j=1}^{p} \theta_j x_j + w = \langle \theta, x \rangle + w$
- general non-parametric model: $y = f(x_1, x_2, \ldots, x_p) + w$.

Estimation accuracy: How well can $f$ be estimated using $n$ samples?

- linear models without any structure: error$^2 \asymp p/n$ (linear in $p$)
- with sparsity $k \ll p$: error$^2 \asymp \big( k \log \frac{ep}{k} \big)/n$ (logarithmic in $p$)
- non-parametric models, $p$-dimensional with smoothness $\alpha$:
  Curse of dimensionality: error$^2 \asymp (1/n)^{\frac{2\alpha}{2\alpha + p}}$ (exponential slow-down)
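These three rates can be compared numerically; a quick sketch with arbitrary values of $n$, $p$, $k$, and smoothness $\alpha$ (constants omitted, so only orders of magnitude are meaningful):

```python
import math

n, p, k, alpha = 10_000, 1_000, 10, 2.0

linear_rate = p / n                                       # unstructured linear model
sparse_rate = k * math.log(math.e * p / k) / n            # k-sparse linear model
nonparam_rate = (1 / n) ** (2 * alpha / (2 * alpha + p))  # p-dim, alpha-smooth
```

Even with ten thousand samples, the unstructured non-parametric rate is barely below one: in high dimensions, smoothness alone is not enough, which motivates the structured (sparse additive) models of Example 3 below.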
Minimax risk and sample size

Consider a function class $\mathcal{F}$, and $n$ i.i.d. samples from the model
$y_i = f^*(x_i) + w_i$, where $f^*$ is some member of $\mathcal{F}$.

For a given estimator $\{(x_i, y_i)\}_{i=1}^{n} \mapsto \widehat{f} \in \mathcal{F}$, the worst-case risk in a metric $\rho$ is
$\mathcal{R}_n^{\mathrm{worst}}(\widehat{f}; \mathcal{F}) = \sup_{f^* \in \mathcal{F}} \mathbb{E}^n \big[ \rho^2(\widehat{f}, f^*) \big].$

Minimax risk: For a given sample size $n$, the minimax risk is
$\inf_{\widehat{f}} \mathcal{R}_n^{\mathrm{worst}}(\widehat{f}; \mathcal{F}) = \inf_{\widehat{f}} \sup_{f^* \in \mathcal{F}} \mathbb{E}^n \big[ \rho^2(\widehat{f}, f^*) \big],$
where the infimum is taken over all estimators.
How to measure size of function classes?

A $2\delta$-packing of $\mathcal{F}$ in the metric $\rho$ is a collection $\{f^1, \ldots, f^M\} \subset \mathcal{F}$ such that $\rho(f^j, f^k) > 2\delta$ for all $j \neq k$. The packing number $M(2\delta)$ is the cardinality of the largest such set.

- Packing/covering entropy: emerged from the Russian school in the 1940s/1950s (Kolmogorov and collaborators)
- Central object in proving minimax lower bounds for nonparametric problems (e.g., Hasminskii & Ibragimov, 1978; Birgé, 1983; Yu, 1997; Yang & Barron, 1999)
Packing and covering numbers

[Figure: balls of radius $\delta$ centered at $f^1, \ldots, f^M$, pairwise separated by more than $2\delta$.]

A $2\delta$-packing of $\mathcal{F}$ in the metric $\rho$ is a collection $\{f^1, \ldots, f^M\} \subset \mathcal{F}$ such that $\rho(f^j, f^k) > 2\delta$ for all $j \neq k$. The packing number $M(2\delta)$ is the cardinality of the largest such set.
Example: Sup-norm packing for Lipschitz functions

[Figure: staircase-shaped functions with slopes $\pm L$ on a grid of width proportional to $\delta/L$.]

- $\delta$-packing set: functions $\{f^1, f^2, \ldots, f^M\}$ such that $\|f^j - f^k\|_\infty > \delta$ for all $j \neq k$
- for $L$-Lipschitz functions in one dimension: $M(\delta) \asymp 2^{L/\delta}$.
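A brief sketch of where the exponential count comes from, via the standard bump construction (reconstructed here, not taken from the slides):

```latex
% Partition [0,1] into m \asymp L/\delta intervals of width \asymp \delta/L, and
% let \phi_i be the triangular bump of slope \pm L (peak height \asymp \delta)
% supported on interval i. For each sign pattern \epsilon define
\[
  f_\epsilon(x) \;=\; \sum_{i=1}^{m} \epsilon_i\,\phi_i(x),
  \qquad \epsilon \in \{-1,+1\}^m .
\]
% Each f_\epsilon is L-Lipschitz, and if \epsilon \neq \epsilon' differ in
% coordinate i, then \|f_\epsilon - f_{\epsilon'}\|_\infty \gtrsim \delta at the
% peak of \phi_i. Hence
\[
  M(\delta) \;\gtrsim\; 2^{m},
  \qquad \log M(\delta) \;\asymp\; L/\delta .
\]
```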
Standard reduction: from estimation to testing

[Figure: a $2\delta$-packing $\{f^1, \ldots, f^M\}$ of $\mathcal{F}$.]

- goal: characterize the minimax risk for $\rho$-estimation over $\mathcal{F}$
- construct a $2\delta$-packing of $\mathcal{F}$: a collection $\{f^1, \ldots, f^M\}$ such that $\rho(f^j, f^k) > 2\delta$
- now form an $M$-component mixture distribution as follows:
  - draw a packing index $V \in \{1, \ldots, M\}$ uniformly at random
  - conditioned on $V = j$, draw $n$ i.i.d. samples $(X_i, Y_i) \sim Q_{f^j}$

1 Claim: Any estimator $\widehat{f}$ such that $\rho(\widehat{f}, f^V) \leq \delta$ w.h.p. can be used to solve the $M$-ary testing problem.
2 Use standard techniques (Assouad, Le Cam, Fano, ...) to lower bound the probability of error in the testing problem.
Minimax rate via metric entropy matching

- observe $(x_i, y_i)$ pairs from the model $y_i = f^*(x_i) + w_i$
- two possible norms:
  $\|\widehat{f} - f^*\|_n^2 := \frac{1}{n} \sum_{i=1}^{n} \big( \widehat{f}(x_i) - f^*(x_i) \big)^2, \quad \text{or} \quad \|\widehat{f} - f^*\|_2^2 = \mathbb{E}\big[ (\widehat{f}(X) - f^*(X))^2 \big].$

Metric entropy master equation

For many regression problems, the minimax rate $\delta_n > 0$ is determined by solving the master equation
$\log M(\delta; \mathcal{F}, \rho) \asymp n \delta^2.$

- basic idea (with Hellinger metric) dates back to Le Cam (1973)
- elegant general version due to Yang and Barron (1999)
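As a quick check of the recipe, take one-dimensional $L$-Lipschitz functions, whose metric entropy scales as $L/\delta$ (the special case $\alpha = 1$ of Example 2 below):

```latex
\[
  \log M(\delta) \;\asymp\; \frac{1}{\delta} \;\asymp\; n\delta^2
  \quad\Longrightarrow\quad
  \delta^3 \;\asymp\; \frac{1}{n}
  \quad\Longrightarrow\quad
  \delta_n^2 \;\asymp\; \Big(\frac{1}{n}\Big)^{2/3},
\]
% which matches the general rate (1/n)^{2\alpha/(2\alpha+1)} at \alpha = 1.
```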
Example 1: Sparse linear regression

Observations $y_i = \langle x_i, \theta^* \rangle + w_i$, where
$\theta^* \in \Theta(k, p) := \big\{ \theta \in \mathbb{R}^p \mid \|\theta\|_0 \leq k \text{ and } \|\theta\|_2 \leq 1 \big\}.$

Gilbert-Varshamov: can construct a $2\delta$-separated set with
$\log M(2\delta) \asymp k \log\big(\frac{ep}{k}\big)$ elements.

Master equation and minimax rate:
$\log M(\delta) \asymp n \delta^2 \quad \Longrightarrow \quad \delta^2 \asymp \frac{k \log(ep/k)}{n}.$

Polynomial-time achievability:
- by $\ell_1$-relaxations under restricted eigenvalue (RE) conditions (Candès & Tao, 2007; Bickel et al., 2009; Bühlmann & van de Geer, 2011)
- achieve minimax-optimal rates for $\ell_2$-error (Raskutti, W., & Yu, 2011)
- $\ell_1$-methods can be sub-optimal for prediction error $\|X(\widehat{\theta} - \theta^*)\|^2 / n$ (Zhang, W. & Jordan, 2014)
Example 2: α-smooth non-parametric regression

Observations $y_i = f^*(x_i) + w_i$, where
$f^* \in \mathcal{F}(\alpha) = \big\{ f: [0,1] \to \mathbb{R} \mid f \text{ is } \alpha\text{-times differentiable, with } \max_{j \leq \alpha} \|f^{(j)}\|_\infty \leq C \big\}.$

Classical results in approximation theory (e.g., Kolmogorov & Tikhomirov, 1959):
$\log M(\delta; \mathcal{F}) \asymp (1/\delta)^{1/\alpha}.$

Master equation and minimax rate:
$\log M(\delta) \asymp n \delta^2 \quad \Longrightarrow \quad \delta^2 \asymp \big(1/n\big)^{\frac{2\alpha}{2\alpha + 1}}.$
Example 3: Sparse additive and α-smooth

Observations $y_i = f^*(x_i) + w_i$, where $f^*$ satisfies the constraints:
- Additively decomposable: $f(x_1, \ldots, x_p) = \sum_{j=1}^{p} g_j(x_j)$
- Sparse: at most $k$ of $(g_1, \ldots, g_p)$ are non-zero
- Smooth: each $g_j$ belongs to the $\alpha$-smooth family $\mathcal{F}(\alpha)$

Combining previous results yields a $2\delta$-packing with
$\log M(\delta; \mathcal{F}) \asymp k (1/\delta)^{1/\alpha} + k \log\big(\frac{ep}{k}\big).$

Master equation and minimax rate: solving $\log M(\delta) \asymp n \delta^2$ yields
$\delta^2 \asymp \underbrace{k (1/n)^{\frac{2\alpha}{2\alpha + 1}}}_{k\text{-component estimation}} + \underbrace{\frac{k \log(ep/k)}{n}}_{\text{search complexity}}.$
Summary

To be provided during the tutorial...