Transductive Rademacher Complexity and its Applications

Ran El-Yaniv and Dmitry Pechyony
Technion – Israel Institute of Technology, Haifa, Israel
24.08.2007
Induction vs. Transduction

Inductive learning:
- Distribution $D$ over examples $(x, y)$.
- A training set $\{(x_i, y_i)\}_{i=1}^{m}$ is fed to the learning algorithm, which outputs a hypothesis $h$; unlabeled examples are then labeled by $h$.
- Goal: minimize $\mathbb{E}_{(x,y) \sim D}\,\{\ell(h(x), y)\}$.

Transductive learning (Vapnik '74, '98):
- A training set $S_m \triangleq \{(x_i, y_i)\}_{i=1}^{m}$ and a test set $X_u \triangleq \{x_i\}_{i=m+1}^{m+u}$ are fed to the learning algorithm, which outputs labels for the test set.
- Goal: minimize $L_u \triangleq \mathbb{E}_{(x,y) \in X_u}\,\{\ell(h(x), y)\}$.
Distribution-free Model [Vapnik '74, '98]

[Figure: the full sample of $m+u$ points; labeled training points shown as X, unlabeled test points as ?.]

- Given: a "full sample" of $m+u$ unlabeled examples, each with its true (unknown) label.
- The full sample is partitioned into a training set ($m$ points) and a test set ($u$ points).
- The labels of the training examples are revealed.
- Goal: label the test examples.
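To make the protocol concrete, here is a minimal NumPy sketch of the setting; the data, set sizes, and labeling rule are hypothetical, invented only to illustrate that the full sample and its labels are fixed and only the partition is random:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy full sample: m + u points, each with a fixed (hidden) label.
m, u = 20, 80
X_full = rng.normal(size=(m + u, 2))           # feature vectors of the full sample
y_full = np.sign(X_full[:, 0] + X_full[:, 1])  # one fixed label per point

# Uniformly random partition of the full sample into training and test sets.
perm = rng.permutation(m + u)
train_idx, test_idx = perm[:m], perm[m:]

S_m = (X_full[train_idx], y_full[train_idx])   # labeled training set (labels revealed)
X_u = X_full[test_idx]                         # test points known, labels hidden
```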
Rademacher complexity

Induction:
- Hypothesis space $\mathcal{F}$: a set of functions $f : \mathcal{D} \to \mathbb{R}$.
- $X_m$: the training points.
- $\sigma = \{\sigma_i\}_{i=1}^{m}$: i.i.d. random variables with $\Pr\{\sigma_i = 1\} = \Pr\{\sigma_i = -1\} = \tfrac{1}{2}$.
- Rademacher complexity: $R_m(\mathcal{F}) = \tfrac{1}{m}\, \mathbb{E}_{X_m} \mathbb{E}_{\sigma} \left\{ \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(x_i) \right\}$.

Transduction (version 1):
- Hypothesis space $\mathcal{H} \subseteq \mathbb{R}^{m+u}$: a set of vectors $h$.
- $X_{m+u}$: the full sample, with $m$ training and $u$ test points.
- $\sigma = \{\sigma_i\}_{i=1}^{m+u}$: distributed as in induction.
- Rademacher complexity: $R_{m+u}(\mathcal{H}) = \left(\tfrac{1}{m} + \tfrac{1}{u}\right) \mathbb{E}_{\sigma} \left\{ \sup_{h \in \mathcal{H}} \sum_{i=1}^{m+u} \sigma_i h_i \right\}$.
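For intuition, the inductive quantity can be approximated by Monte Carlo when the function class is finite. The sketch below is my illustration, not part of the talk; it estimates the empirical (fixed-sample) Rademacher complexity, i.e. the expectation over $X_m$ is not taken, and all names are mine:

```python
import numpy as np

def rademacher_complexity_mc(F_vals, n_draws=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite function class on a fixed sample X_m.

    F_vals: (|F|, m) array whose row j holds (f_j(x_1), ..., f_j(x_m))."""
    rng = np.random.default_rng(seed)
    _, m = F_vals.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)  # Pr{+1} = Pr{-1} = 1/2
        total += np.max(F_vals @ sigma)          # sup over a finite class = max over rows
    return total / (n_draws * m)                 # the 1/m factor from the definition
```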
Transductive Rademacher complexity

Version 1:
- $X_{m+u}$: the full sample, with $m$ training and $u$ test points.
- $\mathcal{H}$: the transductive hypothesis space.
- $\sigma = \{\sigma_i\}_{i=1}^{m+u}$: i.i.d. random variables distributed by $D_1$: $\Pr\{\sigma_i = 1\} = \Pr\{\sigma_i = -1\} = \tfrac{1}{2}$.
- Rademacher complexity: $R_{m+u}(\mathcal{H}, D_1) = \left(\tfrac{1}{m} + \tfrac{1}{u}\right) \mathbb{E}_{\sigma \sim D_1} \left\{ \sup_{h \in \mathcal{H}} \sum_{i=1}^{m+u} \sigma_i h_i \right\}$.

Version 2: a sparse distribution $D_s$ of Rademacher variables,
$$\Pr\{\sigma_i = 1\} = \Pr\{\sigma_i = -1\} = \frac{mu}{(m+u)^2}, \qquad \Pr\{\sigma_i = 0\} = 1 - \frac{2mu}{(m+u)^2}.$$

We develop risk bounds with $D_s$.

Lemma 1: $R_{m+u}(\mathcal{H}, D_s) \le R_{m+u}(\mathcal{H}, D_1)$.

Notation: $R_{m+u}(\mathcal{H}) \triangleq R_{m+u}(\mathcal{H}, D_s)$.
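A quick sketch of how one might sample from the sparse distribution $D_s$ (the function name is mine; note that $mu/(m+u)^2 \le 1/4$, so the probabilities are always valid):

```python
import numpy as np

def draw_sigma_sparse(m, u, rng):
    """One draw of sigma = (sigma_1, ..., sigma_{m+u}) from the sparse
    distribution D_s: Pr{+1} = Pr{-1} = mu/(m+u)^2, Pr{0} = the rest."""
    p = m * u / (m + u) ** 2                     # p <= 1/4 for all m, u
    return rng.choice([1.0, -1.0, 0.0], size=m + u, p=[p, p, 1.0 - 2.0 * p])

sigma = draw_sigma_sparse(20, 80, np.random.default_rng(0))
```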
Risk bound

Notation:
- $L_u(h)$: the 0/1 error of $h$ on the test examples $X_u$.
- $L_m^{\gamma}(h)$: the empirical $\gamma$-margin error of $h$ on the training examples $S_m$.

Theorem: For any $\delta > 0$, $\gamma > 0$, with probability at least $1 - \delta$ over the random partition of the full sample $S_{m+u}$ into $(S_m, X_u)$, for all hypotheses $h \in \mathcal{H}$ it holds that
$$L_u(h) \le L_m^{\gamma}(h) + \frac{1}{\gamma} R_{m+u}(\mathcal{H}) + O\left(\sqrt{\left(\frac{1}{m} + \frac{1}{u}\right) \ln \frac{1}{\delta}}\right).$$

Proof: based on and inspired by the results of [McDiarmid '89], [Bartlett and Mendelson '02] and [Meir and Zhang '03].

Previous results: [Lanckriet et al. '04] covers the case $m = u$.
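For illustration only, the right-hand side of the theorem can be evaluated numerically once the constant hidden in the $O(\cdot)$ term is fixed; the slide does not specify that constant, so `c` below is a placeholder assumption, not a certified value:

```python
import numpy as np

def risk_bound_rhs(margin_error, R_mu, m, u, gamma, delta, c=1.0):
    """Right-hand side of the risk bound above; c stands in for the
    unspecified constant of the O(.) term, so this is illustrative only."""
    slack = c * np.sqrt((1.0 / m + 1.0 / u) * np.log(1.0 / delta))
    return margin_error + R_mu / gamma + slack
```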
Inductive vs. Transductive hypothesis spaces

Induction: to use the risk bounds, the hypothesis space should be defined before observing the training set.

Transduction: the hypothesis space can be defined after observing the full sample $X_{m+u}$, but before observing the actual partition $(S_m, X_u)$.

Conclusion: transduction allows choosing a data-dependent hypothesis space; for example, one optimized to have low Rademacher complexity. This cannot be done in induction!
Another view on transductive algorithms

Unlabeled-Labeled Decomposition (ULD): the learner computes an $(m+u) \times r$ matrix $K$ from the full sample $X_{m+u}$, and an $r \times 1$ vector $\alpha$ from the partition $(S_m, X_u)$; the output hypothesis is $h = K\alpha$.

Example: $K$ is the inverse of the graph Laplacian; $\alpha_i = y_i$ if $x_i \in S_m$, and $\alpha_i = 0$ otherwise.
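A minimal sketch of the graph-Laplacian instance of ULD. Since the Laplacian is singular, the sketch uses its pseudo-inverse; reading "inverse of the graph Laplacian" as a pseudo-inverse (rather than, say, inverting $L + \epsilon I$) is my assumption, and the toy graph is invented:

```python
import numpy as np

# Toy 4-point similarity graph (hypothetical data; points 0 and 1 are labeled).
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W       # unnormalized graph Laplacian

# K: pseudo-inverse of the Laplacian (L is singular, so pinv, not inv);
# here r = m + u.
K = np.linalg.pinv(L)

# alpha: training labels on the labeled points, zero on the test points.
alpha = np.array([+1., -1., 0., 0.])

h = K @ alpha                        # transductive soft labels; predict sign(h)
print(np.sign(h))
```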
Bounding Rademacher complexity

Hypothesis space $\mathcal{H}_A$: the set of all $h = K\alpha$ obtained by operating the transductive algorithm $A$ on all possible partitions $(S_m, X_u)$.

Notation:
- $T$: the set of $\alpha$'s generated by $A$; $\mu = \sup_{\alpha \in T} \{\|\alpha\|_2\}$.
- $\{\omega_i\}_{i=1}^{r}$: all singular values of $K$.

Lemma 2: $R_{m+u}(\mathcal{H}_A) \le \mu \sqrt{\dfrac{2}{mu} \sum_{i=1}^{r} \omega_i^2}$.

Lemma 2 justifies the spectral transformations performed to improve the performance of transductive algorithms ([Chapelle et al. '02], [Joachims '03], [Zhang and Ando '05]).
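Given a concrete $K$, the right-hand side of Lemma 2 is straightforward to evaluate; the sketch below (names mine) uses the fact that $\sum_i \omega_i^2$ is the squared Frobenius norm of $K$:

```python
import numpy as np

def lemma2_bound(K, mu_norm, m, u):
    """Right-hand side of Lemma 2: mu * sqrt((2/(m*u)) * sum_i omega_i^2).

    mu_norm must be supplied as a bound on ||alpha||_2 over all partitions;
    e.g. sqrt(m) when alpha has m entries in {-1, +1}."""
    omega = np.linalg.svd(K, compute_uv=False)   # all singular values of K
    return mu_norm * np.sqrt(2.0 / (m * u) * np.sum(omega ** 2))
```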
Bounds for graph-based algorithms

Consistency method [Zhou, Bousquet, Lal, Weston, Scholkopf '03]:
$$R_{m+u}(\mathcal{H}_A) \le \sqrt{\frac{2}{u} \sum_{i=1}^{m+u} \omega_i^2},$$
where $\{\omega_i\}_{i=1}^{m+u}$ are the singular values of $K$.

Similar bounds hold for the algorithms of [Joachims '03], [Belkin et al. '04], etc.
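The same computation with the constant of the consistency-method bound (again an illustrative helper of mine, not code from the talk):

```python
import numpy as np

def consistency_method_bound(K, u):
    """Right-hand side of the bound above: sqrt((2/u) * sum_i omega_i^2),
    where omega_i are the singular values of K."""
    omega = np.linalg.svd(K, compute_uv=False)
    return np.sqrt(2.0 / u * np.sum(omega ** 2))
```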
Topics not covered
- Bounding the Rademacher complexity when $K$ is a kernel matrix.
- For some algorithms: a data-dependent method of computing probabilistic upper and lower bounds on the Rademacher complexity.
- A risk bound for transductive mixtures.
Directions for future research
- Tighten the risk bound to allow effective model selection:
  - a bound depending on the 0/1 empirical error;
  - use of variance information to obtain a better convergence rate.
- Local transductive Rademacher complexity.
- Clever data-dependent choice of low-Rademacher hypothesis spaces.
Monte Carlo estimation of transductive Rademacher complexity

Rademacher complexity: $R_{m+u}(\mathcal{H}) = \left(\tfrac{1}{m} + \tfrac{1}{u}\right) \mathbb{E}_{\sigma} \{ \sup_{h \in \mathcal{H}} \sigma \cdot h \}$.

Draw uniformly $n$ vectors of Rademacher variables, $\sigma^{(1)}, \ldots, \sigma^{(n)}$.

By Hoeffding's inequality, for any $\delta > 0$, with probability at least $1 - \delta$,
$$R_{m+u}(\mathcal{H}) \le \left(\frac{1}{m} + \frac{1}{u}\right) \frac{1}{n} \sum_{i=1}^{n} \sup_{h \in \mathcal{H}} \sigma^{(i)} \cdot h + O\left(\sqrt{\frac{1}{n} \ln \frac{1}{\delta}}\right).$$

How to compute the supremum? For the Consistency Method of [Zhou et al. '03], $\sup_{h \in \mathcal{H}} \sigma^{(i)} \cdot h$ can be computed in $O\left((m+u)^2\right)$ time.

A symmetric Hoeffding inequality gives a probabilistic lower bound on the transductive Rademacher complexity.
Induction vs. Transduction: differences

Induction:
- Unknown underlying distribution.
- Test examples are not known; they will be sampled from the same distribution.
- Generate a general hypothesis. Want generalization!
- Independent training examples.

Transduction:
- No unknown distribution; each example has a unique label.
- Test examples are known.
- Only classify the given examples. No generalization!
- Dependent training and test examples.