CS 5614: (Big) Data Management Systems
B. Aditya Prakash Lecture #20: Machine Learning 2
Supervised Learning
§ Example: Spam filtering
§ Instance space x ∈ X (|X| = n data points)
 – Binary or real-valued feature vector x of word occurrences
 – d features (words + other things, d ~ 100,000)
§ Class y ∈ Y
 – y: Spam (+1), Ham (-1)
§ Goal: Estimate a function f(x) so that y = f(x)
More generally: Supervised Learning
§ Would like to do prediction: estimate a function f(x) so that y = f(x)
§ Where y can be:
 – Real number: Regression
 – Categorical: Classification
 – Complex object:
  • Ranking of items, parse tree, etc.
§ Data is labeled:
 – Have many pairs {(x, y)}
  • x … vector of binary, categorical, real-valued features
  • y … class ({+1, -1}, or a real number)
Supervised Learning
§ Task: Given data (X, Y), build a model f() to predict Y' based on X'
§ Strategy: Estimate y = f(x) on (X, Y). Hope that the same f(x) also works to predict the unknown Y'
 – The "hope" is called generalization
  • Overfitting: f(x) predicts Y well but is unable to predict Y'
 – We want to build a model that generalizes well to unseen data
  • But how can we do well on data we have never seen before?!?
[Figure: (X, Y) is the training data; (X', Y') is the test data]
Supervised Learning
§ Idea: Pretend we do not know the data/labels we actually do know
 – Build the model f(x) on the training data; see how well f(x) does on the test data
  • If it does well, then apply it also to X'
§ Refinement: Cross validation
 – Splitting into a single training/validation set is brutal
 – Let's split our data (X, Y) into 10 folds (buckets)
 – Take out 1 fold for validation, train on the remaining 9
 – Repeat this 10 times, report average performance (see the sketch below)
[Figure: data split into training set, validation set, and test set X']
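A minimal sketch of the 10-fold procedure in Python; `train` and `evaluate` are hypothetical stand-ins for the actual model-fitting and scoring routines, and the shuffle/fold mechanics are one reasonable implementation, not the only one:

import random

def cross_validate(X, Y, train, evaluate, k=10):
    """k-fold cross validation: split into k buckets, hold one out for
    validation, train on the rest, and average the k scores.
    Assumed helpers: train(X, Y) -> model, evaluate(model, X, Y) -> score."""
    idx = list(range(len(X)))
    random.shuffle(idx)                      # random assignment to folds
    folds = [idx[i::k] for i in range(k)]    # k roughly equal buckets
    scores = []
    for i in range(k):
        held_out = set(folds[i])             # take out fold i for validation
        tr = [j for j in idx if j not in held_out]
        model = train([X[j] for j in tr], [Y[j] for j in tr])
        scores.append(evaluate(model,
                               [X[j] for j in held_out],
                               [Y[j] for j in held_out]))
    return sum(scores) / k                   # report average performance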
Linear models for classification
§ Binary classification:
  f(x) = +1 if w1 x1 + w2 x2 + … + wd xd ≥ θ
         -1 otherwise
§ Input: Vectors x(j) and labels y(j)
 – Vectors x(j) are real valued where ‖x‖2 = 1
§ Goal: Find vector w = (w1, w2, …, wd)
 – Each wi is a real number
[Figure: "+" and "-" points separated by the hyperplane w ⋅ x = 0; the decision boundary is linear]
Note: the threshold can be folded into the weights: x ⇔ (x, 1), w ⇔ (w, -θ), so the rule becomes sign(w ⋅ x) (sketched below).
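To make the decision rule and the threshold-folding note concrete, a small illustrative Python sketch (the function names are ours, not from the slides):

def predict(w, x, theta):
    """Linear classifier: +1 if w . x >= theta, else -1."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s >= theta else -1

def fold_threshold(w, x, theta):
    """Fold the threshold into the weights: append a constant-1 feature to x
    and -theta to w, so the rule becomes sign(w' . x') with threshold 0."""
    return w + [-theta], x + [1.0]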
Perceptron [Rosenblatt '58]
§ (very) Loose motivation: Neuron
§ Inputs are feature values
§ Each feature has a weight wi
§ Activation is the sum:
 – f(x) = Σi wi xi = w ⋅ x
§ If f(x) is:
 – Positive: Predict +1
 – Negative: Predict -1
[Figure: inputs x1 … x4 (e.g., "viagra", "nigeria") weighted by w1 … w4, summed, and thresholded at ≥ 0; output Spam = +1, Ham = -1; decision boundary w ⋅ x = 0]
Perceptron: Estimating w
§ Perceptron: y' = sign(w ⋅ x)
§ How to find parameters w? (a code sketch follows)
 – Start with w(0) = 0
 – Pick training examples x(t) one by one (from disk)
 – Predict class of x(t) using current weights
  • y' = sign(w(t) ⋅ x(t))
 – If y' is correct (i.e., y(t) = y'):
  • No change: w(t+1) = w(t)
 – If y' is wrong: adjust w(t):
  • w(t+1) = w(t) + η ⋅ y(t) ⋅ x(t)
 – η is the learning rate parameter
 – x(t) is the t-th training example
 – y(t) is the true t-th class label ({+1, -1})
[Figure: a misclassified x(t) with y(t) = 1 pulls w(t) by η ⋅ y(t) ⋅ x(t) toward w(t+1)]
Note that the Perceptron is a conservative algorithm: it ignores samples that it classifies correctly.
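A minimal sketch of the update loop in Python, assuming the examples fit in an in-memory list and a fixed number of epochs (both are illustration choices; the slide streams examples from disk):

def train_perceptron(examples, d, eta=1.0, epochs=10):
    """Perceptron: start from w = 0; on a mistake, nudge w toward y * x.
    `examples` is a list of (x, y) with x a length-d list and y in {+1, -1}."""
    w = [0.0] * d
    for _ in range(epochs):
        for x, y in examples:
            y_pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
            if y_pred != y:                 # conservative: update only on mistakes
                for j in range(d):
                    w[j] += eta * y * x[j]  # w(t+1) = w(t) + eta * y(t) * x(t)
    return w

# Example use: w = train_perceptron([([1.0, 0.0], 1), ([0.0, 1.0], -1)], d=2)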
Perceptron: The Good and the Bad
§ Good: Perceptron convergence theorem:
 – If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the Perceptron learning algorithm will converge
§ Bad: Never converges: if the data is not separable, the weights dance around indefinitely
§ Bad: Mediocre generalization:
 – Finds a "barely" separating solution
Updating the Learning Rate
§ Perceptron will oscillate and won't converge
§ So, when to stop learning?
§ (1) Slowly decrease the learning rate η (snippet below)
 – A classic way: η = c1/(t + c2)
  • But we also need to determine the constants c1 and c2
§ (2) Stop when the training error stops changing
§ (3) Have a small test dataset and stop when the test set error stops decreasing
§ (4) Stop when we have reached some maximum number of passes over the data
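Option (1) is a one-liner; the constants below are placeholders that would need tuning, as noted above:

def learning_rate(t, c1=1.0, c2=10.0):
    """Classic decreasing schedule eta_t = c1 / (t + c2)."""
    return c1 / (t + c2)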
SUPPORT VECTOR MACHINES
Support Vector Machines
§ Want to separate "+" from "-" using a line
§ Data: Training examples:
 – (x1, y1) … (xn, yn)
§ Each example i:
 – xi = (xi(1), …, xi(d))
  • xi(j) is real valued
 – yi ∈ {-1, +1}
§ Inner product: w ⋅ x = Σ_{j=1..d} w(j) ⋅ x(j)
§ Which is the best linear separator (defined by w)?
[Figure: "+" and "-" points with three candidate separating lines A, B, C]
Largest Margin
§ Distance from the separating hyperplane corresponds to the "confidence" of the prediction
§ Example:
 – We are more sure about the class of A and B than of C
Largest Margin
§ Margin 𝜸: Distance of the closest example from the decision line/hyperplane
The reason we define the margin this way is theoretical convenience and the existence of generalization error bounds that depend on the value of the margin.
Why is maximizing 𝜸 a good idea?
§ Remember: Dot product A ⋅ B = ‖A‖ ⋅ ‖B‖ ⋅ cos θ
[Figure: the projection of A onto B has length ‖A‖ cos θ]
Why is maximizing 𝜸 a good idea? What is the margin?
§ Let:
 – Line L: w ⋅ x + b = w(1)x(1) + w(2)x(2) + b = 0
 – w = (w(1), w(2))
 – Point A = (xA(1), xA(2))
 – Point M on the line = (xM(1), xM(2))
 – H … projection of A onto L
§ Distance of A from L (for ‖w‖ = 1):
  d(A, L) = |AH| = |(A - M) ⋅ w|
          = |(xA(1) - xM(1)) w(1) + (xA(2) - xM(2)) w(2)|
          = xA(1) w(1) + xA(2) w(2) + b = w ⋅ A + b
  Remember: xM(1) w(1) + xM(2) w(2) = -b since M belongs to line L
Largest Margin
Support Vector Machine
§ Maximize the margin:
 – Good according to intuition, theory (VC dimension) & practice
 – 𝜸 is the margin … the distance from the separating hyperplane w ⋅ x + b = 0
§ Maximizing the margin:
  max_{w,γ} γ  s.t.  ∀i, yi ⋅ (w ⋅ xi + b) ≥ γ
[Figure: "+" and "-" points separated by w ⋅ x + b = 0, with margin γ on either side]
SUPPORT VECTOR MACHINES: DERIVING THE MARGIN
Support Vector Machines
§ The separating hyperplane is defined by the support vectors
 – Points on the +/- margin planes of the solution
 – If you knew these points, you could ignore the rest
 – Generally, d+1 support vectors (for d-dimensional data)
Canonical Hyperplane: Problem
§ The same hyperplane is expressed by any scaling of (w, b): w ⋅ x + b = 0 and 2w ⋅ x + 2b = 0 define the same line, so the constraint yi ⋅ (w ⋅ xi + b) ≥ γ can be satisfied by simply scaling w up
Canonical Hyperplane: Solution
§ Fix the scale: require the support vectors to satisfy |w ⋅ x + b| = 1 (the canonical hyperplane); then the margin is γ = 1/‖w‖
Maximizing the Margin
§ Maximizing γ = 1/‖w‖ is the same as minimizing ‖w‖, so we arrive at:
  min_{w,b} ½‖w‖²  s.t.  ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1
Non-linearly Separable Data
§ If the data is not separable, introduce a penalty:
  min_{w} ½‖w‖² + C ⋅ (# number of mistakes)
  s.t. ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1
 – Minimize ‖w‖² plus the number of training mistakes
 – Set C using cross validation
§ How to penalize mistakes?
 – All mistakes are not equally bad!
[Figure: mostly separable "+"/"-" points, with a stray "+" and "-" on the wrong sides]
Support Vector Machines
§ Introduce slack variables ξi:
  min_{w,b,ξi≥0} ½‖w‖² + C ⋅ Σ_{i=1..n} ξi
  s.t. ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1 - ξi
§ If point xi is on the wrong side of the margin, then it gets penalty ξi
§ For each data point:
 – If margin ≥ 1, don't care
 – If margin < 1, pay linear penalty
[Figure: points on the wrong side of the margin pay penalties ξi, ξj]
Slack Penalty 𝑪
§ What is the role of the slack penalty C?
  min_{w} ½‖w‖² + C ⋅ (# number of mistakes)  s.t. ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1
 – C = ∞: Only want w, b that separate the data
 – C = 0: Can set ξi to anything, then w = 0 (basically ignores the data)
[Figure: decision boundaries for big C, "good" C, and small C]
Support Vector Machines
§ SVM in the "natural" form:
  arg min_{w,b} ½ w ⋅ w + C ⋅ Σ_{i=1..n} max{0, 1 - yi ⋅ (w ⋅ xi + b)}
  (first term: margin/regularization, with C the regularization parameter; second term: empirical loss L, how well we fit the training data)
§ Equivalently, with slack variables:
  min_{w,b} ½‖w‖² + C Σ_{i=1..n} ξi  s.t. ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1 - ξi
§ SVM uses the "Hinge Loss": max{0, 1 - z} with z = yi ⋅ (w ⋅ xi + b) (snippet below)
[Figure: 0/1 loss vs. hinge loss as a function of z]
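For concreteness, the hinge loss as a small Python helper (illustrative; the function name is ours):

def hinge_loss(w, b, x, y):
    """Hinge loss max{0, 1 - z} with z = y * (w . x + b): zero when the point
    is on the correct side with margin >= 1, linear penalty otherwise."""
    z = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - z)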
SUPPORT VECTOR MACHINES: HOW TO COMPUTE THE MARGIN?
SVM: How to estimate w?
§ Want to estimate w and b!
 – Standard way: Use a solver!
  • Solver: software for finding solutions to "common" optimization problems
§ Use a quadratic solver:
 – Minimize a quadratic function
 – Subject to linear constraints
  min_{w,b} ½ w ⋅ w + C Σ_{i=1..n} ξi  s.t. ∀i, yi ⋅ (xi ⋅ w + b) ≥ 1 - ξi
§ Problem: Solvers are inefficient for big data!
SVM: How to estimate w?
§ Want to estimate w, b!
§ Alternative approach:
 – Want to minimize f(w,b):
  f(w,b) = ½ Σ_{j=1..d} (w(j))² + C Σ_{i=1..n} max{0, 1 - yi ⋅ (Σ_{j=1..d} w(j) xi(j) + b)}
§ Side note:
 – How to minimize convex functions g(z)?
 – Use gradient descent: min_z g(z)
 – Iterate: z_{t+1} ← z_t - η ∇g(z_t)
[Figure: a convex function g(z)]
SVM: How to estimate w?
§ Want to minimize f(w,b):
  f(w,b) = ½ Σ_{j=1..d} (w(j))² + C Σ_{i=1..n} max{0, 1 - yi ⋅ (Σ_{j=1..d} w(j) xi(j) + b)}
  (the second term is the empirical loss L(xi, yi))
§ Compute the gradient ∇f(j) w.r.t. w(j):
  ∇f(j) = ∂f(w,b)/∂w(j) = w(j) + C Σ_{i=1..n} ∂L(xi, yi)/∂w(j)
  where ∂L(xi, yi)/∂w(j) = 0 if yi ⋅ (w ⋅ xi + b) ≥ 1, and -yi xi(j) otherwise
SVM: How to estimate w?
§ Gradient descent:
  Iterate until convergence:
  • For j = 1 … d
   • Evaluate: ∇f(j) = ∂f(w,b)/∂w(j) = w(j) + C Σ_{i=1..n} ∂L(xi, yi)/∂w(j)
   • Update: w(j) ← w(j) - η ∇f(j)
  (η … learning rate parameter; C … regularization parameter)
§ Problem:
 – Computing ∇f(j) takes O(n) time! (see the sketch below)
  • n … size of the training dataset
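A sketch that makes the O(n) cost visible: evaluating a single gradient coordinate loops over every training example (the function name and in-memory `data` list are illustrative):

def grad_coordinate(w, b, data, C, j):
    """Full-batch gradient coordinate:
    grad_f(j) = w(j) + C * sum_i dL(x_i, y_i)/dw(j),
    where the hinge term contributes -y_i * x_i(j) only when
    y_i * (w . x_i + b) < 1. Touches every example: O(n) per call."""
    g = w[j]
    for x, y in data:                        # the O(n) loop over all examples
        if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1:
            g += C * (-y * x[j])             # subgradient of the hinge term
    return g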
SVM: How to estimate w?
§ Stochastic Gradient Descent
 – Instead of evaluating the gradient over all examples, evaluate it for each individual training example:
  ∇f(j)(xi) = w(j) + C ⋅ ∂L(xi, yi)/∂w(j)
  (We just had: ∇f(j) = w(j) + C Σ_{i=1..n} ∂L(xi, yi)/∂w(j))
§ Stochastic gradient descent:
  Iterate until convergence:
  • For i = 1 … n
   • For j = 1 … d
    • Compute: ∇f(j)(xi)
    • Update: w(j) ← w(j) - η ∇f(j)(xi)
 – Notice: no summation over i anymore (full loop sketched below)
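Putting the pieces together, a hedged sketch of SGD-SVM in Python. The w(j) updates follow the slide; the update of b is our assumption (the slide shows only the w updates), and the epochs/η/C defaults are illustrative:

def sgd_svm(data, d, C=0.1, eta=0.01, epochs=5):
    """Stochastic gradient descent for the linear SVM objective
    f(w,b) = 0.5*||w||^2 + C * sum_i max{0, 1 - y_i (w . x_i + b)}.
    One (sub)gradient step per training example."""
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for j in range(d):
                g = w[j]                     # gradient of 0.5*||w||^2
                if margin < 1:
                    g += C * (-y * x[j])     # hinge part, only if margin < 1
                w[j] -= eta * g              # w(j) <- w(j) - eta * grad_f(j)(x_i)
            if margin < 1:
                b += eta * C * y             # subgradient step on b (assumption)
    return w, b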
SUPPORT VECTOR MACHINES: EXAMPLE
Example: Text categorization
§ Example by Leon Bottou:
 – Reuters RCV1 document corpus
  • Predict the category of a document
 – One vs. the rest classification
 – n = 781,000 training examples (documents)
 – 23,000 test examples
 – d = 50,000 features
  • One feature per word
  • Remove stop-words
  • Remove low-frequency words
Example: Text categorization
§ Questions:
 – (1) Is SGD successful at minimizing f(w,b)?
 – (2) How quickly does SGD find the min of f(w,b)?
 – (3) What is the error on a test set?
[Table: training time, value of f(w,b), and test error for Standard SVM, "Fast SVM", and SGD SVM]
§ Answers:
 – (1) SGD-SVM is successful at minimizing the value of f(w,b)
 – (2) SGD-SVM is super fast
 – (3) SGD-SVM test set error is comparable
Optimization "Accuracy"
§ Optimization quality: |f(w,b) - f(wopt, bopt)|
[Figure: optimization quality vs. training time for Conventional SVM and SGD SVM]
§ For optimizing f(w,b) within reasonable quality, SGD-SVM is super fast
SGD vs. Batch Conjugate Gradient
§ SGD on the full dataset vs. Conjugate Gradient on a sample of n training examples
§ Theory says: Gradient descent converges in linear time 𝒌; Conjugate gradient converges in √𝒌
 – 𝒌 … condition number
§ Bottom line: Doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times
Practical Considerations
§ Need to choose the learning rate η and t0:
  w_{t+1} ← w_t - (η / (t + t0)) ⋅ (w_t + C ⋅ ∂L(xi, yi)/∂w)
§ Leon suggests:
 – Choose t0 so that the expected initial updates are comparable with the expected size of the weights
 – Choose η:
  • Select a small subsample
  • Try various rates η (e.g., 10, 1, 0.1, 0.01, …)
  • Pick the one that most reduces the cost
  • Use that η for the next 100k iterations on the full dataset
Practical Considerations
§ Sparse Linear SVM:
 – Feature vector xi is sparse (contains many zeros)
  • Do not do: xi = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]
  • But represent xi as a sparse vector: xi = [(4,1), (9,5), …]
 – Can we do the SGD update more efficiently?
  w ← w - η(w + C ⋅ ∂L(xi, yi)/∂w)
 – Approximate in 2 steps (see the sketch below):
  w ← w - η C ⋅ ∂L(xi, yi)/∂w   (cheap: xi is sparse, so only a few coordinates j of w are updated)
  w ← w(1 - η)                   (expensive: w is not sparse, all coordinates need to be updated)
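A sketch of the two-step update with the usual lazy-scaling trick: keep the true weights implicitly as scale * w, so the dense shrink w ← (1-η)w costs O(1). The trick itself is a common implementation choice, not spelled out on the slide:

def sparse_sgd_step(w, scale, x_sparse, y, b, C, eta):
    """Two-step SGD update with lazy scaling. The true weight vector is
    scale * w; `x_sparse` is a list of (index, value) pairs.
    Step 1 (cheap): hinge gradient touches only the nonzeros of x_i.
    Step 2 (would be expensive): w <- (1 - eta) * w is done in O(1)
    by folding (1 - eta) into the scalar `scale` instead of rewriting w."""
    dot = scale * sum(w[j] * v for j, v in x_sparse)
    if y * (dot + b) < 1:                    # margin violated: hinge step
        for j, v in x_sparse:
            w[j] += eta * C * y * v / scale  # so that scale*w moves by eta*C*y*v
    scale *= (1.0 - eta)                     # lazy dense shrink, O(1)
    return scale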
Practical Considerations
§ Stopping criteria: How many iterations of SGD?
 – Early stopping with cross validation:
  • Create a validation set
  • Monitor the cost function on the validation set
  • Stop when the loss stops decreasing
 – Early stopping:
  • Extract two disjoint subsamples A and B of the training data
  • Train on A, stop by validating on B
  • The number of epochs is an estimate of k
  • Train for k epochs on the full dataset
What about multiple classes?
§ Idea 1: One against all
 – Learn 3 classifiers:
  • + vs. {o, -}
  • - vs. {o, +}
  • o vs. {+, -}
 – Obtain: w+ b+, w- b-, wo bo
§ How to classify?
 – Return class c = arg max_c wc ⋅ x + bc (sketched below)
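A small sketch of the arg-max rule in Python; representing the classifiers as a dict from class label to (w_c, b_c) is our choice of container:

def predict_one_vs_all(classifiers, x):
    """One-against-all prediction: score x under each class's (w_c, b_c)
    and return arg max_c w_c . x + b_c."""
    def score(c):
        w, b = classifiers[c]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(classifiers, key=score)       # iterates over the class labels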
Learn 1 classifier: Multiclass SVM
§ Idea 2: Learn 3 sets of weights simultaneously!
 – For each class c, estimate wc, bc
 – Want the correct class of each example (xi, yi) to have the highest margin:
  w_{yi} ⋅ xi + b_{yi} ≥ 1 + wc ⋅ xi + bc   ∀c ≠ yi, ∀i
Multiclass SVM
§ Optimization problem:
  min_{w,b} ½ Σ_c ‖wc‖² + C Σ_{i=1..n} ξi
  s.t. ∀c ≠ yi, ∀i: w_{yi} ⋅ xi + b_{yi} ≥ wc ⋅ xi + bc + 1 - ξi
       ∀i: ξi ≥ 0
§ To obtain the parameters wc, bc (for each class c) we can use similar techniques as for the 2-class SVM
§ SVM is widely perceived as a very powerful learning algorithm