CS 5614: (Big) Data Management Systems
B. Aditya Prakash Lecture #20: Machine Learning 2
Supervised Learning
§ Example: Spam filtering
§ Instance space x ∈ X (|X| = n data points)
 – Binary or real-valued feature vector x of word occurrences
 – d features (words + other things, d ~ 100,000)
§ Class y ∈ Y
 – y: Spam (+1), Ham (-1)
§ Goal: Estimate a function f(x) so that y = f(x)
More generally: Supervised Learning
§ Would like to do prediction: estimate a function f(x) so that y = f(x)
§ Where y can be:
 – Real number: Regression
 – Categorical: Classification
 – Complex object:
  • Ranking of items, parse tree, etc.
§ Data is labeled:
 – Have many pairs {(x, y)}
  • x … vector of binary, categorical, real-valued features
  • y … class ({+1, -1}, or a real number)
Supervised Learning
§ Task: Given data (X, Y), build a model f() to predict Y' based on X'
§ Strategy: Estimate y = f(x) on (X, Y). Hope that the same f(x) also works to predict the unknown Y'
 – The "hope" is called generalization
  • Overfitting: f(x) predicts Y well but is unable to predict Y'
 – We want to build a model that generalizes well to unseen data
  • But how can we do well on data we have never seen before?!?
[Figure: (X, Y) is the training data; (X', Y') is the test data]
Supervised Learning
§ Idea: Pretend we do not know the data/labels we actually do know
 – Build the model f(x) on the training data; see how well f(x) does on the test data
  • If it does well, then apply it also to X'
§ Refinement: Cross validation
 – Splitting into a single training/validation set is brutal
 – Let's split our data (X, Y) into 10 folds (buckets)
 – Take out 1 fold for validation, train on the remaining 9
 – Repeat this 10 times, report average performance (see the sketch below)
[Figure: data split into training set, validation set, and test set X']
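A minimal sketch of the 10-fold procedure in Python; `train` and `evaluate` are hypothetical stand-ins for the actual model-fitting and scoring routines, and the shuffle/fold mechanics are one reasonable implementation, not the only one:

import random

def cross_validate(X, Y, train, evaluate, k=10):
    """k-fold cross validation: split into k buckets, hold one out for
    validation, train on the rest, and average the k scores.
    Assumed helpers: train(X, Y) -> model, evaluate(model, X, Y) -> score."""
    idx = list(range(len(X)))
    random.shuffle(idx)                      # random assignment to folds
    folds = [idx[i::k] for i in range(k)]    # k roughly equal buckets
    scores = []
    for i in range(k):
        held_out = set(folds[i])             # take out fold i for validation
        tr = [j for j in idx if j not in held_out]
        model = train([X[j] for j in tr], [Y[j] for j in tr])
        scores.append(evaluate(model,
                               [X[j] for j in held_out],
                               [Y[j] for j in held_out]))
    return sum(scores) / k                   # report average performance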
Linear models for classification
§ Binary classification:
  f(x) = +1 if w1 x1 + w2 x2 + … + wd xd ≥ θ
         -1 otherwise
§ Input: Vectors x(j) and labels y(j)
 – Vectors x(j) are real valued where ‖x‖2 = 1
§ Goal: Find vector w = (w1, w2, …, wd)
 – Each wi is a real number
[Figure: "+" and "-" points separated by the hyperplane w ⋅ x = 0; the decision boundary is linear]
Note: the threshold can be folded into the weights: x ⇔ (x, 1), w ⇔ (w, -θ), so the rule becomes sign(w ⋅ x) (sketched below).
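To make the decision rule and the threshold-folding note concrete, a small illustrative Python sketch (the function names are ours, not from the slides):

def predict(w, x, theta):
    """Linear classifier: +1 if w . x >= theta, else -1."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s >= theta else -1

def fold_threshold(w, x, theta):
    """Fold the threshold into the weights: append a constant-1 feature to x
    and -theta to w, so the rule becomes sign(w' . x') with threshold 0."""
    return w + [-theta], x + [1.0]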
Perceptron [Rosenblatt '58]
§ (very) Loose motivation: Neuron
§ Inputs are feature values
§ Each feature has a weight wi
§ Activation is the sum:
 – f(x) = Σi wi xi = w ⋅ x
§ If f(x) is:
 – Positive: Predict +1
 – Negative: Predict -1
[Figure: inputs x1 … x4 (e.g., "viagra", "nigeria") weighted by w1 … w4, summed, and thresholded at ≥ 0; output Spam = +1, Ham = -1; decision boundary w ⋅ x = 0]
Perceptron: Estimating w
§ Perceptron: y' = sign(w ⋅ x)
§ How to find parameters w? (a code sketch follows)
 – Start with w(0) = 0
 – Pick training examples x(t) one by one (from disk)
 – Predict class of x(t) using current weights
  • y' = sign(w(t) ⋅ x(t))
 – If y' is correct (i.e., y(t) = y'):
  • No change: w(t+1) = w(t)
 – If y' is wrong: adjust w(t):
  • w(t+1) = w(t) + η ⋅ y(t) ⋅ x(t)
 – η is the learning rate parameter
 – x(t) is the t-th training example
 – y(t) is the true t-th class label ({+1, -1})
[Figure: a misclassified x(t) with y(t) = 1 pulls w(t) by η ⋅ y(t) ⋅ x(t) toward w(t+1)]
Note that the Perceptron is a conservative algorithm: it ignores samples that it classifies correctly.
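A minimal sketch of the update loop in Python, assuming the examples fit in an in-memory list and a fixed number of epochs (both are illustration choices; the slide streams examples from disk):

def train_perceptron(examples, d, eta=1.0, epochs=10):
    """Perceptron: start from w = 0; on a mistake, nudge w toward y * x.
    `examples` is a list of (x, y) with x a length-d list and y in {+1, -1}."""
    w = [0.0] * d
    for _ in range(epochs):
        for x, y in examples:
            y_pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
            if y_pred != y:                 # conservative: update only on mistakes
                for j in range(d):
                    w[j] += eta * y * x[j]  # w(t+1) = w(t) + eta * y(t) * x(t)
    return w

# Example use: w = train_perceptron([([1.0, 0.0], 1), ([0.0, 1.0], -1)], d=2)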
Perceptron: The Good and the Bad
§ Good: Perceptron convergence theorem:
 – If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the Perceptron learning algorithm will converge
§ Bad: Never converges: if the data is not separable, the weights dance around indefinitely
§ Bad: Mediocre generalization:
 – Finds a "barely" separating solution
Updating the Learning Rate
§ Perceptron will oscillate and won't converge
§ So, when to stop learning?
§ (1) Slowly decrease the learning rate η (snippet below)
 – A classic way: η = c1/(t + c2)
  • But we also need to determine the constants c1 and c2
§ (2) Stop when the training error stops changing
§ (3) Have a small test dataset and stop when the test set error stops decreasing
§ (4) Stop when we have reached some maximum number of passes over the data
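Option (1) is a one-liner; the constants below are placeholders that would need tuning, as noted above:

def learning_rate(t, c1=1.0, c2=10.0):
    """Classic decreasing schedule eta_t = c1 / (t + c2)."""
    return c1 / (t + c2)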
SUPPORT VECTOR MACHINES
Support Vector Machines
§ Want to separate "+" from "-" using a line
§ Data: Training examples:
 – (x1, y1) … (xn, yn)
§ Each example i:
 – xi = (xi(1), …, xi(d))
  • xi(j) is real valued
 – yi ∈ {-1, +1}
§ Inner product: w ⋅ x = Σ_{j=1..d} w(j) ⋅ x(j)
§ Which is the best linear separator (defined by w)?
[Figure: "+" and "-" points with three candidate separating lines A, B, C]
Largest Margin
§ Distance from the separating hyperplane corresponds to the "confidence" of the prediction
§ Example:
 – We are more sure about the class of A and B than of C
Largest Margin
§ Margin 𝜸: Distance of the closest example from the decision line/hyperplane
The reason we define the margin this way is theoretical convenience and the existence of generalization error bounds that depend on the value of the margin.
Why is maximizing 𝜸 a good idea?
§ Remember: Dot product A ⋅ B = ‖A‖ ⋅ ‖B‖ ⋅ cos θ
[Figure: the projection of A onto B has length ‖A‖ cos θ]
Why is maximizing 𝜸 a good idea? What is the margin?
§ Let:
 – Line L: w ⋅ x + b = w(1)x(1) + w(2)x(2) + b = 0
 – w = (w(1), w(2))
 – Point A = (xA(1), xA(2))
 – Point M on the line = (xM(1), xM(2))
 – H … projection of A onto L
§ Distance of A from L (for ‖w‖ = 1):
  d(A, L) = |AH| = |(A - M) ⋅ w|
          = |(xA(1) - xM(1)) w(1) + (xA(2) - xM(2)) w(2)|
          = xA(1) w(1) + xA(2) w(2) + b = w ⋅ A + b
  Remember: xM(1) w(1) + xM(2) w(2) = -b since M belongs to line L
Largest Margin
Support Vector Machine
§ Maximize the margin:
 – Good according to intuition, theory (VC dimension) & practice
 – 𝜸 is the margin … the distance from the separating hyperplane w ⋅ x + b = 0
§ Maximizing the margin:
  max_{w,γ} γ  s.t.  ∀i, yi ⋅ (w ⋅ xi + b) ≥ γ
[Figure: "+" and "-" points separated by w ⋅ x + b = 0, with margin γ on either side]
SUPPORT VECTOR MACHINES: DERIVING THE MARGIN
Support Vector Machines
§ The separating hyperplane is defined by the support vectors
 – Points on the +/- margin planes of the solution
 – If you knew these points, you could ignore the rest
 – Generally, d+1 support vectors (for d-dimensional data)
Canonical Hyperplane: Problem
§ The same hyperplane is expressed by any scaling of (w, b): w ⋅ x + b = 0 and 2w ⋅ x + 2b = 0 define the same line, so the constraint yi ⋅ (w ⋅ xi + b) ≥ γ can be satisfied by simply scaling w up
Canonical Hyperplane: Solution
§ Fix the scale: require the support vectors to satisfy |w ⋅ x + b| = 1 (the canonical hyperplane); then the margin is γ = 1/‖w‖
Maximizing the Margin
§ Maximizing γ = 1/‖w‖ is the same as minimizing ‖w‖, so we arrive at:
  min_{w,b} ½‖w‖²  s.t.  ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1
Non-linearly Separable Data
§ If the data is not separable, introduce a penalty:
  min_{w} ½‖w‖² + C ⋅ (# number of mistakes)
  s.t. ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1
 – Minimize ‖w‖² plus the number of training mistakes
 – Set C using cross validation
§ How to penalize mistakes?
 – All mistakes are not equally bad!
[Figure: mostly separable "+"/"-" points, with a stray "+" and "-" on the wrong sides]
Support Vector Machines
§ Introduce slack variables ξi:
  min_{w,b,ξi≥0} ½‖w‖² + C ⋅ Σ_{i=1..n} ξi
  s.t. ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1 - ξi
§ If point xi is on the wrong side of the margin, then it gets penalty ξi
§ For each data point:
 – If margin ≥ 1, don't care
 – If margin < 1, pay linear penalty
[Figure: points on the wrong side of the margin pay penalties ξi, ξj]
Slack Penalty 𝑪
§ What is the role of the slack penalty C?
  min_{w} ½‖w‖² + C ⋅ (# number of mistakes)  s.t. ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1
 – C = ∞: Only want w, b that separate the data
 – C = 0: Can set ξi to anything, then w = 0 (basically ignores the data)
[Figure: decision boundaries for big C, "good" C, and small C]
Support Vector Machines
§ SVM in the "natural" form:
  arg min_{w,b} ½ w ⋅ w + C ⋅ Σ_{i=1..n} max{0, 1 - yi ⋅ (w ⋅ xi + b)}
  (first term: margin/regularization, with C the regularization parameter; second term: empirical loss L, how well we fit the training data)
§ Equivalently, with slack variables:
  min_{w,b} ½‖w‖² + C Σ_{i=1..n} ξi  s.t. ∀i, yi ⋅ (w ⋅ xi + b) ≥ 1 - ξi
§ SVM uses the "Hinge Loss": max{0, 1 - z} with z = yi ⋅ (w ⋅ xi + b) (snippet below)
[Figure: 0/1 loss vs. hinge loss as a function of z]
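For concreteness, the hinge loss as a small Python helper (illustrative; the function name is ours):

def hinge_loss(w, b, x, y):
    """Hinge loss max{0, 1 - z} with z = y * (w . x + b): zero when the point
    is on the correct side with margin >= 1, linear penalty otherwise."""
    z = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - z)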
SUPPORT VECTOR MACHINES: HOW TO COMPUTE THE MARGIN?
SVM: How to estimate w?
§ Want to estimate w and b!
 – Standard way: Use a solver!
  • Solver: software for finding solutions to "common" optimization problems
§ Use a quadratic solver:
 – Minimize a quadratic function
 – Subject to linear constraints
  min_{w,b} ½ w ⋅ w + C Σ_{i=1..n} ξi  s.t. ∀i, yi ⋅ (xi ⋅ w + b) ≥ 1 - ξi
§ Problem: Solvers are inefficient for big data!
SVM: How to estimate w?
§ Want to estimate w, b!
§ Alternative approach:
 – Want to minimize f(w,b):
  f(w,b) = ½ Σ_{j=1..d} (w(j))² + C Σ_{i=1..n} max{0, 1 - yi ⋅ (Σ_{j=1..d} w(j) xi(j) + b)}
§ Side note:
 – How to minimize convex functions g(z)?
 – Use gradient descent: min_z g(z)
 – Iterate: z_{t+1} ← z_t - η ∇g(z_t)
[Figure: a convex function g(z)]
SVM: How to estimate w?
§ Want to minimize f(w,b):
  f(w,b) = ½ Σ_{j=1..d} (w(j))² + C Σ_{i=1..n} max{0, 1 - yi ⋅ (Σ_{j=1..d} w(j) xi(j) + b)}
  (the second term is the empirical loss L(xi, yi))
§ Compute the gradient ∇f(j) w.r.t. w(j):
  ∇f(j) = ∂f(w,b)/∂w(j) = w(j) + C Σ_{i=1..n} ∂L(xi, yi)/∂w(j)
  where ∂L(xi, yi)/∂w(j) = 0 if yi ⋅ (w ⋅ xi + b) ≥ 1, and -yi xi(j) otherwise
SVM: How to estimate w?
§ Gradient descent:
  Iterate until convergence:
  • For j = 1 … d
   • Evaluate: ∇f(j) = ∂f(w,b)/∂w(j) = w(j) + C Σ_{i=1..n} ∂L(xi, yi)/∂w(j)
   • Update: w(j) ← w(j) - η ∇f(j)
  (η … learning rate parameter; C … regularization parameter)
§ Problem:
 – Computing ∇f(j) takes O(n) time! (see the sketch below)
  • n … size of the training dataset
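A sketch that makes the O(n) cost visible: evaluating a single gradient coordinate loops over every training example (the function name and in-memory `data` list are illustrative):

def grad_coordinate(w, b, data, C, j):
    """Full-batch gradient coordinate:
    grad_f(j) = w(j) + C * sum_i dL(x_i, y_i)/dw(j),
    where the hinge term contributes -y_i * x_i(j) only when
    y_i * (w . x_i + b) < 1. Touches every example: O(n) per call."""
    g = w[j]
    for x, y in data:                        # the O(n) loop over all examples
        if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1:
            g += C * (-y * x[j])             # subgradient of the hinge term
    return g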
SVM: How to estimate w?
§ Stochastic Gradient Descent
 – Instead of evaluating the gradient over all examples, evaluate it for each individual training example:
  ∇f(j)(xi) = w(j) + C ⋅ ∂L(xi, yi)/∂w(j)
  (We just had: ∇f(j) = w(j) + C Σ_{i=1..n} ∂L(xi, yi)/∂w(j))
§ Stochastic gradient descent:
  Iterate until convergence:
  • For i = 1 … n
   • For j = 1 … d
    • Compute: ∇f(j)(xi)
    • Update: w(j) ← w(j) - η ∇f(j)(xi)
 – Notice: no summation over i anymore (full loop sketched below)
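Putting the pieces together, a hedged sketch of SGD-SVM in Python. The w(j) updates follow the slide; the update of b is our assumption (the slide shows only the w updates), and the epochs/η/C defaults are illustrative:

def sgd_svm(data, d, C=0.1, eta=0.01, epochs=5):
    """Stochastic gradient descent for the linear SVM objective
    f(w,b) = 0.5*||w||^2 + C * sum_i max{0, 1 - y_i (w . x_i + b)}.
    One (sub)gradient step per training example."""
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for j in range(d):
                g = w[j]                     # gradient of 0.5*||w||^2
                if margin < 1:
                    g += C * (-y * x[j])     # hinge part, only if margin < 1
                w[j] -= eta * g              # w(j) <- w(j) - eta * grad_f(j)(x_i)
            if margin < 1:
                b += eta * C * y             # subgradient step on b (assumption)
    return w, b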
SUPPORT VECTOR MACHINES: EXAMPLE
Example: Text categorization
§ Example by Leon Bottou:
 – Reuters RCV1 document corpus
  • Predict the category of a document
 – One vs. the rest classification
 – n = 781,000 training examples (documents)
 – 23,000 test examples
 – d = 50,000 features
  • One feature per word
  • Remove stop-words
  • Remove low-frequency words
Example: Text categorization
§ Questions:
 – (1) Is SGD successful at minimizing f(w,b)?
 – (2) How quickly does SGD find the min of f(w,b)?
 – (3) What is the error on a test set?
[Table: training time, value of f(w,b), and test error for Standard SVM, "Fast SVM", and SGD SVM]
§ Answers:
 – (1) SGD-SVM is successful at minimizing the value of f(w,b)
 – (2) SGD-SVM is super fast
 – (3) SGD-SVM test set error is comparable
Optimization "Accuracy"
§ Optimization quality: |f(w,b) - f(wopt, bopt)|
[Figure: optimization quality vs. training time for Conventional SVM and SGD SVM]
§ For optimizing f(w,b) within reasonable quality, SGD-SVM is super fast
SGD vs. Batch Conjugate Gradient
§ SGD on the full dataset vs. Conjugate Gradient on a sample of n training examples
§ Theory says: Gradient descent converges in linear time 𝒌; Conjugate gradient converges in √𝒌
 – 𝒌 … condition number
§ Bottom line: Doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) CG update a few times
Practical Considerations
§ Need to choose the learning rate η and t0:
  w_{t+1} ← w_t - (η / (t + t0)) ⋅ (w_t + C ⋅ ∂L(xi, yi)/∂w)
§ Leon suggests:
 – Choose t0 so that the expected initial updates are comparable with the expected size of the weights
 – Choose η:
  • Select a small subsample
  • Try various rates η (e.g., 10, 1, 0.1, 0.01, …)
  • Pick the one that most reduces the cost
  • Use that η for the next 100k iterations on the full dataset
Practical Considerations
§ Sparse Linear SVM:
 – Feature vector xi is sparse (contains many zeros)
  • Do not do: xi = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]
  • But represent xi as a sparse vector: xi = [(4,1), (9,5), …]
 – Can we do the SGD update more efficiently?
  w ← w - η(w + C ⋅ ∂L(xi, yi)/∂w)
 – Approximate in 2 steps (see the sketch below):
  w ← w - η C ⋅ ∂L(xi, yi)/∂w   (cheap: xi is sparse, so only a few coordinates j of w are updated)
  w ← w(1 - η)                   (expensive: w is not sparse, all coordinates need to be updated)
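A sketch of the two-step update with the usual lazy-scaling trick: keep the true weights implicitly as scale * w, so the dense shrink w ← (1-η)w costs O(1). The trick itself is a common implementation choice, not spelled out on the slide:

def sparse_sgd_step(w, scale, x_sparse, y, b, C, eta):
    """Two-step SGD update with lazy scaling. The true weight vector is
    scale * w; `x_sparse` is a list of (index, value) pairs.
    Step 1 (cheap): hinge gradient touches only the nonzeros of x_i.
    Step 2 (would be expensive): w <- (1 - eta) * w is done in O(1)
    by folding (1 - eta) into the scalar `scale` instead of rewriting w."""
    dot = scale * sum(w[j] * v for j, v in x_sparse)
    if y * (dot + b) < 1:                    # margin violated: hinge step
        for j, v in x_sparse:
            w[j] += eta * C * y * v / scale  # so that scale*w moves by eta*C*y*v
    scale *= (1.0 - eta)                     # lazy dense shrink, O(1)
    return scale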
Practical Considerations
§ Stopping criteria: How many iterations of SGD?
 – Early stopping with cross validation:
  • Create a validation set
  • Monitor the cost function on the validation set
  • Stop when the loss stops decreasing
 – Early stopping:
  • Extract two disjoint subsamples A and B of the training data
  • Train on A, stop by validating on B
  • The number of epochs is an estimate of k
  • Train for k epochs on the full dataset
What about multiple classes?
§ Idea 1: One against all
 – Learn 3 classifiers:
  • + vs. {o, -}
  • - vs. {o, +}
  • o vs. {+, -}
 – Obtain: w+ b+, w- b-, wo bo
§ How to classify?
 – Return class c = arg max_c wc ⋅ x + bc (sketched below)
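A small sketch of the arg-max rule in Python; representing the classifiers as a dict from class label to (w_c, b_c) is our choice of container:

def predict_one_vs_all(classifiers, x):
    """One-against-all prediction: score x under each class's (w_c, b_c)
    and return arg max_c w_c . x + b_c."""
    def score(c):
        w, b = classifiers[c]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(classifiers, key=score)       # iterates over the class labels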
Learn 1 classifier: Multiclass SVM
§ Idea 2: Learn 3 sets of weights simultaneously!
 – For each class c, estimate wc, bc
 – Want the correct class of each example (xi, yi) to have the highest margin:
  w_{yi} ⋅ xi + b_{yi} ≥ 1 + wc ⋅ xi + bc   ∀c ≠ yi, ∀i
Multiclass SVM
§ Optimization problem:
  min_{w,b} ½ Σ_c ‖wc‖² + C Σ_{i=1..n} ξi
  s.t. ∀c ≠ yi, ∀i: w_{yi} ⋅ xi + b_{yi} ≥ wc ⋅ xi + bc + 1 - ξi
       ∀i: ξi ≥ 0
§ To obtain the parameters wc, bc (for each class c) we can use similar techniques as for the 2-class SVM
§ SVM is widely perceived as a very powerful learning algorithm