Maria-Florina Balcan 03/25/2015
• Support Vector Machines (SVMs).
• Semi-Supervised SVMs.
• Semi-Supervised Learning.
Support Vector Machines (SVMs)
One of the most theoretically well-motivated and practically most effective classification algorithms in machine learning.
Directly motivated by Margins and Kernels!
Geometric Margin
Definition: The margin of example 𝑥 w.r.t. a linear sep. 𝑤 is the distance from 𝑥 to the plane 𝑤 ⋅ 𝑥 = 0.
[Figure: examples 𝑥1 and 𝑥2 with their margins, i.e., their distances to the separator 𝑤]
If ‖𝑤‖ = 1, the margin of 𝑥 w.r.t. 𝑤 is |𝑥 ⋅ 𝑤|.
WLOG we consider homogeneous linear separators [𝑤0 = 0].
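The distance computation can be checked numerically. A minimal numpy sketch; the point and separator below are made up for illustration:

```python
import numpy as np

def geometric_margin(x, w):
    """Distance from example x to the hyperplane w . x = 0."""
    return abs(np.dot(w, x)) / np.linalg.norm(w)

# Made-up example point and (non-unit) separator.
w = np.array([3.0, 4.0])
x = np.array([1.0, 2.0])

print(geometric_margin(x, w))      # |3 + 8| / 5 = 2.2

# For a unit-norm w the margin reduces to |x . w|, as on the slide.
w_unit = w / np.linalg.norm(w)
print(abs(np.dot(x, w_unit)))      # 2.2 again
```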
[Figure: a linear separator 𝑤 with margin 𝛾 on either side of the separating hyperplane]
Definition: The margin 𝛾𝑤 of a set of examples 𝑆 w.r.t. a linear separator 𝑤 is the smallest margin over points 𝑥 ∈ 𝑆.
Definition: The margin 𝛾 of a set of examples 𝑆 is the maximum 𝛾𝑤 over all linear separators 𝑤.
Both sample complexity and algorithmic implications.
Margin Important Theme in ML
Sample/Mistake Bound complexity:
• If the margin is large, the number of mistakes Perceptron makes is small (independent of the dimension of the space)!
• If the margin 𝛾 is large and the algorithm produces a large-margin classifier, then the amount of data needed depends only on R/𝛾 [Bartlett & Shawe-Taylor ’99].
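The Perceptron guarantee (# mistakes ≤ (R/𝛾)²) can be sanity-checked on synthetic data. A sketch with made-up data, constructed so the margin w.r.t. the unit separator (1, 0) is at least 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: the first coordinate has |x1| >= 1 and determines the label,
# so the margin w.r.t. w* = (1, 0) is at least gamma = 1.
n = 200
y = rng.choice([-1.0, 1.0], size=n)
X = np.column_stack([y * rng.uniform(1.0, 2.0, size=n),
                     rng.uniform(-2.0, 2.0, size=n)])
R = np.linalg.norm(X, axis=1).max()        # radius of the data

def perceptron(X, y, epochs=10):
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:    # mistake: update
                w += yi * xi
                mistakes += 1
    return w, mistakes

w, mistakes = perceptron(X, y)
print(mistakes, (R / 1.0) ** 2)            # mistakes stays below (R/gamma)^2
```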
Suggests searching for a large margin classifier… SVMs
Algorithmic Implications
Input: 𝛾, S={(x1, 𝑦1), …,(xm, 𝑦m)};
Output: w, a separator of margin 𝛾 over S
Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
Find: some 𝑤 where:
• ‖𝑤‖ = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
First, assume we know a lower bound on the margin 𝛾
Realizable case, where the data is linearly separable by margin 𝛾
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)};
Output: maximum margin separator over S
Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
Find: some 𝑤 and maximum 𝛾 where:
• ‖𝑤‖ = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
E.g., search for the best possible 𝛾
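In 2D this search can literally be carried out by scanning unit vectors 𝑤 and recording the largest margin any of them achieves. A toy sketch; the data points are made up:

```python
import numpy as np

# Made-up separable data: labels given by the sign of the first coordinate.
X = np.array([[2.0, 0.0], [1.5, 1.0], [-2.0, 0.5], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

best_gamma, best_w = -np.inf, None
for theta in np.linspace(0.0, 2.0 * np.pi, 2000):
    w = np.array([np.cos(theta), np.sin(theta)])   # ||w|| = 1
    gamma = np.min(y * (X @ w))                    # margin of w on S
    if gamma > best_gamma:
        best_gamma, best_w = gamma, w
print(best_gamma, best_w)
```

This brute-force scan only works in 2D; the point of the reformulation that follows is to get the same answer from a tractable convex program in any dimension.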
Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)};
Maximize 𝛾 under the constraint:
• ‖𝑤‖ = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
This is a constrained optimization problem: the objective function is 𝛾, and the constraints are ‖𝑤‖ = 1 and 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾.
• Famous example of constrained optimization: linear programming, where the objective function is linear and the constraints are linear (in)equalities.
• Here, the constraint ‖𝑤‖ = 1 is non-linear.
In fact, it’s even non-convex: the feasible set ‖𝑤‖ = 1 (the circle 𝑤1² + 𝑤2² = 1) is not a convex set.
Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)};
Maximize 𝛾 under the constraints:
• ‖𝑤‖ = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
Let 𝑤’ = 𝑤/𝛾; then maximizing 𝛾 is equivalent to minimizing ‖𝑤’‖² (since ‖𝑤’‖² = 1/𝛾²). So, dividing both sides of the constraints by 𝛾 and writing them in terms of 𝑤’, we get:
Minimize ‖𝑤’‖² under the constraint:
• For all i, 𝑦𝑖 𝑤’ ⋅ 𝑥𝑖 ≥ 1
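The rescaling is easy to verify on a toy instance (the numbers are made up): two points at 𝑥 = ±2 on a line, unit separator 𝑤 = (1, 0), margin 𝛾 = 2:

```python
import numpy as np

# Made-up instance: max-margin unit separator w = (1, 0) with margin gamma = 2.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])

w = np.array([1.0, 0.0])                # unit-norm separator
gamma = np.min(y * (X @ w))             # margin = 2
w_prime = w / gamma                     # rescaled separator w' = w / gamma

# The rescaled constraints y_i w'.x_i >= 1 hold, and ||w'||^2 = 1/gamma^2.
assert np.all(y * (X @ w_prime) >= 1.0)
assert np.isclose(np.dot(w_prime, w_prime), 1.0 / gamma ** 2)
```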
[Figure: separator 𝑤’ with the margin hyperplanes 𝑤’ ⋅ 𝑥 = 1 and 𝑤’ ⋅ 𝑥 = −1]
Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)};
Find argmin𝑤 ‖𝑤‖² s.t.:
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 1
This is a constrained optimization problem.
• The objective is convex (quadratic)
• All constraints are linear
• Can solve efficiently (in poly time) using standard quadratic programming (QP) software
Support Vector Machines (SVMs)
Question: what if the data isn’t perfectly linearly separable?
Issue 1: we now have two objectives:
• maximize the margin
• minimize the # of misclassifications.
Ans 1: Let’s optimize their sum: minimize ‖𝑤‖² + 𝐶 (# misclassifications), where 𝐶 is some tradeoff constant.
Issue 2: This is computationally hard (NP-hard) [Guruswami-Raghavendra ’06], even if we didn’t care about the margin and just minimized the # of mistakes.
[Figure: nearly separable data; separator 𝑤 with margin lines 𝑤 ⋅ 𝑥 = ±1 and a few points on the wrong side]
Support Vector Machines (SVMs)
Question: what if the data isn’t perfectly linearly separable?
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)};
Find argmin𝑤,𝜉1,…,𝜉𝑚 ‖𝑤‖² + 𝐶 ∑𝑖 𝜉𝑖 s.t.:
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 1 − 𝜉𝑖 and 𝜉𝑖 ≥ 0
The 𝜉𝑖 are “slack variables”. Replace “# mistakes” with an upper bound called the “hinge loss”.
ℓ(𝑤, 𝑥, 𝑦) = max(0, 1 − 𝑦 𝑤 ⋅ 𝑥)
𝐶 controls the relative weighting between the twin goals of making ‖𝑤‖² small (margin is large) and ensuring that most examples have functional margin ≥ 1.
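That the hinge loss upper-bounds the 0/1 mistake indicator can be checked pointwise. A small numpy sketch with made-up numbers:

```python
import numpy as np

def hinge_loss(w, x, y):
    """l(w, x, y) = max(0, 1 - y w.x), as on the slide."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def zero_one_loss(w, x, y):
    """1 if (w, x, y) is a mistake, 0 otherwise."""
    return 1.0 if y * np.dot(w, x) <= 0 else 0.0

w = np.array([1.0, -1.0])
cases = [(np.array([2.0, 0.0]), 1.0),     # correct, outside margin: hinge = 0
         (np.array([0.5, 0.0]), 1.0),     # correct but inside margin: hinge = 0.5
         (np.array([-1.0, 0.0]), 1.0)]    # mistake: hinge = 2 >= 1
for x, y in cases:
    assert hinge_loss(w, x, y) >= zero_one_loss(w, x, y)
```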
That is, in ‖𝑤‖² + 𝐶 (# misclassifications), we replace the number of misclassifications with the hinge loss.
∑𝑖 𝜉𝑖 is the total amount we have to move the points to get them on the correct side of the lines 𝑤 ⋅ 𝑥 = ±1, where the distance between the lines 𝑤 ⋅ 𝑥 = 0 and 𝑤 ⋅ 𝑥 = 1 counts as “1 unit”.
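The slack formulation equals minimizing the convex objective ‖𝑤‖² + 𝐶 ∑ᵢ max(0, 1 − 𝑦ᵢ 𝑤 ⋅ 𝑥ᵢ), so even a plain full-batch subgradient method makes a workable sketch (in practice one uses a QP or SGD-style solver; the data below is made up):

```python
import numpy as np

def train_soft_svm(X, y, C=1.0, lr=0.01, steps=2000):
    """Full-batch subgradient descent on ||w||^2 + C * sum_i max(0, 1 - y_i w.x_i)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        active = y * (X @ w) < 1               # points with nonzero hinge subgradient
        w -= lr * (2.0 * w - C * (X[active].T @ y[active]))
    return w

# Made-up data: four clean points plus one mislabeled "noisy" point.
X = np.array([[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.5, -0.5], [-0.2, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])      # the last label is noise
w = train_soft_svm(X, y, C=1.0)
print(np.sign(X @ w))                           # the clean points end up on the right side
```

Because of the slack, the solver pays a small hinge penalty for the noisy point instead of contorting the separator to fit it.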
What if the data is far from being linearly separable?
Example: two image classes with no good linear separator in the pixel representation.
SVM philosophy: “Use a Kernel”
Support Vector Machines (SVMs)
Primal form:
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)};
Find argmin𝑤,𝜉1,…,𝜉𝑚 ‖𝑤‖² + 𝐶 ∑𝑖 𝜉𝑖 s.t.:
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 1 − 𝜉𝑖 and 𝜉𝑖 ≥ 0
Which is equivalent to the Lagrangian Dual:
Input: S={(x1, y1), …,(xm, ym)};
Find argminα (1/2) ∑𝑖 ∑𝑗 y𝑖 y𝑗 α𝑖 α𝑗 x𝑖 ⋅ x𝑗 − ∑𝑖 α𝑖 s.t.:
• For all i, 0 ≤ α𝑖 ≤ 𝐶
• ∑𝑖 y𝑖 α𝑖 = 0
SVMs (Lagrangian Dual)
• The final classifier is: 𝑤 = ∑𝑖 α𝑖 y𝑖 x𝑖
• The points x𝑖 for which α𝑖 ≠ 0 are called the “support vectors”
Kernelizing the Dual SVMs
• In the dual, the examples appear only through inner products: replace x𝑖 ⋅ x𝑗 with K(x𝑖, x𝑗).
• With a kernel, classify 𝑥 using sign(∑𝑖 α𝑖 y𝑖 K(𝑥, x𝑖)); the support vectors are the x𝑖 with α𝑖 ≠ 0.
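The substitution is justified because a kernel computes an inner product in some feature space. For the quadratic kernel this can be checked directly; a sketch (points made up):

```python
import numpy as np

def K(x, z):
    """Quadratic kernel."""
    return np.dot(x, z) ** 2

def phi(x):
    """Explicit feature map with K(x, z) = phi(x) . phi(z) for 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, z), np.dot(phi(x), phi(z)))          # both equal (x.z)^2 = 1.0

# Kernelized classifier: no explicit w, just the alphas and the support vectors.
def predict(x, alphas, ys, Xs):
    return np.sign(sum(a * yi * K(x, xi) for a, yi, xi in zip(alphas, ys, Xs)))
```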
What you should know
• The importance of margins in machine learning.
• The primal form of the SVM optimization problem.
• The dual form of the SVM optimization problem.
• Kernelizing SVM.
• Think about how it relates to Regularized Logistic Regression.
Classic Paradigm Insufficient Nowadays
Modern applications: massive amounts of raw data.
Only a tiny fraction can be annotated by human experts.
[Figure: billions of webpages, images, and protein sequences vs. a single expert labeler]
Semi-Supervised Learning
[Diagram: raw data → a small set of labeled data (“face” / “not face”) → classifier]
Active Learning
[Diagram: raw data → the expert labeler labels a few selected examples (“face” / “not face”) → classifier]
Semi-Supervised Learning
Prominent paradigm in past 15 years in Machine Learning.
• Most applications have lots of unlabeled data, but labeled data is rare or expensive:
• Web page, document classification
• Computer Vision
• Computational Biology,
• ….
Semi-Supervised Learning
[Diagram: a data source produces examples; an expert/oracle labels some of them; the learning algorithm receives both labeled and unlabeled examples and outputs a classifier]
Labeled examples Sl={(x1, y1), …,(xml, yml)}, with each xi drawn i.i.d. from D and yi = c∗(xi).
Unlabeled examples Su={x1, …,xmu}, drawn i.i.d. from D.
Goal: output h with small error over D, where errD(h) = Prx∼D(h(x) ≠ c∗(x)).
Key Insight
Semi-supervised learning: no querying, just lots of additional unlabeled data.
This is a bit puzzling, since it is unclear what unlabeled data can do for you…
Unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.
Combining Labeled and Unlabeled Data
• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
– Transductive SVM [Joachims ’99]
– Co-training [Blum & Mitchell ’98]
– Graph-based methods [B&C01], [ZGL03]
Test-of-time awards at ICML! Workshops [ICML ’03, ICML ’05, …]
Books:
• Semi-Supervised Learning, O. Chapelle, B. Scholkopf and A. Zien (eds.), MIT Press, 2006
• Introduction to Semi-Supervised Learning, Zhu & Goldberg, Morgan & Claypool, 2009
Example of a “typical” assumption: Margins
• The separator goes through low-density regions of the space (large margin).
– assume we are looking for a linear separator
– belief: there should exist one with large separation
[Figure: three panels: the labeled data only; the separator found by the Transductive SVM; the separator found by the standard SVM]
Transductive Support Vector Machines
Optimize for the separator with large margin wrt labeled and unlabeled data. [Joachims ’99]
argmin𝑤 ‖𝑤‖² s.t.:
• y𝑖 𝑤 ⋅ x𝑖 ≥ 1, for all i ∈ {1, …, ml}
• y𝑢 𝑤 ⋅ x𝑢 ≥ 1, for all u ∈ {1, …, mu}, where Su={x1, …,xmu}
• y𝑢 ∈ {−1, 1} for all u ∈ {1, …, mu}
[Figure: a few labeled examples (+/−) and many unlabeled examples (0); the separator 𝑤’ with margin lines 𝑤’ ⋅ 𝑥 = ±1 passes through a low-density region of the unlabeled data]
Input: Sl={(x1, y1), …,(xml , yml)}
Find a labeling of the unlabeled sample and 𝑤 s.t. 𝑤 separates both labeled and unlabeled data with maximum margin.
Transductive Support Vector Machines
argmin𝑤 ‖𝑤‖² + 𝐶 ∑𝑖 𝜉𝑖 + 𝐶 ∑𝑢 𝜉𝑢 s.t.:
• y𝑖 𝑤 ⋅ x𝑖 ≥ 1 − 𝜉𝑖, for all i ∈ {1, …, ml}
• y𝑢 𝑤 ⋅ x𝑢 ≥ 1 − 𝜉𝑢, for all u ∈ {1, …, mu}, where Su={x1, …,xmu}
• y𝑢 ∈ {−1, 1} for all u ∈ {1, …, mu}
NP-hard… The problem is convex only after you have guessed the labels, and there are too many possible guesses.
Transductive Support Vector Machines
Optimize for the separator with large margin wrt labeled and unlabeled data.
Heuristic (Joachims), high-level idea:
• First maximize the margin over the labeled points.
• Use the resulting separator to give initial labels to the unlabeled points.
• Try flipping labels of unlabeled points to see if doing so can increase the margin.
Keep going until no more improvements. Finds a locally-optimal solution.
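The heuristic above can be sketched end-to-end. This is a toy sketch with made-up data: the inner solver is a plain subgradient method standing in for the QP solver used in practice, and it naively retrains after every candidate flip, which a real implementation would avoid:

```python
import numpy as np

def objective(w, X, y, C=1.0):
    """Soft-margin objective ||w||^2 + C * total hinge loss."""
    return np.dot(w, w) + C * np.sum(np.maximum(0.0, 1.0 - y * (X @ w)))

def train(X, y, C=1.0, lr=0.01, steps=1000):
    # Subgradient descent on the soft-margin objective (stand-in for a QP solver).
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        active = y * (X @ w) < 1
        w -= lr * (2.0 * w - C * (X[active].T @ y[active]))
    return w

def tsvm(Xl, yl, Xu, C=1.0):
    w = train(Xl, yl, C)                        # 1) maximize margin on labeled points
    yu = np.sign(Xu @ w)                        # 2) initial labels for unlabeled points
    yu[yu == 0] = 1.0
    X = np.vstack([Xl, Xu])
    y = np.concatenate([yl, yu])
    w = train(X, y, C)
    best = objective(w, X, y, C)
    improved = True
    while improved:                             # 3) flip labels while the objective improves
        improved = False
        for i in range(len(yl), len(y)):
            y_try = y.copy()
            y_try[i] *= -1.0
            w_try = train(X, y_try, C)
            obj_try = objective(w_try, X, y_try, C)
            if obj_try < best - 1e-9:
                y, w, best, improved = y_try, w_try, obj_try, True
    return w, y[len(yl):]

# Made-up toy data: two labeled points and two unlabeled points near them.
Xl = np.array([[2.0, 0.0], [-2.0, 0.0]])
yl = np.array([1.0, -1.0])
Xu = np.array([[1.8, 0.3], [-1.7, -0.2]])
w, yu = tsvm(Xl, yl, Xu)
print(yu)                                       # each unlabeled point inherits its cluster's label
```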
Transductive Support Vector Machines
[Figure: a helpful distribution, highly compatible with a single large-margin separator, vs. non-helpful distributions with 1/𝛾² clusters, where all partitions are separable by large margin]
Semi-Supervised Learning
• Prominent paradigm of the past 15 years in Machine Learning.
• Key insight: unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.
• Prominent techniques:
– Transductive SVM [Joachims ’99]
– Co-training [Blum & Mitchell ’98]
– Graph-based methods [B&C01], [ZGL03]