Semi-Supervised Learning William Cohen


Outline
•  The general idea and an example (NELL)
•  Some types of SSL
   – Margin-based: transductive SVM
      • Logistic regression with entropic regularization
   – Generative: seeded k-means
   – Nearest-neighbor like: graph-based SSL

INTRO TO SEMI-SUPERVISED LEARNING (SSL)

Semi-supervised learning
•  Given:
   – A pool of labeled examples L
   – A (usually larger) pool of unlabeled examples U
•  Option 1 for using L and U:
   – Ignore U and use supervised learning on L
•  Option 2:
   – Ignore labels in L+U and use k-means, etc. to find clusters; then label each cluster using L
•  Question:
   – Can you use both L and U to do better?
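Option 2 (cluster first, then name the clusters using L) can be sketched in a few lines. This is only an illustrative sketch: the function names, the majority-vote rule for naming a cluster, and the random initialization are assumptions, not something the slides specify.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns (centers, cluster index for each point)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = mean(members)
    return centers, assign

def cluster_then_label(labeled, unlabeled, k):
    """Cluster L+U ignoring labels, then name each cluster by majority vote over L."""
    points = [x for x, _ in labeled] + list(unlabeled)
    _, assign = kmeans(points, k)
    votes = {}
    # The first len(labeled) assignments belong to the labeled points.
    for (_, y), a in zip(labeled, assign):
        counts = votes.setdefault(a, {})
        counts[y] = counts.get(y, 0) + 1
    cluster_label = {a: max(v, key=v.get) for a, v in votes.items()}
    # Clusters that contain no labeled point are left unnamed (None).
    return [cluster_label.get(a) for a in assign[len(labeled):]]

labeled = [((0.0, 0.0), "a"), ((10.0, 10.0), "b")]
unlabeled = [(0.5, 0.2), (9.5, 10.1), (0.1, 0.9)]
predictions = cluster_then_label(labeled, unlabeled, k=2)
```

Note that the labels play no role in where the cluster boundaries fall, which is the weakness the question above is pointing at.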


SSL is Somewhere Between Clustering and Supervised Learning


What is a natural grouping among these objects?

slides: Bhavana Dalvi

SSL is Between Clustering and SL

clustering is unconstrained and may not give you what you want

maybe this clustering is as good as the other

SSL is Between Clustering and SL

supervised learning with few labels is also unconstrained and may not give you what you want

SSL is Between Clustering and SL

This clustering isn’t consistent with the labels

|Predicted Green| / |U| ≈ 50%

SSL in Action: The NELL System

Outline
•  The general idea and an example (NELL)
•  Some types of SSL
   – Margin-based: transductive SVM
      • Logistic regression with entropic regularization
   – Generative: seeded k-means
   – Nearest-neighbor like: graph-based SSL


TRANSDUCTIVE SVM

Two Kinds of Learning
•  Inductive SSL:
   –  Input: training set
      •  (x_1,y_1),…,(x_n,y_n)
      •  x_{n+1}, x_{n+2},…,x_{n+m}
   –  Output: classifier
      •  f(x) = y
   –  Classifier can be run on any test example x
•  Transductive SSL:
   –  Input: training set
      •  (x_1,y_1),…,(x_n,y_n)
      •  x_{n+1}, x_{n+2},…,x_{n+m}
   –  Output: classifier
      •  f(x_i) = y
   –  Classifier is only defined for x_i’s seen at training time

Standard SVM

Transductive SVM

Not a convex problem – need to do some sort of search to guess the labels for the unlabeled examples
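The slide figures with the two objectives are not recoverable from the text. As a hedged reminder, one standard way to write them (following the usual soft-margin formulation; the constants C and C* and the slack variables ξ are assumed notation, not taken from the slides) is:

```latex
% Standard soft-margin SVM: uses only the n labeled examples
\min_{w,b,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad y_i\,(w\cdot x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i\ge 0

% Transductive SVM: the labels y_{n+1},...,y_{n+m} of the unlabeled
% examples become optimization variables as well
\min_{w,b,\xi,\xi^{*},\;y_{n+1},\dots,y_{n+m}}\
  \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i + C^{*}\sum_{j=n+1}^{n+m}\xi^{*}_j
\quad\text{s.t.}\quad
\begin{cases}
 y_i\,(w\cdot x_i + b)\ \ge\ 1-\xi_i, & i=1,\dots,n\\
 y_j\,(w\cdot x_j + b)\ \ge\ 1-\xi^{*}_j, & j=n+1,\dots,n+m\\
 y_j\in\{-1,+1\},\quad \xi_i\ge 0,\quad \xi^{*}_j\ge 0
\end{cases}
```

The discrete choice of each y_j is what makes the problem non-convex: for any fixed guess of the unlabeled labels it reduces to an ordinary SVM, so the search is over label assignments.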

SSL using regularized SGD for logistic regression
1.  P(y|x) = logistic(x · w)
2.  Define loss function
3.  Differentiate the function and use gradient descent to learn

Supervised objective, over the labeled examples (x_i, y_i):

$$LCL_D(w) \equiv \sum_i \log P(y_i \mid x_i, w) - \mu \lVert w \rVert_2^2$$

With an added term for the unlabeled examples x_j:

$$LCL_D(w) \equiv \sum_i \log P(y_i \mid x_i, w) - \mu \lVert w \rVert_2^2 - H_U(w), \qquad H_U(w) = -\sum_j \sum_{y'} P(y' \mid x_j, w) \log P(y' \mid x_j, w)$$

H_U(w) is the entropy of predictions on the unlabeled examples. The objective is maximized, so subtracting H_U(w) rewards confident (low-entropy) predictions on U.
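The three steps above can be sketched for the binary case as follows. This is a minimal sketch under stated assumptions: labels are 0/1, the objective is maximized by full-batch gradient ascent rather than SGD, and the weight `lam` on the entropy term, the learning rate, and all function names are illustrative choices that do not come from the slides.

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(z):
    # log(1 + exp(z)) computed without overflow.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_lik(z, y):
    # log P(y | x, w): log sigmoid(z) for y == 1, log sigmoid(-z) for y == 0.
    return -softplus(-z) if y == 1 else -softplus(z)

def entropy(p):
    # Binary entropy in nats; 0 * log(0) is treated as 0.
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def objective(w, labeled, unlabeled, mu, lam):
    """Log conditional likelihood on L, minus L2 penalty, minus entropy on U."""
    lcl = sum(log_lik(dot(w, x), y) for x, y in labeled)
    ent = sum(entropy(sigmoid(dot(w, x))) for x in unlabeled)
    return lcl - mu * dot(w, w) - lam * ent

def gradient(w, labeled, unlabeled, mu, lam):
    g = [-2.0 * mu * wi for wi in w]            # gradient of -mu * ||w||^2
    for x, y in labeled:
        p = sigmoid(dot(w, x))
        for d, xd in enumerate(x):
            g[d] += (y - p) * xd                # gradient of log P(y | x, w)
    for x in unlabeled:
        z = dot(w, x)
        p = sigmoid(z)
        for d, xd in enumerate(x):
            g[d] += lam * z * p * (1.0 - p) * xd  # gradient of -entropy
    return g

def fit(labeled, unlabeled, mu=0.01, lam=1.0, lr=0.1, steps=500):
    w = [0.0] * len(labeled[0][0])
    for _ in range(steps):
        g = gradient(w, labeled, unlabeled, mu, lam)
        w = [wi + lr * gi for wi, gi in zip(w, g)]  # ascent: we maximize
    return w

# Toy data: the second feature is a constant bias term.
labeled = [((1.0, 1.0), 1), ((-1.0, 1.0), 0)]
unlabeled = [(2.0, 1.0), (1.5, 1.0), (-2.0, 1.0), (-1.5, 1.0)]
w = fit(labeled, unlabeled)
```

The entropy gradient uses the identity dH/dz = -z p (1 - p) for binary entropy with p = sigmoid(z), so the unlabeled examples push the decision boundary away from themselves.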

Logistic regression with entropic regularization

Again, not a convex problem – need to do some sort of search to guess the labels for the unlabeled examples

Low entropy example: very confident in one class

High entropy example: high probability of being in either class

SEMI-SUPERVISED K-MEANS AND MIXTURE MODELS

k-means

[Figures: K-means Clustering, Steps 1 to 5]

Seeded k-means

Same as k-means, with these changes:
•  k is the number of classes
•  initialize the clusters using the labeled “seed” data
•  reassign points as usual, except keep the seeds in the class they are known to belong to
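A minimal sketch of the variant described above, in which seeds initialize the centers and never change class. The function names and the fixed iteration count are assumptions made for illustration.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def seeded_kmeans(seeds, unlabeled, iters=20):
    """seeds: list of (point, label); unlabeled: list of points.

    Returns (centers keyed by class label, predicted label per unlabeled point).
    """
    classes = sorted({y for _, y in seeds})      # k = number of classes
    # Initialize each center from the seeds of its class.
    centers = {c: mean([x for x, y in seeds if y == c]) for c in classes}
    pred = []
    for _ in range(iters):
        # Assignment step: only unlabeled points move; seeds keep their class.
        pred = [min(classes, key=lambda c: dist2(x, centers[c])) for x in unlabeled]
        # Update step: each center averages its seeds plus its assigned points.
        for c in classes:
            members = [x for x, y in seeds if y == c]
            members += [x for x, p in zip(unlabeled, pred) if p == c]
            centers[c] = mean(members)
    return centers, pred

seeds = [((0.0, 0.0), "neg"), ((4.0, 4.0), "pos")]
unlabeled = [(0.5, 0.5), (1.0, 0.0), (3.5, 4.5), (5.0, 4.0)]
centers, pred = seeded_kmeans(seeds, unlabeled)
```

Because every class has at least one seed, no cluster can go empty, and each cluster already carries a class name when the loop ends.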

[Figure: results on the 20 Newsgroups dataset, from Basu and Mooney, ICML 2002. Annotations on the plot:]
•  Unsupervised k-means
•  Seeded k-means (constrained k-means)
•  Just use labeled data to initialize the clusters
•  Some old baseline I’m not going to even talk about

Outline
•  The general idea and an example (NELL)
•  Some types of SSL
   – Margin-based: transductive SVM
      • Logistic regression with entropic regularization
   – Generative: seeded k-means
      •  Some recent extensions…
   – Nearest-neighbor like: graph-based SSL

Seeded k-means for hierarchical classification tasks

Simple extension:
1.  Don’t assign to one of K classes: instead make a decision about every class in the ontology
    •  example → {1,…,K} becomes example → 00010001 (a bit vector with one bit for each category)
2.  Pick “closest” bit vector consistent with constraints
    •  this is an (ontology-sized) optimization problem that you solve independently for each example
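Step 2 can be illustrated with a toy ontology. Everything specific here is hypothetical: the category names, the subset and mutual-exclusion constraints, the per-category scores, and the use of squared distance as the notion of “closest”. Brute-force enumeration is used only because the toy ontology is tiny; it is not the method the slides refer to.

```python
from itertools import product

def is_consistent(bits, cats, subset_of, exclusive):
    """Check one bit vector against the ontology constraints."""
    on = {c for c, b in zip(cats, bits) if b}
    # Subset constraint: turning on a child requires its parent.
    if any(child in on and parent not in on for child, parent in subset_of):
        return False
    # Mutual exclusion: the two categories cannot both be on.
    if any(a in on and b in on for a, b in exclusive):
        return False
    return True

def closest_consistent(scores, cats, subset_of, exclusive):
    """Return the consistent bit vector nearest (squared distance) to the scores."""
    best, best_d = None, float("inf")
    for bits in product((0, 1), repeat=len(cats)):
        if not is_consistent(bits, cats, subset_of, exclusive):
            continue
        d = sum((b - scores[c]) ** 2 for c, b in zip(cats, bits))
        if d < best_d:
            best, best_d = bits, d
    return best

# Hypothetical ontology and scores for a single example.
cats = ["Food", "Fruit", "Vegetable", "Organization", "Company"]
subset_of = [("Fruit", "Food"), ("Vegetable", "Food"), ("Company", "Organization")]
exclusive = [("Food", "Organization"), ("Fruit", "Vegetable")]
scores = {"Food": 0.4, "Fruit": 0.9, "Vegetable": 0.2,
          "Organization": 0.5, "Company": 0.6}
best = closest_consistent(scores, cats, subset_of, exclusive)
```

Rounding each score independently would turn on Fruit and Company but not Food, which violates the constraints; the constrained search instead settles on Food plus Fruit.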

Seeded k-means

Same as k-means, with these changes:
•  k is the number of classes
•  initialize the clusters using the labeled “seed” data
•  assign each example to the best consistent set of categories from the ontology, except keep the seeds in the classes they are known to belong to

Automatic Gloss Finding for a Knowledge Base

•  Glosses: Natural language definitions of named entities. E.g. “Microsoft” is an American multinational corporation headquartered in Redmond that develops, manufactures, licenses, supports and sells computer software, consumer electronics and personal computers and services ...
   –  Input: Knowledge Base, i.e. a set of concepts (e.g. company) and entities belonging to those concepts (e.g. Microsoft), and a set of potential glosses.
   –  Output: Candidate glosses matched to relevant entities in the KB. “Microsoft is an American multinational corporation headquartered in Redmond …” is mapped to entity “Microsoft” of type “Company”.

[Automatic Gloss Finding for a Knowledge Base using Ontological Constraints, Bhavana Dalvi Mishra, Einat Minkov, Partha Pratim Talukdar, and William W. Cohen, 2014, Under submission]

Example: Gloss finding

Training a clustering model
•  Labeled: Unambiguous
•  Unlabeled: Ambiguous

[Figure labels: Fruit, Company]

GLOFIN: Clustering glosses

GLOFIN on NELL Dataset

[Bar chart: Precision, Recall and F1 (axis from 0 to 80) for SVM, Label Propagation and GLOFIN]

275 categories, 247K candidate glosses, #train=20K, #test=227K

Outline
•  The general idea and an example (NELL)
•  Some types of SSL
   – Margin-based: transductive SVM
      • Logistic regression with entropic regularization
   – Generative: seeded k-means
   – Nearest-neighbor like: graph-based SSL

Harmonic fields – Ghahramani, Lafferty and Zhu

•  Idea: construct a graph connecting the most similar examples (k-NN graph)
•  Intuition: nearby points should have similar labels – labels should “propagate” through the graph
•  Formalization: try and minimize “energy” defined as:

[Figure: example graph with the observed labels marked. In this example y is a length-10 vector]
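The energy formula itself did not survive extraction. The form usually given for harmonic-field label propagation is sketched below; treat it as an assumption about what the slide showed, with w_{i,j} the edge weight between examples i and j and y_i the (real-valued) label of node i.

```latex
% Quadratic energy over the graph: large when strongly linked
% nodes are given different labels
E(y) \;=\; \frac{1}{2}\sum_{i,j} w_{i,j}\,\bigl(y_i - y_j\bigr)^2
\qquad\text{subject to } y_i = \text{observed label for every labeled node } i
```

Only the entries of y for unlabeled nodes are free; the observed labels stay clamped during the minimization.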

Harmonic fields – Ghahramani, Lafferty and Zhu

•  Result 1: at the minimal energy state, each node’s value is a weighted average of its neighbors’ values:
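The formula for Result 1 is missing from the extracted text. Assuming a quadratic energy of the form Σ w_{i,j} (y_i − y_j)² with symmetric weights (an assumption, since the slide's own formula is not recoverable), the result follows by setting the derivative with respect to each unlabeled y_i to zero:

```latex
% Stationarity condition for an unlabeled node i
\frac{\partial}{\partial y_i}\;\frac{1}{2}\sum_{i',j} w_{i',j}\,(y_{i'} - y_j)^2
  \;=\; 2\sum_{j} w_{i,j}\,(y_i - y_j) \;=\; 0
\quad\Longrightarrow\quad
y_i \;=\; \frac{\sum_j w_{i,j}\, y_j}{\sum_j w_{i,j}}
```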

“Harmonic field” LP algorithm
•  Result 2: you can reach the minimal energy state with a simple iterative algorithm:
   – Step 1: For each seed example (x_i, y_i):
      • Let V^0(i,c) = [| y_i = c |]
   – Step 2: for t = 0,…,T−1 (T is about 5):
      • Let V^{t+1}(i,c) = weighted average of V^t(j,c) for all j that are linked to i, and renormalize:

        $$V^{t+1}(i,c) = \frac{1}{Z} \sum_j w_{i,j} V^t(j,c)$$

      • For seeds, reset V^{t+1}(i,c) = [| y_i = c |]
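The iteration can be sketched as follows. Assumptions made for illustration: the graph is given as a nested dict of edge weights, non-seed nodes start at zero, and Z is chosen so that each node's class scores sum to 1 (the slide only says "renormalize"). The node and class names are made up.

```python
def label_propagation(weights, seeds, classes, T=5):
    """weights: dict node -> {neighbor: w_ij}; seeds: dict node -> class.

    Returns dict node -> {class: score} after T rounds.
    """
    def indicator(i):
        # [| y_i = c |] for a seed node i.
        return {c: 1.0 if seeds[i] == c else 0.0 for c in classes}

    # Step 1: seeds get indicator vectors; all other nodes start at zero.
    V = {i: indicator(i) if i in seeds else {c: 0.0 for c in classes}
         for i in weights}
    # Step 2: repeat the weighted-average update T times.
    for _ in range(T):
        new_V = {}
        for i, nbrs in weights.items():
            if i in seeds:
                new_V[i] = indicator(i)          # reset seeds every round
                continue
            raw = {c: sum(w * V[j][c] for j, w in nbrs.items()) for c in classes}
            Z = sum(raw.values())
            new_V[i] = {c: (raw[c] / Z if Z > 0 else 0.0) for c in classes}
        V = new_V
    return V

# Toy chain graph a - b - c - d with unit weights; a and d are seeds.
weights = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
seeds = {"a": "city", "d": "trait"}
V = label_propagation(weights, seeds, ["city", "trait"], T=5)
```

Each round reads only the previous round's values, matching the V^t on the right-hand side of the update.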

Harmonic fields – Ghahramani, Lafferty and Zhu

This family of techniques is called “Label propagation”

This experiment points out some of the issues with LP:
1.  What distance metric do you use?
2.  What energy function do you minimize?
3.  What is the right value for K in your K-NN graph? Is a K-NN graph right?
4.  If you have lots of data, how expensive is it to build the graph?

NELL: Uses Co-EM ~= HF

Extract cities:
•  Examples: Paris, Pittsburgh, Seattle, Cupertino, San Francisco, Austin, Berlin, denial, anxiety, selfishness
•  Features: mayor of arg1, live in arg1, arg1 is home of, traits such as arg1

Semi-Supervised Bootstrapped Learning via Label Propagation

[Figure: graph linking the examples (Paris, Pittsburgh, Seattle, San Francisco, Austin, anxiety, denial, selfishness, arrogance) to the features (mayor of arg1, live in arg1, arg1 is home of, traits such as arg1), with nodes marked as “near” seeds or “far from” seeds]

Information from other categories tells you “how far” (when to stop propagating)

Difference: graph construction is not instance-to-instance but instance-to-feature

[Figure: instance nodes Paris, San Francisco, Austin, Pittsburgh, Seattle, anxiety, denial, selfishness]

Some other general issues with SSL
•  How much unlabeled data do you want?
   – Suppose you’re optimizing J = JL(L) + JU(U)
   – If |U| >> |L| does JU dominate J?
      •  If so you’re basically just clustering
   – Often we need to balance JL and JU
•  Besides L, what other information about the task is useful (or necessary)?
   – Common choice: relative frequency of classes
   – Various ways of incorporating this into the optimization problem
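One possible way to write the two ideas above, given purely as an illustration: the trade-off weight λ, the per-example normalization, the class priors π_c, and the model parameters θ are all assumed notation, not something the slides define.

```latex
% Balancing the two terms: normalize each by its pool size, then weight J_U
J(\theta) \;=\; \frac{1}{|L|}\,J_L(L;\theta) \;+\; \lambda\,\frac{1}{|U|}\,J_U(U;\theta)

% Using relative class frequencies: ask that the average prediction
% on U roughly match a known class prior \pi_c
\frac{1}{|U|}\sum_{x_j \in U} P(y = c \mid x_j, \theta) \;\approx\; \pi_c
\qquad\text{for each class } c
```

With the normalization, growing U no longer inflates the unlabeled term automatically, and λ alone controls how close the method sits to pure clustering.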

Key and not-so-key points
•  The general idea: what is SSL and when do you want to use it?
   –  NELL as an example of SSL
•  Different SSL methods:
   –  margin-based approach: start with a supervised learner
      •  transductive SVM: what’s optimized and why
      •  logistic reg with entropic regularization
   –  k-means versus seeded k-means: start with clustering
      •  The core algorithm and what
      •  Extension to hierarchical case and GLOFIN
   –  nearest-neighbor like: graph-based SSL and LP
      •  The HF algorithm and the energy function being minimized