Active Learning and Optimized Information Gathering Lecture 7 – Learning Theory CS 101.2 Andreas Krause


Page 1: Active Learning and Optimized Information Gathering

Active Learning and Optimized Information Gathering

Lecture 7 – Learning Theory

CS 101.2

Andreas Krause

Page 2: Announcements

Project proposal: Due tomorrow, 1/27

Homework 1: Due Thursday 1/29

Any time is ok.

Office hours

Come to office hours before your presentation!

Andreas: Monday 3pm-4:30pm, 260 Jorgensen

Ryan: Wednesday 4:00-6:00pm, 109 Moore

Page 3: Recap: Bandit Problems

Bandit problems

Online optimization under limited feedback

Exploration—Exploitation dilemma

Algorithms with low regret:

ε-greedy, UCB1

Payoffs can be

Probabilistic

Adversarial (oblivious / adaptive)
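The recap above mentions UCB1 as a low-regret algorithm; below is a minimal sketch of it on Bernoulli-payoff arms (not taken from the slides; the payoff probabilities and horizon are made-up illustration values).

    import math, random

    def ucb1(payoff_probs, horizon=10000):
        """Play each arm once, then always pull the arm with the largest
        empirical mean plus confidence bonus (optimism under uncertainty)."""
        k = len(payoff_probs)
        counts = [0] * k          # pulls per arm
        sums = [0.0] * k          # total payoff per arm
        total = 0.0
        for t in range(1, horizon + 1):
            if t <= k:
                arm = t - 1       # initialization: try every arm once
            else:
                arm = max(range(k), key=lambda i: sums[i] / counts[i]
                          + math.sqrt(2 * math.log(t) / counts[i]))
            reward = 1.0 if random.random() < payoff_probs[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
            total += reward
        return total, counts

    print(ucb1([0.2, 0.5, 0.7]))  # most pulls should concentrate on the 0.7 arm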

Page 4: More complex bandits

Bandits with many arms

Online linear optimization (online shortest paths …)

X-armed bandits (Lipschitz mean payoff function)

Gaussian process optimization (Bayesian assumptions about mean payoffs)

Bandits with state

Contextual bandits

Reinforcement learning

Key tool: Optimism in the face of uncertainty

Page 5: Course outline

1. Online decision making

2. Statistical active learning

3. Combinatorial approaches

Page 6: Spam or Ham?

Labels are expensive (need to ask expert)

Which labels should we obtain to maximize classification accuracy?

[Figure: emails plotted in feature space (x1, x2); two points are labeled (+ = Spam, – = Ham), the rest are unlabeled (o), and a linear decision boundary separates the classes.]

label = sign(w0 + w1 x1 + w2 x2)

(linear separator)
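As a concrete reading of the rule above, here is a minimal sketch; the weight values w0, w1, w2 are hypothetical and chosen only for illustration.

    def classify(x1, x2, w0=-1.0, w1=0.5, w2=0.5):
        """Return +1 (spam) or -1 (ham) according to sign(w0 + w1*x1 + w2*x2)."""
        return +1 if w0 + w1 * x1 + w2 * x2 >= 0 else -1

    print(classify(3.0, 1.0))   # +1
    print(classify(0.5, 0.5))   # -1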

Page 7: Outline

Background in learning theory

Sample complexity

Key challenges

Heuristics for active learning

Principled algorithms for active learning

Page 8: Credit scoring

Credit score   Defaulted?
50             ???
82             0
36             1
42             1
70             0

Want a decision rule that performs well for unseen examples (generalization)

Page 9: More general: Concept learning

Set X of instances

True concept c: X → {0,1}

Hypothesis h: X → {0,1}

Hypothesis space H = {h1, …, hn, …}

Want to pick a good hypothesis (one that agrees with the true concept on most instances)

Page 10: Example: Binary thresholds

Input domain: X = {1, 2, …, 100}

True concept c:

c(x) = +1 if x≥ t

c(x) = -1 if x < t

[Figure: instances 1 to 100 on a number line; points below the threshold t are labeled -, points at or above t are labeled +.]

Page 11: How good is a hypothesis?

Set X of instances, concept c: X → {0,1}

Hypothesis h: X → {0,1}, H = {h1, …, hn, …}

Distribution PX over X

errortrue(h) = Ex∼PX[ h(x) ≠ c(x) ]

Want h* = argminh∈ H errortrue(h)

Can’t compute errortrue(h)!

Page 12: Concept learning

Data set D = {(x1,y1),…,(xN,yN)}, xi ∈ X, yi ∈ {0,1}

Assume xi drawn independently from PX; yi = c(xi)

Also assume c ∈ H

h is consistent with D if h(xi) = yi for all i

More data ⇒ fewer consistent hypotheses

Learning strategy:

Collect “enough” data

Output consistent hypothesis h

Hope that errortrue(h) is small
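A small sketch of this strategy for the binary-threshold class from the earlier example; the labeled points below are made up (consistent with a true threshold around 40) and only illustrate how the set of consistent hypotheses shrinks as data accumulates.

    def consistent_thresholds(data, domain=range(1, 101)):
        """Thresholds t whose hypothesis h_t(x) = +1 iff x >= t fits every example."""
        return [t for t in domain
                if all((x >= t) == (y == +1) for x, y in data)]

    data = [(10, -1), (75, +1), (30, -1), (55, +1), (42, +1)]
    for i in range(1, len(data) + 1):
        print(i, len(consistent_thresholds(data[:i])))   # count shrinks as data grows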

Page 13: Sample complexity

Let ε > 0

How many samples do we need s.t. all consistent hypotheses have error < ε?

Def: h ∈ H is bad if errortrue(h) > ε

Suppose h ∈ H is bad. Let x ∼ PX, y = c(x).

Then: P( h(x) = y ) ≤ 1 − ε

Page 14: Sample complexity

P( h bad and “survives” 1 data point ) ≤ 1 − ε

P( h bad and “survives” n data points ) ≤ (1 − ε)^n

P( remains ≥ 1 bad h after n data points ) ≤ |H| (1 − ε)^n ≤ |H| exp(−ε n)   (union bound)

Page 15: Probability of bad hypothesis

Page 16: Sample complexity for finite hypothesis spaces [Haussler ’88]

Theorem: Suppose

|H| < ∞,

Data set |D|=n drawn i.i.d. from PX (no noise)

0 < ε < 1

Then for any h ∈ H consistent with D:

P( errortrue(h) > ε) ≤ |H| exp(-ε n)

“PAC-bound” (probably approximately correct)

Page 17: How can we use this result?

P( errortrue(h) ≥ ε ) ≤ |H| exp(-ε n) = δ

Possibilities:

Given δ and n, solve for ε

Given ε and δ, solve for n

(Given ε, n, solve for δ)

Page 18: Example: Credit scoring

X = {1, 2, …, 1000}

H = binary thresholds on X

|H| =

Want error ≤ 0.01 with probability .999

Need n ≥ 1382 samples
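The number on this slide can be checked against the noiseless PAC bound n ≥ (1/ε)(ln |H| + ln 1/δ); treating |H| as roughly 1000 (about one threshold per possible score) is an assumption made only for this check.

    import math

    H_size = 1000      # assumed number of threshold hypotheses
    eps = 0.01         # target error
    delta = 0.001      # failure probability (success probability 0.999)
    n = math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)
    print(n)           # 1382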

Page 19: Limitations

How do we find a consistent hypothesis?

What if |H| = ∞?

What if there’s noise in the data? (or c ∉ H)

Page 20: Credit scoring

Credit score   Defaulted?
44             ???
81             0
70             0
52             1
48             0
36             1

No binary threshold function explains this data with 0 error.

Page 21: Noisy data

Sets of instances X and labels Y = {0,1}

Suppose (X,Y)∼ PXY

Hypothesis space H

errortrue(h) = Ex,y[ |h(x) − y| ]

Want to find

argminh∈ H errortrue(h)

Page 22: Learning from noisy data

Suppose D = {(x1,y1),…,(xn,yn)} where (xi,yi) ∼ PX,Y

errortrain(h) = (1/n) ∑i |h(xi) − yi|

Learning strategy with noisy data

Collect “enough“ data

Output h’ = argminh∈ H errortrain(h)

Hope that errortrue(h’) ≈minh ∈H errortrue(h)

Page 23: Estimating error

How many samples do we need to accurately estimate the true error?

Data set D = {(x1,y1),…,(xn,yn)} where (xi,yi) ∼ PX,Y

zi = |h(xi) − yi| ∈ {0,1}

zi are i.i.d. samples from Bernoulli RV Z = |h(X) - Y|

errortrain(h) = (1/n) ∑i zi

errortrue(h) = E[Z]

How many samples s.t. |errortrain(h) – errortrue(h)| is small??

Page 24: Estimating error

How many samples do we need to accurately estimate the true error?

Applying Chernoff-Hoeffding bound:

P( |errortrue(h) – errortrain(h)| ≥ ε ) ≤ 2 exp(−2n ε²)
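A quick simulation of this concentration statement (not from the slides): for a fixed hypothesis with an assumed true error of 0.3, the empirical deviation probability is compared to the two-sided Hoeffding bound 2·exp(−2nε²).

    import math, random

    def deviation_prob(n, eps=0.1, true_err=0.3, trials=20000):
        """Estimate P(|errortrain - errortrue| >= eps) for a sample of size n."""
        bad = 0
        for _ in range(trials):
            train_err = sum(random.random() < true_err for _ in range(n)) / n
            if abs(train_err - true_err) >= eps:
                bad += 1
        return bad / trials

    for n in (10, 50, 200):
        print(n, deviation_prob(n), 2 * math.exp(-2 * n * 0.1 ** 2))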

Page 25: Sample complexity with noise

Call h ∈ H bad if

errortrue(h) > errortrain(h) + ε

P( h bad “survives” n training examples ) ≤ exp(−2n ε²)

P( remains ≥ 1 bad h after n examples ) ≤ |H| exp(−2n ε²)

Page 26: PAC Bound for noisy data

Theorem: Suppose

|H| < ∞,

Data set |D|=n drawn i.i.d. from PXY

0 < δ < 1

Then, with probability at least 1 − δ, for all h ∈ H:

errortrue(h) ≤ errortrain(h) + √( (log |H| + log 1/δ) / (2n) )

Page 27: PAC Bounds: Noise vs. no noise

Want error ≤ ε with probability ≥ 1 − δ

No noise: n ≥ 1/ε ( log |H| + log 1/δ )

Noise: n ≥ 1/ε² ( log |H| + log 1/δ )
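To see how different the two requirements are numerically, the sketch below evaluates both expressions for a few values of ε; |H| = 1000 and δ = 0.001 are the same assumed numbers as in the earlier credit-scoring example.

    import math

    H_size, delta = 1000, 0.001
    log_term = math.log(H_size) + math.log(1 / delta)
    for eps in (0.1, 0.01, 0.001):
        n_noiseless = math.ceil(log_term / eps)        # no noise: 1/eps scaling
        n_noisy = math.ceil(log_term / eps ** 2)       # noise: 1/eps^2 scaling
        print(eps, n_noiseless, n_noisy)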

Page 28: Limitations

How do we find a consistent hypothesis?

What if |H| = ∞?

What if there’s noise in the data? (or c ∉ H)

Page 29: Credit scoring

Credit score   Defaulted?
44.3141        ???
81.3321        0
70.1111        0
52.3847        1
48.7983        1
36.1200        1

Want to classify continuous instance space

|H| = ∞

Page 30: Large hypothesis spaces

Idea: Labels of a few data points imply the labels of many unlabeled data points

Page 31: How many points can be arbitrarily classified using binary thresholds?

Page 32: How many points can be arbitrarily classified using linear separators? (1D)

Page 33: How many points can be arbitrarily classified using linear separators? (2D)

Page 34: VC dimension

Let S ⊆ X be a set of instances

A Dichotomy is a nontrivial partition of S = S1 ∪ S0

S is shattered by hypothesis space H if for any

dichotomy, there exists a consistent hypothesis h

(i.e., h(x)=1 if x∈ S1 and h(x)=0 if x∈ S0)

The VC (Vapnik-Chervonenkis) dimension VC(H) of H

is the size of the largest set S shattered by H

(possibly ∞)

VC(H) ≤ log2 |H| for finite H
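A brute-force way to explore these definitions (an illustration, not from the slides): for a small point set, check every dichotomy for realizability by a linear separator via an LP feasibility test. This assumes numpy and scipy are available.

    from itertools import product
    import numpy as np
    from scipy.optimize import linprog

    def linearly_separable(points, labels):
        """Is there (w, b) with sign(w.x + b) matching the +1/-1 labels?"""
        # Feasibility of y_i (w.x_i + b) >= 1, written as A_ub z <= b_ub for z = (w, b).
        signs = np.where(np.array(labels) == 1, 1.0, -1.0)
        X = np.hstack([np.array(points, float), np.ones((len(points), 1))])
        A_ub = -signs[:, None] * X
        b_ub = -np.ones(len(points))
        res = linprog(c=np.zeros(X.shape[1]), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * X.shape[1], method="highs")
        return res.success

    def shattered(points):
        """True iff every dichotomy of the points is realized by some linear separator."""
        return all(linearly_separable(points, labeling)
                   for labeling in product([1, -1], repeat=len(points)))

    print(shattered([(0, 0), (1, 0), (0, 1)]))          # True: 3 points in general position
    print(shattered([(0, 0), (1, 0), (2, 0), (3, 0)]))  # False: no 4 points are shattered in 2D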

Page 35: VC Generalization bound

Bound for finite hypothesis spaces:

errortrue(h) ≤ errortrain(h) + √( (log |H| + log 1/δ) / (2n) )

VC-dimension based bound:

errortrue(h) ≤ errortrain(h) + √( ( VC(H) (1 + log(2n / VC(H))) + log(4/δ) ) / n )

Page 36: Applications

Allows us to prove generalization bounds for large hypothesis spaces with structure.

For many popular hypothesis classes, the VC dimension is known:

Binary thresholds

Linear classifiers

Decision trees

Neural networks

Page 37: Passive learning protocol

Data source PX,Y (produces inputs xi and labels yi)

Data set Dn = {(x1,y1),…,(xn,yn)}

Learner outputs hypothesis h

errortrue(h) = Ex,y |h(x) − y|

Page 38: From passive to active learning

[Figure: the same spam/ham scatter in feature space; only two points are labeled (+ = Spam, – = Ham), the rest are unlabeled (o).]

Some labels “more informative” than others

Page 39: Statistical passive/active learning protocol

Data source PX (produces inputs xi)

Data set Dn = {(x1,y1),…,(xn,yn)}

Learner outputs hypothesis h

errortrue(h) = Ex~P[h(x) ≠ c(x)]

Active learner assembles Dn by selectively obtaining labels

Page 40: Passive learning

Input domain: D = [0,1]

True concept c:

c(x) = +1 if x≥ t

c(x) = -1 if x < t

Passive learning:

Acquire all labels yi ∈ {+, −}

[Figure: threshold t on the interval [0, 1]; every point is labeled.]

Page 41: Active learning

Input domain: D = [0,1]

True concept c:

c(x) = +1 if x≥ t

c(x) = -1 if x < t

Passive learning:

Acquire all labels yi ∈ {+, −}

Active learning:

Decide which labels to obtain

[Figure: threshold t on the interval [0, 1]; only selected points are labeled.]

Page 42: Comparison

Active learning can exponentially reduce the number of required labels!

Labels needed to learn with classification error ε:

Passive learning: Ω(1/ε)
Active learning: O(log 1/ε)
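The exponential gap above can be made concrete for the threshold class on [0,1]: draw about 1/ε unlabeled points and binary-search for the boundary, so only about log2(1/ε) labels are queried. The true threshold 0.37 below is a made-up illustration value, and for simplicity the sketch assumes the smallest pool point is negative and the largest is positive.

    import random

    def active_learn_threshold(true_t=0.37, eps=0.01):
        pool = sorted(random.random() for _ in range(int(1 / eps)))   # ~1/eps unlabeled points
        labels_used = 0
        lo, hi = 0, len(pool) - 1      # assumed: pool[lo] labeled -, pool[hi] labeled +
        while hi - lo > 1:
            mid = (lo + hi) // 2
            labels_used += 1
            if pool[mid] >= true_t:    # query the label of pool[mid]
                hi = mid
            else:
                lo = mid
        return (pool[lo] + pool[hi]) / 2, labels_used

    print(active_learn_threshold())    # threshold near 0.37 using only ~log2(100) ≈ 7 labels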

Page 43: Key challenges

The PAC bounds we’ve seen so far crucially depend on i.i.d. data!

Actively assembling data set causes bias!

If we’re not careful, active learning can do worse!

Page 44: What you need to know

Concepts, hypotheses

PAC bounds (probably approximately correct)

For noiseless (“realizable”) case

For noisy (“unrealizable”) case

VC dimension

Active learning protocol