
Page 1: Computational Learning Theory: An Analysis of a Conjunction Learner

Machine Learning, Spring 2018

The slides are mainly from Vivek Srikumar

Page 2: This section

1. Analyze a simple algorithm for learning conjunctions

2. Define the PAC model of learning

3. Make formal connections to the principle of Occam’s razor


Pages 3-7: Learning Conjunctions

Training data
– <(1,1,1,1,1,1,…,1,1), 1>
– <(1,1,1,0,0,0,…,0,0), 0>
– <(1,1,1,1,1,0,...0,1,1), 1>
– <(1,0,1,1,1,0,...0,1,1), 0>
– <(1,1,1,1,1,0,...0,0,1), 1>
– <(1,0,1,0,0,0,...0,1,1), 0>
– <(1,1,1,1,1,1,…,0,1), 1>
– <(0,1,0,1,0,0,...0,1,1), 0>

The true function: f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

A simple learning algorithm (Elimination):
• Discard all negative examples
• Build a conjunction using the features that are common to all positive examples

The learned hypothesis: h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

Positive examples eliminate irrelevant features.

Clearly this algorithm produces a conjunction that is consistent with the data, that is errS(h) = 0, if the target function is a monotone conjunction.
Exercise: Why?
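As an illustration (not part of the original slides), here is a minimal Python sketch of the Elimination learner; the function names and the tiny data set are assumptions made for this example.

    # Minimal sketch of the Elimination learner for monotone conjunctions.
    # An example is a tuple of 0/1 features; a hypothesis is the set of
    # feature indices kept in the learned conjunction.

    def eliminate(training_data):
        """training_data: list of (features, label) pairs; label 1 = positive.
        Assumes at least one positive example is present."""
        positives = [x for x, y in training_data if y == 1]   # discard negatives
        n = len(positives[0])
        # keep exactly the features that are 1 in every positive example
        return {i for i in range(n) if all(x[i] == 1 for x in positives)}

    def predict(hypothesis, x):
        """The conjunction is satisfied iff every kept feature is present."""
        return int(all(x[i] == 1 for i in hypothesis))

    # Tiny usage example on 5 features (0-indexed); here the positives happen
    # to share exactly features 0 and 2, so those are what survives.
    data = [((1, 1, 1, 0, 1), 1), ((0, 1, 0, 1, 1), 0), ((1, 0, 1, 1, 0), 1)]
    h = eliminate(data)
    print(h)                             # {0, 2}
    print(predict(h, (1, 0, 1, 0, 0)))   # 1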

Pages 8-11: Learning Conjunctions: Analysis

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

Claim 1: h will only make mistakes on future positive examples.
Why?

A mistake will occur only if some feature z (in our example, x1) is present in h but not in f. Such a feature can cause a positive example to be predicted as negative by h.

The reverse situation can never happen: every feature of f also appears in h, because a relevant feature is never 0 in a positive training example and so is never eliminated. Therefore, whenever h predicts positive, every relevant feature is present and f labels the example positive as well.

Specifically, h errs on examples like: x1 = 0, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x100 = 1.

[Figure: h covers a subset of the region that f labels positive; the examples that f labels positive but h labels negative are the only possible mistakes.]
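To make Claim 1 concrete, here is a small illustrative check (an assumption of this write-up, not from the slides): on a tiny feature space it confirms that every disagreement between h and f is an example that f labels positive and h labels negative.

    from itertools import product

    # Illustrative check of Claim 1 on n = 4 features.
    # Target f = x0 AND x2; h conjoins one extra feature (x1), mimicking an
    # irrelevant feature that survived elimination.
    f = {0, 2}
    h = {0, 1, 2}

    def predict(conj, x):
        return int(all(x[i] == 1 for i in conj))

    for x in product((0, 1), repeat=4):
        if predict(h, x) != predict(f, x):
            # every mismatch is: f says positive, h says negative
            assert predict(f, x) == 1 and predict(h, x) == 0
            print(x, "f:", predict(f, x), "h:", predict(h, x))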

Pages 12-14: Learning Conjunctions: Analysis

Theorem: Suppose we are learning a conjunctive concept with n-dimensional Boolean features using m training examples. If

    m > (n/ε) (ln n + ln(1/δ)),

then, with probability > 1 − δ, the error of the learned hypothesis errD(h) will be less than ε.

If we see this many training examples, then the algorithm will produce a conjunction that, with high probability, will make few errors.

Note that this sample size is polynomial in n, 1/δ, and 1/ε.

Let's prove this assertion.

Pages 15-19: Proof Intuition

What kinds of examples would drive the hypothesis to make a mistake?

Positive examples where x1 is absent: f would say true and h would say false.

None of these examples appeared during training; otherwise x1 would have been eliminated.

If they never appeared during training, maybe their appearance in the future will also be rare!

Let's quantify our surprise at seeing such examples.

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

Pages 20-24: Learning Conjunctions: Analysis

Let p(z) be the probability that, in an example drawn from D, the feature z is 0 but the example has a positive label.

• That is, after training is done, p(z) is the probability that the feature z causes a mistake on a randomly drawn example.
• For any z in the target function, p(z) = 0.

Remember that there will only be mistakes on positive examples for this toy problem.

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

Example: <(0,1,1,1,1,0,...0,1,1), 1>
p(x1) is the probability that this situation occurs: x1 is 0 but the label is positive, so h predicts negative while f says positive.

Pages 25-27: Learning Conjunctions: Analysis

We know that

    errD(h) ≤ Σ_{z ∈ h} p(z)

via direct application of the union bound.

Union bound: for a set of events, the probability that at least one of them happens is at most the sum of the probabilities of the individual events.
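To spell out how the union bound is applied here (a reconstruction of the step, not text from the slides): by Claim 1, h makes a mistake exactly when some feature kept in h is 0 in a positive example, so

    \mathrm{err}_D(h)
      = \Pr_{x \sim D}\bigl[\exists\, z \in h:\ z = 0 \text{ in } x \text{ and } f(x) = 1\bigr]
      \le \sum_{z \in h} \Pr_{x \sim D}\bigl[z = 0 \text{ in } x \text{ and } f(x) = 1\bigr]
      = \sum_{z \in h} p(z).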

Pages 28-30: Learning Conjunctions: Analysis

• Call a feature z bad if p(z) ≥ ε/n (where n = dimensionality).
• Intuitively, a bad feature is one that has a significant probability of not appearing with a positive example.
  – (And, if it appears in all positive training examples, it can cause errors.)

If there are no bad features, then errD(h) < ε.
– Why? Because errD(h) ≤ Σ_{z ∈ h} p(z) < n · (ε/n) = ε.

Let us try to see when this will not happen.

Pages 31-34: Learning Conjunctions: Analysis

What if there are bad features?

Let z be a bad feature. What is the probability that it will not be eliminated by one training example?

A single training example eliminates z exactly when z is 0 in it and the label is positive, which happens with probability p(z) ≥ ε/n. So

    Pr(a bad feature is not eliminated by one example) ≤ 1 − ε/n

<(1,1,1,1,1,0,...0,1,1), 1> — there was one example of this kind in the training data.

Pages 35-37: Learning Conjunctions: Analysis

What we know so far:

    Pr(a bad feature is not eliminated by one example) ≤ 1 − ε/n

But say we have m training examples, drawn independently. Then

    Pr(a bad feature survives m examples) ≤ (1 − ε/n)^m

There are at most n bad features. So, by the union bound,

    Pr(any bad feature survives m examples) ≤ n (1 − ε/n)^m
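As a quick illustration (not from the slides; the particular numbers are just an example), one can compute how fast the bound n(1 − ε/n)^m shrinks as m grows:

    # Evaluate the failure bound n * (1 - eps/n)**m for a few sample sizes.
    # n, eps, and the chosen m values are illustrative, not from the slides.
    n, eps = 100, 0.1

    for m in (1000, 3000, 6908):
        bound = n * (1 - eps / n) ** m
        print(m, bound)
    # The bound drops below delta = 0.1 once m reaches roughly 6908 examples.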

Pages 38-41: Learning Conjunctions: Analysis

We want this probability to be small.

Why? So that we can choose enough training examples that the probability that any bad feature z survives all of them is less than some δ.

That is, we want

    n (1 − ε/n)^m < δ

We know that 1 − x < e^(−x). So it is sufficient to require

    n e^(−mε/n) < δ

Or equivalently,

    m > (n/ε) (ln n + ln(1/δ))
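The algebra behind "or equivalently" (reconstructed here, since the slide image did not survive extraction): take logarithms of n e^(−mε/n) < δ and solve for m.

    n e^{-m\epsilon/n} < \delta
    \;\Longleftrightarrow\;
    \ln n - \frac{m\epsilon}{n} < \ln \delta
    \;\Longleftrightarrow\;
    \frac{m\epsilon}{n} > \ln n + \ln\frac{1}{\delta}
    \;\Longleftrightarrow\;
    m > \frac{n}{\epsilon}\left(\ln n + \ln\frac{1}{\delta}\right).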

Pages 42-46: Learning Conjunctions: Analysis

To guarantee that the probability of failure (i.e., of error > ε) is less than δ, the number of examples we need is

    m > (n/ε) (ln n + ln(1/δ))

That is, if m has this property, then
• with probability 1 − δ, no bad feature survives in h,
• or equivalently, with probability 1 − δ, we will have errD(h) < ε.

This sample size is polynomial in n, 1/δ, and 1/ε.

What does this mean?
• If ε = 0.1 and δ = 0.1, then for n = 100, we need 6908 training examples.
• If ε = 0.1 and δ = 0.1, then for n = 10, we need only 461 examples.
• If ε = 0.1 and δ = 0.01, then for n = 10, we need 691 examples.

What we have here is a PAC guarantee: our algorithm is Probably Approximately Correct.
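A small calculator (an illustration, not from the slides) that reproduces the sample sizes quoted above from the bound m > (n/ε)(ln n + ln(1/δ)):

    import math

    def sample_size(n, eps, delta):
        """Smallest integer m satisfying m > (n/eps) * (ln n + ln(1/delta))."""
        return math.floor((n / eps) * (math.log(n) + math.log(1 / delta))) + 1

    print(sample_size(100, 0.1, 0.1))   # 6908
    print(sample_size(10, 0.1, 0.1))    # 461
    print(sample_size(10, 0.1, 0.01))   # 691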