Cooperating Intelligent Systems: Learning from observations (Chapter 18, AIMA)


Page 1: Cooperating Intelligent Systems

Cooperating Intelligent Systems

Learning from observations
Chapter 18, AIMA

Page 2: Cooperating Intelligent Systems

Two types of learning in AI

Deductive: Deduce rules/facts from already known rules/facts. (We have already dealt with this)

Inductive: Learn new rules/facts from a data set D.

$A \Rightarrow B,\ B \Rightarrow C\ \vdash\ A \Rightarrow C$ (deduction)

$\mathcal{D} = \{(\mathbf{x}_n, y_n)\},\ n = 1, \dots, N\ \vdash\ A \Rightarrow C$ (induction)

We will be dealing with the latter, inductive learning, now.

Page 3: Cooperating Intelligent Systems

Two types of inductive learning

Supervised: The machine has access to a teacher who corrects it.

Unsupervised: No access to teacher. Instead, the machine must search for “order” and “structure” in the environment.

Page 4: Cooperating Intelligent Systems

Inductive learning – example A

• f(x) is the target function
• An example is a pair [x, f(x)]
• Learning task: find a hypothesis h such that h(x) ≈ f(x), given a training set of examples D = {[xᵢ, f(xᵢ)]}, i = 1, 2, …, N

[Figure: example inputs x, given as bit strings, with labels f(x) = 1, f(x) = 1, f(x) = 0, etc.]

Inspired by a slide from V. Pavlovic

Page 5: Cooperating Intelligent Systems

Inductive learning – example B

• Construct h so that it agrees with f.
• The hypothesis h is consistent if it agrees with f on all observations.
• Ockham’s razor: Select the simplest consistent hypothesis.
• How do we achieve good generalization?

[Figure panels: consistent linear fit; consistent 7th-order polynomial fit; inconsistent linear fit; consistent 6th-order polynomial fit; consistent sinusoidal fit.]

Page 6: Cooperating Intelligent Systems

Inductive learning – example C

[Figure: data points in the (x, y) plane.]

Example from V. Pavlovic @ Rutgers


Page 9: Cooperating Intelligent Systems

Inductive learning – example C

[Figure: the same (x, y) data with a candidate fit.]

Example from V. Pavlovic @ Rutgers

Sometimes a consistent hypothesis is worse than an inconsistent one.

Page 10: Cooperating Intelligent Systems

The idealized inductive learning problem

[Figure: our hypothesis space $\mathcal{H}$ containing $h_{opt}(x) \in \mathcal{H}$, the target f(x) outside it, and the error between them.]

Find an appropriate hypothesis space H and find $h(x) \in \mathcal{H}$ with minimum “distance” to f(x) (the “error”).

The learning problem is realizable if $f(x) \in \mathcal{H}$.

Page 11: Cooperating Intelligent Systems

The real inductive learning problem

Data is never noise-free and never available in infinite amounts, so we get variation in both data and model. The generalization error is a function of both the training data and the hypothesis selection method.

Find an appropriate hypothesis space H and minimize the expected distance to f(x) (the “generalization error” $E_{gen}$).

[Figure: the ensemble of targets {f(x)}, the ensemble of selected hypotheses {$h_{opt}(x)$}, and the generalization error $E_{gen}$ between them.]

Page 12: Cooperating Intelligent Systems

Hypothesis spaces (examples)

$f(x) = 0.5 + x + x^2 + 6x^3$

$\mathcal{H}_1 = \{a + bx\}$: linear
$\mathcal{H}_2 = \{a + bx + cx^2\}$: quadratic
$\mathcal{H}_3 = \{a + bx + cx^2 + dx^3\}$: cubic
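As an illustration of these nested hypothesis spaces, here is a minimal sketch in Python (the sample points and noise level are assumptions for the demo, not from the slides): it fits $\mathcal{H}_1$, $\mathcal{H}_2$, and $\mathcal{H}_3$ to noisy samples of f(x) and compares the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
f = 0.5 + x + x**2 + 6 * x**3                # the target function above
y = f + rng.normal(scale=0.1, size=x.shape)  # noisy observations

for degree in (1, 2, 3):               # H1 (linear), H2, H3 (cubic)
    coeffs = np.polyfit(x, y, degree)  # least-squares fit in H_degree
    sse = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(f"H{degree}: sum of squared residuals = {sse:.4f}")
```

Only $\mathcal{H}_3$ contains the target, so it is the smallest realizable space here.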

Page 13: Cooperating Intelligent Systems

Learning problems

• The hypothesis takes as input a set of attributes x and returns a ”decision” h(x), the predicted (estimated) output value for the input x.

• Discrete-valued function ⇒ classification
• Continuous-valued function ⇒ regression

Page 14: Cooperating Intelligent Systems

Classification

Order into one out of several classes:

$g: X \subseteq \mathbb{R}^D \rightarrow C$

Input space: $\mathbf{x} = (x_1, x_2, \dots, x_D)^T \in X \subseteq \mathbb{R}^D$

Output (category) space: $\mathbf{c} = (c_1, c_2, \dots, c_K)^T \in C$, with $c_k \in \{0, 1\}$
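A small illustration of this output convention (the one-hot encoding is an assumption consistent with the $c_k \in \{0, 1\}$ components above; the slides do not spell it out):

```python
import numpy as np

def one_hot(k, K):
    """Category k as a K-vector c with components c_k in {0, 1}."""
    c = np.zeros(K, dtype=int)
    c[k] = 1
    return c

print(one_hot(2, 5))  # [0 0 1 0 0]
```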

Page 15: Cooperating Intelligent Systems

Example: Robot color vision

Classify the Lego pieces into red, blue, and yellow.
Classify white balls, black sideboard, and green carpet.
Input = pixel in image; output = category.

Page 16: Cooperating Intelligent Systems

Regression

The “fixed regressor model”:

$f(\mathbf{x}) = g(\mathbf{x}) + \varepsilon$

x: observed input
f(x): observed output
g(x): true underlying function
ε: i.i.d. noise process with zero mean
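A minimal sketch of sampling from this model (the particular g, noise level, and input distribution are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def g(x):
    return np.sin(2 * np.pi * x)      # assumed true underlying function

x = rng.uniform(0.0, 1.0, size=100)   # observed inputs
eps = rng.normal(0.0, 0.1, size=100)  # i.i.d. zero-mean noise
f = g(x) + eps                        # observed outputs f(x) = g(x) + eps
```

A regression method then tries to recover g from the observed pairs (x, f(x)).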

Page 17: Cooperating Intelligent Systems

Example: Predict price for cotton futures

Input: past history of closing prices and trading volume.

Output: predicted closing price.

Page 18: Cooperating Intelligent Systems

Decision trees

• “Divide and conquer”: Split data into smaller and smaller subsets.

• Splits usually on a single variable

[Diagram: the root node tests x1 > ? and branches on yes/no; each branch then tests x2 > ? and branches on yes/no again.]

Page 19: Cooperating Intelligent Systems

The wait@restaurant decision tree

This is our true function. Can we learn this tree from examples?


Page 22: Cooperating Intelligent Systems

Inductive learning of decision trees

• Simplest: Construct a decision tree with one leaf for every example. This is memory-based learning, with poor generalization.

• Advanced: Split on each variable so that the purity of each split increases (i.e., each subset contains as close to only yes or only no examples as possible).

• Purity is measured, e.g., with entropy:

$\text{Entropy} = -P(\text{yes}) \log P(\text{yes}) - P(\text{no}) \log P(\text{no})$

General form: $\text{Entropy} = -\sum_i P(v_i) \log P(v_i)$

(The worked examples on the following slides use base-10 logarithms, so an even yes/no split has entropy $\log 2 \approx 0.30$.)
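A direct transcription of this formula as a helper function (base-10 logarithms, to match the numbers on the following slides):

```python
import math

def entropy(probs):
    """Entropy = -sum_i P(v_i) log P(v_i), with 0 log 0 taken as 0."""
    return -sum(p * math.log10(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # ~0.301: maximal for an even binary split
print(entropy([1.0]))       # 0.0: a pure node
```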

Page 23: Cooperating Intelligent Systems

The entropy is maximal when all possibilities are equally likely.

The goal of the decision tree is to decrease the entropy in each node.

Entropy is zero in a pure ”yes” node (or pure ”no” node).

The second law of thermodynamics: Elements in a closed system tend to seek their most probable distribution; in a closed system, entropy always increases.

Entropy is a measure of ”order” in a system.

Page 24: Cooperating Intelligent Systems

Decision tree learning algorithm

• Create pure nodes whenever possible.
• If pure nodes are not possible, choose the split that leads to the largest decrease in entropy, as sketched below.
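A sketch of this selection rule (the `(n_yes, n_no)` count layout is an assumption for the demo; `entropy` repeats the helper from the previous sketch):

```python
import math

def entropy(probs):
    """As on the earlier slide (base-10 logs, 0 log 0 = 0)."""
    return -sum(p * math.log10(p) for p in probs if p > 0)

def split_entropy(subsets):
    """Weighted entropy after a split; subsets = [(n_yes, n_no), ...]."""
    total = sum(y + n for y, n in subsets)
    return sum((y + n) / total * entropy([y / (y + n), n / (y + n)])
               for y, n in subsets if y + n > 0)

def best_attribute(candidate_splits):
    """Pick the attribute with the largest entropy decrease.

    Since the parent entropy is the same for every candidate, this is
    the attribute with the smallest weighted post-split entropy.

    candidate_splits: {attribute_name: [(n_yes, n_no), ...]}
    """
    return min(candidate_splits, key=lambda a: split_entropy(candidate_splits[a]))
```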

Page 25: Cooperating Intelligent Systems

Decision tree learning example

10 attributes:
1. Alternate: Is there a suitable alternative restaurant nearby? {yes, no}
2. Bar: Is there a bar to wait in? {yes, no}
3. Fri/Sat: Is it Friday or Saturday? {yes, no}
4. Hungry: Are you hungry? {yes, no}
5. Patrons: How many are seated in the restaurant? {none, some, full}
6. Price: Price level {$, $$, $$$}
7. Raining: Is it raining? {yes, no}
8. Reservation: Did you make a reservation? {yes, no}
9. Type: Type of food {French, Italian, Thai, Burger}
10. Wait: Estimated waiting time {0–10 min, 10–30 min, 30–60 min, >60 min}

Page 26: Cooperating Intelligent Systems

Decision tree learning example

T = True, F = False. The full training set has 6 True and 6 False examples:

$\text{Entropy} = -\frac{6}{12}\log\frac{6}{12} - \frac{6}{12}\log\frac{6}{12} \approx 0.30$
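A quick, self-contained check of this arithmetic (base-10 logs, as throughout):

```python
import math

# Root entropy of the 12-example set: 6 True, 6 False.
p_true, p_false = 6 / 12, 6 / 12
print(-p_true * math.log10(p_true) - p_false * math.log10(p_false))  # ~0.301
```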

Page 27: Cooperating Intelligent Systems

Decision tree learning example

Alternate?
Yes: 3 T, 3 F. No: 3 T, 3 F.

$\text{Entropy} = \frac{6}{12}\left(-\frac{3}{6}\log\frac{3}{6} - \frac{3}{6}\log\frac{3}{6}\right) + \frac{6}{12}\left(-\frac{3}{6}\log\frac{3}{6} - \frac{3}{6}\log\frac{3}{6}\right) \approx 0.30$

Entropy decrease = 0.30 – 0.30 = 0

Page 28: Cooperating Intelligent Systems

Decision tree learning example

Bar?
Yes: 3 T, 3 F. No: 3 T, 3 F.

$\text{Entropy} = \frac{6}{12}\left(-\frac{3}{6}\log\frac{3}{6} - \frac{3}{6}\log\frac{3}{6}\right) + \frac{6}{12}\left(-\frac{3}{6}\log\frac{3}{6} - \frac{3}{6}\log\frac{3}{6}\right) \approx 0.30$

Entropy decrease = 0.30 – 0.30 = 0

Page 29: Cooperating Intelligent Systems

Decision tree learning example

Sat/Fri?
Yes: 2 T, 3 F. No: 4 T, 3 F.

$\text{Entropy} = \frac{5}{12}\left(-\frac{2}{5}\log\frac{2}{5} - \frac{3}{5}\log\frac{3}{5}\right) + \frac{7}{12}\left(-\frac{4}{7}\log\frac{4}{7} - \frac{3}{7}\log\frac{3}{7}\right) \approx 0.29$

Entropy decrease = 0.30 – 0.29 = 0.01

Page 30: Cooperating Intelligent Systems

Decision tree learning example

Hungry?
Yes: 5 T, 2 F. No: 1 T, 4 F.

$\text{Entropy} = \frac{7}{12}\left(-\frac{5}{7}\log\frac{5}{7} - \frac{2}{7}\log\frac{2}{7}\right) + \frac{5}{12}\left(-\frac{1}{5}\log\frac{1}{5} - \frac{4}{5}\log\frac{4}{5}\right) \approx 0.24$

Entropy decrease = 0.30 – 0.24 = 0.06

Page 31: Cooperating Intelligent Systems

Decision tree learning example

Raining?
Yes: 2 T, 2 F. No: 4 T, 4 F.

$\text{Entropy} = \frac{4}{12}\left(-\frac{2}{4}\log\frac{2}{4} - \frac{2}{4}\log\frac{2}{4}\right) + \frac{8}{12}\left(-\frac{4}{8}\log\frac{4}{8} - \frac{4}{8}\log\frac{4}{8}\right) \approx 0.30$

Entropy decrease = 0.30 – 0.30 = 0

Page 32: Cooperating Intelligent Systems

Decision tree learning example

Reservation?
Yes: 3 T, 2 F. No: 3 T, 4 F.

$\text{Entropy} = \frac{5}{12}\left(-\frac{3}{5}\log\frac{3}{5} - \frac{2}{5}\log\frac{2}{5}\right) + \frac{7}{12}\left(-\frac{3}{7}\log\frac{3}{7} - \frac{4}{7}\log\frac{4}{7}\right) \approx 0.29$

Entropy decrease = 0.30 – 0.29 = 0.01

Page 33: Cooperating Intelligent Systems

Decision tree learning example

Patrons?
None: 2 F. Some: 4 T. Full: 2 T, 4 F.

$\text{Entropy} = \frac{2}{12}\left(-\frac{0}{2}\log\frac{0}{2} - \frac{2}{2}\log\frac{2}{2}\right) + \frac{4}{12}\left(-\frac{4}{4}\log\frac{4}{4} - \frac{0}{4}\log\frac{0}{4}\right) + \frac{6}{12}\left(-\frac{2}{6}\log\frac{2}{6} - \frac{4}{6}\log\frac{4}{6}\right) \approx 0.14$

(with $0 \log 0$ taken as 0)

Entropy decrease = 0.30 – 0.14 = 0.16

Page 34: Cooperating Intelligent Systems

Decision tree learning example

Price
$: 3 T, 3 F. $$: 1 T, 3 F. $$$: 2 T.

$\text{Entropy} = \frac{6}{12}\left(-\frac{3}{6}\log\frac{3}{6} - \frac{3}{6}\log\frac{3}{6}\right) + \frac{4}{12}\left(-\frac{1}{4}\log\frac{1}{4} - \frac{3}{4}\log\frac{3}{4}\right) + \frac{2}{12}\left(-\frac{2}{2}\log\frac{2}{2} - \frac{0}{2}\log\frac{0}{2}\right) \approx 0.23$

Entropy decrease = 0.30 – 0.23 = 0.07

Page 35: Cooperating Intelligent Systems

Decision tree learning example

Type
French: 1 T, 1 F. Italian: 1 T, 1 F. Thai: 2 T, 2 F. Burger: 2 T, 2 F.

$\text{Entropy} = 2 \cdot \frac{2}{12}\left(-\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2}\right) + 2 \cdot \frac{4}{12}\left(-\frac{2}{4}\log\frac{2}{4} - \frac{2}{4}\log\frac{2}{4}\right) \approx 0.30$

Entropy decrease = 0.30 – 0.30 = 0

Page 36: Cooperating Intelligent Systems

Decision tree learning example

Est. waiting time
0–10 min: 4 T, 2 F. 10–30 min: 2 F. 30–60 min: 1 T, 1 F. >60 min: 1 T, 1 F.

$\text{Entropy} = \frac{6}{12}\left(-\frac{4}{6}\log\frac{4}{6} - \frac{2}{6}\log\frac{2}{6}\right) + \frac{2}{12}\left(-\frac{0}{2}\log\frac{0}{2} - \frac{2}{2}\log\frac{2}{2}\right) + 2 \cdot \frac{2}{12}\left(-\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2}\right) \approx 0.24$

Entropy decrease = 0.30 – 0.24 = 0.06

Page 37: Cooperating Intelligent Systems

Decision tree learning example

Patrons?
None: 2 F. Some: 4 T. Full: 2 T, 4 F.

The largest entropy decrease (0.16) is achieved by splitting on Patrons. The None and Some branches are already pure; only the Full branch needs a further split.

Continue like this, making new splits and always purifying the nodes; the sketch below reproduces the full comparison.
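For completeness, a sketch that reproduces the attribute comparison from the preceding slides (the `(n_true, n_false)` counts per branch are read off the slides; the helpers repeat the earlier sketches):

```python
import math

def entropy(probs):
    return -sum(p * math.log10(p) for p in probs if p > 0)

def split_entropy(subsets):
    total = sum(t + f for t, f in subsets)
    return sum((t + f) / total * entropy([t / (t + f), f / (t + f)])
               for t, f in subsets if t + f > 0)

# (n_true, n_false) per branch, read off the preceding slides
splits = {
    "Alternate":   [(3, 3), (3, 3)],
    "Bar":         [(3, 3), (3, 3)],
    "Fri/Sat":     [(2, 3), (4, 3)],
    "Hungry":      [(5, 2), (1, 4)],
    "Raining":     [(2, 2), (4, 4)],
    "Reservation": [(3, 2), (3, 4)],
    "Patrons":     [(0, 2), (4, 0), (2, 4)],
    "Price":       [(3, 3), (1, 3), (2, 0)],
    "Type":        [(1, 1), (1, 1), (2, 2), (2, 2)],
    "Wait":        [(4, 2), (0, 2), (1, 1), (1, 1)],
}

root = entropy([6 / 12, 6 / 12])  # ~0.30
for name, subsets in splits.items():
    print(f"{name:12s} entropy decrease = {root - split_entropy(subsets):.2f}")
# Patrons gives the largest decrease (~0.16) and is chosen first.
```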

Page 38: Cooperating Intelligent Systems

Decision tree learning example

Induced tree (from examples)

Page 39: Cooperating Intelligent Systems

Decision tree learning example

True tree

Page 40: Cooperating Intelligent Systems

Decision tree learning example

Induced tree (from examples)

Cannot make it more complex than what the data supports.

Page 41: Cooperating Intelligent Systems

How do we know it is correct?

How do we know that h ≈ f? (Hume's Problem of Induction)

– Try h on a new test set of examples (cross-validation)

...and assume the ”principle of uniformity”, i.e. the result we get on this test data should be indicative of results on future data. Causality is constant.

Inspired by a slide by V. Pavlovic

Page 42: Cooperating Intelligent Systems

Learning curve for the decision tree algorithm on 100 randomly generated examples in the restaurant domain. The graph summarizes 20 trials.

Page 43: Cooperating Intelligent Systems

Cross-validation

Use a “validation set”: split your data set into two parts, one for training your model (D_train) and the other for validating your model (D_val). The error on the validation data is called the “validation error” (E_val), and it serves as the estimate of the generalization error:

$E_{\text{gen}} \approx E_{\text{val}}$
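A minimal holdout-split sketch (the array-based data layout and the 30% validation fraction are assumptions for the demo):

```python
import numpy as np

def holdout_split(X, y, val_fraction=0.3, seed=0):
    """Randomly partition (X, y) into D_train and D_val."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx])

# Fit on D_train only; the error measured on D_val is E_val,
# the estimate of E_gen.
```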

Page 44: Cooperating Intelligent Systems

K-Fold Cross-validation

More accurate than using only one validation set: rotate the roles of D_train and D_val over K different splits of the data, giving validation errors E_val(1), E_val(2), …, E_val(K), and average them:

$E_{\text{gen}} \approx \bar{E}_{\text{val}} = \frac{1}{K}\sum_{k=1}^{K} E_{\text{val}}(k)$
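A sketch of this procedure (the `train_and_evaluate` callback is an assumed stand-in for fitting a model on D_train and measuring its error on D_val):

```python
import numpy as np

def k_fold_error(X, y, train_and_evaluate, K=3, seed=0):
    """Estimate E_gen as the mean of E_val(k) over K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        errors.append(train_and_evaluate(X[train], y[train], X[val], y[val]))
    return np.mean(errors)  # (1/K) * sum_k E_val(k)
```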

Page 45: Cooperating Intelligent Systems

PAC

• Any hypothesis that is consistent with a sufficiently large set of training (and test) examples is unlikely to be seriously wrong; it is probably approximately correct (PAC).

• What is the relationship between the generalization error and the number of samples needed to achieve this generalization error?

Page 46: Cooperating Intelligent Systems

The error

[Figure: instance space X, with the hypotheses f and h drawn as regions; the shaded area is where f and h disagree.]

X = the set of all possible examples (instance space).
D = the distribution of these examples.
H = the hypothesis space ($h \in \mathcal{H}$).
N = the number of training data.

$\text{error}(h) = P(h(\mathbf{x}) \neq f(\mathbf{x}) \mid \mathbf{x}\ \text{drawn from}\ D)$

Image adapted from F. Hoffmann @ KTH

Page 47: Cooperating Intelligent Systems

Probability for a bad hypothesis

Suppose we have a bad hypothesis h, with error(h) > ε. What is the probability that it is consistent with N samples?

• Probability of being inconsistent with one sample = error(h) > ε.

• Probability of being consistent with one sample = 1 – error(h) < 1 – ε.

• Probability of being consistent with N independently drawn samples < (1 – ε)^N.


Page 50: Cooperating Intelligent Systems

Probability for a bad hypothesis

What is the probability that the set $\mathcal{H}_{\text{bad}}$ of bad hypotheses, those with error(h) > ε, contains a consistent hypothesis?

$P(\mathcal{H}_{\text{bad}}\ \text{contains a consistent}\ h) \leq |\mathcal{H}_{\text{bad}}|(1-\varepsilon)^N \leq |\mathcal{H}|(1-\varepsilon)^N$

If we want this to be less than some constant δ, then

$|\mathcal{H}|(1-\varepsilon)^N \leq \delta \quad\Leftrightarrow\quad N \geq \frac{\ln|\mathcal{H}| + \ln(1/\delta)}{-\ln(1-\varepsilon)}$

and since $-\ln(1-\varepsilon) \geq \varepsilon$, it suffices to take

$N \geq \frac{1}{\varepsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$

Don’t expect to learn very well if H is large.
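Evaluating the bound numerically (the values of |H|, ε, and δ below are illustrative assumptions):

```python
import math

def pac_samples(h_size, epsilon, delta):
    """Samples sufficient so that P(a consistent h is bad) <= delta."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

# e.g. |H| = 2**10 hypotheses, epsilon = 0.1, delta = 0.05:
print(pac_samples(2**10, 0.1, 0.05))  # 100 samples suffice
```

Note how the required N grows only logarithmically in |H| but linearly in 1/ε.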

Page 51: Cooperating Intelligent Systems

How do we make learning work?

• Use simple hypotheses
  – Always start with the simple ones first
• Constrain H with priors
  – Do we know something about the domain?
  – Do we have reasonable a priori beliefs on parameters?
• Use many observations
  – Easy to say...
• Cross-validation...