
Page 1

Probability and Information

Copyright, 1996 © Dale Carnegie & Associates, Inc.

A brief review

Page 2

Probability

Probability provides a way of summarizing uncertainty that comes from our laziness and ignorance - how wonderful it is!

Probability as a degree of belief in the truth of a sentence: 1 means true, 0 means false, and 0 < P < 1 indicates an intermediate degree of belief in the truth of the sentence.

Degree of truth (fuzzy logic) vs. degree of belief (probability).

Page 3

All probability statements must indicate the evidence with respect to which the probability is being assessed.

Prior or unconditional probability: assessed before any evidence is obtained

Posterior or conditional probability: assessed given some evidence

Page 4

Basic probability notation

Prior probability of a proposition: P(Sunny); of a random variable value: P(Weather = Sunny)

Each random variable has a domain, e.g. Weather has domain <Sunny, Cloudy, Rain, Snow>

Probability distribution: P(Weather) = <0.7, 0.2, 0.08, 0.02>

A random variable is not a number; a number may be obtained by observing an RV.

A random variable can be continuous or discrete.
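
As a minimal sketch (Python; the dictionary name is mine, not from the slides), the distribution P(Weather) above can be stored as a mapping from domain values to probabilities, and observing the random variable corresponds to drawing one value from it:

import random

# P(Weather) over the domain <Sunny, Cloudy, Rain, Snow>, as on this slide
p_weather = {"Sunny": 0.7, "Cloudy": 0.2, "Rain": 0.08, "Snow": 0.02}
assert abs(sum(p_weather.values()) - 1.0) < 1e-9     # a distribution sums to 1

# "Observing" the random variable yields one value from its domain
observation = random.choices(list(p_weather), weights=list(p_weather.values()))[0]
print(observation)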

Page 5

Conditional Probability

Definition: P(A|B) = P(A^B)/P(B), defined when P(B) > 0

Product rule: P(A^B) = P(A|B)P(B)

Probabilistic inference does not work like logical inference: P(A|B) is not a rule "if B then conclude A with some probability"; it applies only when B is all that is known.
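
A tiny numeric illustration of the definition and the product rule, using made-up values for P(B) and P(A^B) (these numbers are purely illustrative):

p_b = 0.5                                  # hypothetical P(B)
p_a_and_b = 0.2                            # hypothetical P(A^B)
p_a_given_b = p_a_and_b / p_b              # definition: P(A|B) = P(A^B)/P(B) -> 0.4
assert abs(p_a_given_b * p_b - p_a_and_b) < 1e-12    # product rule: P(A^B) = P(A|B)P(B)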

Page 6

The axioms of probability

All probabilities are between 0 and 1: 0 <= P(A) <= 1

Necessarily true (valid) propositions have probability 1; necessarily false (unsatisfiable) propositions have probability 0

The probability of a disjunction: P(A v B) = P(A) + P(B) - P(A^B)
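
A quick numeric check of the disjunction rule, using made-up probabilities chosen only for illustration:

p_a, p_b, p_a_and_b = 0.3, 0.4, 0.1        # hypothetical values, not from the slides
p_a_or_b = p_a + p_b - p_a_and_b           # P(A v B) = P(A) + P(B) - P(A^B) -> 0.6
assert 0.0 <= p_a_or_b <= 1.0              # consistent with the first axiom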

Page 7

The joint probability distribution

The joint distribution completely specifies probability assignments to all propositions in the domain.

A probabilistic model consists of a set of random variables (X1, ..., Xn).

An atomic event is an assignment of particular values to all the variables.

Marginalization rule for RVs Y and Z: P(Y) = Σz P(Y, z), summing over all values z of Z.

Let's see an example next.
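
A minimal sketch of marginalization in Python, over a small joint distribution with made-up values (the variable and value names are hypothetical):

# A joint distribution over two RVs, keyed by (y, z) value pairs (hypothetical numbers)
joint = {("y1", "z1"): 0.3, ("y1", "z2"): 0.2, ("y2", "z1"): 0.1, ("y2", "z2"): 0.4}

def marginal_y(joint):
    """Marginalization rule: P(Y) = sum over z of P(Y, z)."""
    p_y = {}
    for (y, z), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return p_y

print(marginal_y(joint))    # {'y1': 0.5, 'y2': 0.5}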

Page 8

Joint Probability

An example with two Boolean variables:

              Toothache    !Toothache
  Cavity         0.04         0.01
  !Cavity        0.06         0.89

Observations: the four atomic events are mutually exclusive and collectively exhaustive (the entries sum to 1).

What are

P(Cavity) =
P(Cavity V Toothache) =
P(Cavity ^ Toothache) =
P(Cavity | Toothache) =
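
As a sketch of how these quantities can be read off the joint table above (Python; variable names are my own, but the numbers are exactly those in the table):

# Joint distribution from the table, keyed by (Cavity, Toothache) truth values
joint = {
    (True,  True):  0.04,   # Cavity ^ Toothache
    (True,  False): 0.01,   # Cavity ^ !Toothache
    (False, True):  0.06,   # !Cavity ^ Toothache
    (False, False): 0.89,   # !Cavity ^ !Toothache
}

p_cavity = sum(p for (c, t), p in joint.items() if c)                    # 0.04 + 0.01 = 0.05
p_toothache = sum(p for (c, t), p in joint.items() if t)                 # 0.04 + 0.06 = 0.10
p_cavity_or_toothache = sum(p for (c, t), p in joint.items() if c or t)  # 0.11
p_cavity_and_toothache = joint[(True, True)]                             # 0.04
p_cavity_given_toothache = p_cavity_and_toothache / p_toothache          # 0.04 / 0.10 = 0.4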

Page 9

Bayes’ rule

Deriving the rule from the product rule: P(B|A) = P(A|B)P(B)/P(A)

P(A) can be viewed as a normalization factor that makes P(B|A) + P(!B|A) = 1, with P(A) = P(A|B)P(B) + P(A|!B)P(!B)

A more general case: P(X|Y) = P(Y|X)P(X)/P(Y)

Bayes' rule conditionalized on background evidence E: P(X|Y,E) = P(Y|X,E)P(X|E)/P(Y|E)
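
A minimal sketch of the rule and its normalization factor, using made-up numbers (the 0.01 / 0.9 / 0.05 values are purely illustrative, not from the slides):

p_b = 0.01                   # hypothetical prior P(B)
p_a_given_b = 0.9            # hypothetical P(A|B)
p_a_given_not_b = 0.05       # hypothetical P(A|!B)

# Normalization: P(A) = P(A|B)P(B) + P(A|!B)P(!B)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

# Bayes' rule: P(B|A) = P(A|B)P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
p_not_b_given_a = p_a_given_not_b * (1 - p_b) / p_a

assert abs(p_b_given_a + p_not_b_given_a - 1.0) < 1e-12   # posteriors sum to 1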

Page 10

Independence

Independent events A, B: P(B|A) = P(B), P(A|B) = P(A), P(A,B) = P(A)P(B)

Conditional independence: P(X|Y,Z) = P(X|Z), i.e., given Z, X and Y are independent.
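
A small sketch of what independence means numerically, with made-up marginals: the joint below is constructed as the product of its marginals, so the identities on this slide hold by design:

p_a, p_b = 0.3, 0.6                        # hypothetical marginals P(A), P(B)
joint = {(a, b): (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
         for a in (True, False) for b in (True, False)}

p_ab = joint[(True, True)]                              # P(A, B)
p_b_marginal = sum(p for (a, b), p in joint.items() if b)
assert abs(p_ab - p_a * p_b) < 1e-12                    # P(A, B) = P(A) P(B)
assert abs(p_ab / p_b_marginal - p_a) < 1e-12           # P(A | B) = P(A)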

Page 11

Entropy

Entropy measures the homogeneity/purity of a set of examples: 0 for a pure (single-class) set, maximal for an evenly mixed set.

Or, viewed as information content: the less you need to know to determine the class of a new case, the more information you already have.

With two classes (P, N) in S, with p and n instances respectively, let t = p + n and view [p, n] as the class distribution of S. Then

Entropy(S) = - (p/t) log2 (p/t) - (n/t) log2 (n/t)

E.g., p = 9, n = 5: Entropy(S) = Entropy([9,5]) = - (9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940

E.g., Entropy([14,0]) = 0; Entropy([7,7]) = 1
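
A short Python sketch of this formula (the function name is mine); it reproduces the three values quoted above:

import math

def entropy(counts):
    """Entropy of a class-count list such as [p, n]; 0*log2(0) terms are skipped."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))    # 0.94  (the 0.940 on this slide)
print(entropy([14, 0]))             # 0.0
print(entropy([7, 7]))              # 1.0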

Page 12

Entropy curve

For p/(p+n) between 0 and 1, the 2-class entropy is:

0 when p/(p+n) is 0

1 when p/(p+n) is 0.5

0 when p/(p+n) is 1

monotonically increasing between 0 and 0.5, and monotonically decreasing between 0.5 and 1

Read as a transmission cost, entropy is the number of bits per example needed to send the class: 1 bit when the classes are evenly mixed, and essentially 0 bits when the data is pure.

[Figure: 2-class entropy plotted against p/(p+n); the curve rises from 0 to its peak of 1 at p/(p+n) = 0.5 and falls back to 0 at p/(p+n) = 1.]
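
A minimal sketch that traces this curve numerically (function name is mine, grid points chosen only for illustration):

import math

def two_class_entropy(q):
    """Entropy of a 2-class distribution with positive-class fraction q = p/(p+n)."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

for q in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(q, round(two_class_entropy(q), 3))
# Peaks at 1.0 when q = 0.5 and falls to 0 at q = 0 and q = 1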