

Introduction to Artificial Intelligence

Lecture 4: Representing uncertain knowledge

Prof. Gilles Louppe
[email protected]

1 / 72


2 / 72


Today

Probability:

Random variables

Joint and marginal distributions

Conditional distributions

Product rule, chain rule, Bayes' rule

Inference

Bayesian networks:

Representing uncertain knowledge

Semantics

Construction

Do not overlook this lecture!

―Image credits: CS188, UC Berkeley. 3 / 72


Quantifying uncertainty

4 / 72


A ghost is hidden in the grid somewhere.

Sensor readings tell how close a square is to the ghost:

On the ghost: red

1 or 2 away: orange

3 away: yellow

4+ away: green

Sensors are noisy, but we know the probability values P(color ∣ distance), for all colors and all distances.

―Image credits: CS188, UC Berkeley. 5 / 72


(Video demo.)

―Image credits: CS188, UC Berkeley. 6 / 72


Uncertainty

General setup:

Observed variables or evidence: the agent knows certain things about the state of the world (e.g., sensor readings).

Unobserved variables: the agent needs to reason about other aspects that are uncertain (e.g., where the ghost is).

(Probabilistic) model: the agent knows or believes something about how the known variables relate to the unknown variables.

7 / 72


How to handle uncertainty?

A purely logical approach either risks falsehood, or leads to conclusions that are too weak for decision making.

Probabilistic reasoning provides a framework for managing our knowledge and beliefs.

8 / 72


Probabilistic assertions

Probabilistic assertions express the agent's inability to reach a definite decision regarding the truth of a proposition.

Probability values summarize effects of

laziness (failure to enumerate all world states),

ignorance (lack of relevant facts, initial conditions, correct model, etc.).

Probabilities relate propositions to one's own state of knowledge (or lack thereof).

e.g., P(ghost in cell [3, 2]) = 0.02

9 / 72


Frequentism vs. Bayesianism

What do probability values represent?

The objectivist frequentist view is that probabilities are real aspects of the universe.

i.e., propensities of objects to behave in certain ways.

e.g., the fact that a fair coin comes up heads with probability 0.5 is a propensity of the coin itself.

The subjectivist Bayesian view is that probabilities are a way of characterizing an agent's beliefs or uncertainty.

i.e., probabilities do not have external physical significance.

This is the interpretation of probabilities that we will use!

10 / 72


Kolmogorov's probability theory

Begin with a set Ω, the sample space.

ω ∈ Ω is a sample point or possible world.

A probability space is a sample space equipped with a probability function, i.e. an assignment P : 𝒫(Ω) → ℝ such that:

1st axiom: 0 ≤ P(ω), for all ω ∈ Ω

2nd axiom: P(Ω) = 1

3rd axiom: P({ω₁, ..., ωₙ}) = P(ω₁) + ... + P(ωₙ), for any set of sample points {ω₁, ..., ωₙ}

where 𝒫(Ω) denotes the power set of Ω.

11 / 72


Example

Ω = the 6 possible rolls of a die.

ωᵢ (for i = 1, ..., 6) are the sample points, each corresponding to an outcome of the die.

Assignment P for a fair die:

P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6

12 / 72


Random variables

A random variable is a function X : Ω → D_X from the sample space to some domain D_X defining its outcomes.

e.g., Odd : Ω → {true, false} such that Odd(ω) = (ω mod 2 = 1).

P induces a probability distribution for any random variable X:

P(X = xᵢ) = ∑_{ω : X(ω) = xᵢ} P(ω)

e.g., P(Odd = true) = P(1) + P(3) + P(5) = 1/2.

An event E is a set {(x₁, ..., xₙ)ᵢ} of outcomes of the variables X₁, ..., Xₙ, such that

P(E) = ∑_{(x₁, ..., xₙ) ∈ E} P(X₁ = x₁, ..., Xₙ = xₙ).

13 / 72


Notations

Random variables are written in upper case roman letters: X, Y, etc.

Realizations of a random variable are written in corresponding lower case letters. E.g., x₁, x₂, ..., xₙ could be outcomes of the random variable X.

The probability value of the realization x is written as P(X = x). When clear from context, this will be abbreviated as P(x).

The probability distribution of the (discrete) random variable X is denoted as P(X). This corresponds e.g. to a vector of numbers, one for each of the probability values P(X = xᵢ) (and not to a single scalar value!).

14 / 72


Probability distributions

For discrete variables, the probability distribution can be encoded by a discrete list of the probabilities of the outcomes, known as the probability mass function.

One can think of the probability distribution as a table that associates a probability value to each outcome of the variable.

P(W):

W       P
sun     0.6
rain    0.1
fog     0.3
meteor  0.0
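As a running illustration (added here, not part of the original deck), such a table can be stored in Python as a dict mapping outcomes to probability values:

    # The table P(W) above, as a probability mass function.
    P_W = {'sun': 0.6, 'rain': 0.1, 'fog': 0.3, 'meteor': 0.0}
    assert abs(sum(P_W.values()) - 1.0) < 1e-9  # 2nd axiom: total probability is 1

The later snippets in this lecture reuse this dict-of-outcomes representation.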

―Image credits: CS188, UC Berkeley. 15 / 72


Joint distributions

A joint probability distribution over a set of random variables X₁, ..., Xₙ specifies the probability of each (combined) outcome:

P(X₁ = x₁, ..., Xₙ = xₙ) = ∑_{ω : X₁(ω) = x₁, ..., Xₙ(ω) = xₙ} P(ω)

P(T, W):

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

16 / 72


From a joint distribution, the probability of any event can be calculated.

Probability that it is hot and sunny?

Probability that it is hot?

Probability that it is hot or sunny?

Interesting events often correspond to partial assignments, e.g., P(hot).
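As a sketch of these computations (reusing the dict representation above; P_TW is a hypothetical name for the joint P(T, W) of this slide), each question is a sum over matching rows:

    # The joint P(T, W) as a dict over combined outcomes (t, w).
    P_TW = {('hot', 'sun'): 0.4, ('hot', 'rain'): 0.1,
            ('cold', 'sun'): 0.2, ('cold', 'rain'): 0.3}

    def prob(event):
        # P(E) = sum of the probabilities of the outcomes where the event holds.
        return sum(p for outcome, p in P_TW.items() if event(outcome))

    print(prob(lambda o: o == ('hot', 'sun')))             # hot and sunny: 0.4
    print(prob(lambda o: o[0] == 'hot'))                   # hot: 0.5
    print(prob(lambda o: o[0] == 'hot' or o[1] == 'sun'))  # hot or sunny: 0.7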

17 / 72


Marginal distributions

The marginal distribution of a subset of a collection of random variables is the joint probability distribution of the variables contained in the subset.

Intuitively, marginal distributions are sub-tables which eliminate variables.

P(T, W):

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

P(T), with P(t) = ∑_w P(t, w):

T     P
hot   0.5
cold  0.5

P(W), with P(w) = ∑_t P(t, w):

W     P
sun   0.6
rain  0.4
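A minimal sketch of marginalization over the dict representation (P_TW as above):

    from collections import defaultdict

    def marginal(joint, axis):
        # Keep the variable at position `axis` and sum the others out.
        out = defaultdict(float)
        for outcome, p in joint.items():
            out[outcome[axis]] += p
        return dict(out)

    print(marginal(P_TW, 0))  # {'hot': 0.5, 'cold': 0.5}
    print(marginal(P_TW, 1))  # {'sun': 0.6, 'rain': 0.4} (up to float rounding)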

18 / 72


Conditional distributions

The conditional probability of a realization a given the realization b is defined as the ratio of the probability of the joint realization of a and b, and the probability of b:

P(a∣b) = P(a, b) / P(b).

Indeed, observing B = b rules out all those possible worlds where B ≠ b, leaving a set whose total probability is just P(b). Within that set, the worlds for which A = a satisfy A = a ∧ B = b and constitute a fraction P(a, b)/P(b).

19 / 72


Conditional distributions are probability distributions over some variables, given fixed values for others.

P(T, W):

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

P(W ∣ T = hot):

W     P
sun   0.8
rain  0.2

P(W ∣ T = cold):

W     P
sun   0.4
rain  0.6

20 / 72


Normalization trick

1. Select the joint probabilities matching the evidence T = cold.

2. Normalize the selection (make it sum to 1).

P(T, W):

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

→ P(T = cold, W):

T     W     P
cold  sun   0.2
cold  rain  0.3

→ P(W ∣ T = cold):

W     P
sun   0.4
rain  0.6
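The same two steps in code (a sketch over the dict representation; P_TW as before):

    def condition(joint, axis, value):
        # 1. Select the entries matching the evidence, 2. normalize them.
        selected = {o: p for o, p in joint.items() if o[axis] == value}
        z = sum(selected.values())
        return {o: p / z for o, p in selected.items()}

    print(condition(P_TW, 0, 'cold'))  # {('cold','sun'): 0.4, ('cold','rain'): 0.6}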

21 / 72


Probabilistic inference

Probabilistic inference is the problem of computing a desired probability from other known probabilities (e.g., conditional from joint).

We generally compute conditional probabilities.

e.g., P(on time ∣ no reported accidents) = 0.9

These represent the agent's beliefs given the evidence.

Probabilities change with new evidence:

e.g., P(on time ∣ no reported accidents, 5 AM) = 0.95

e.g., P(on time ∣ no reported accidents, rain) = 0.8

e.g., P(ghost in [3, 2] ∣ red in [3, 2]) = 0.99

Observing new evidence causes beliefs to be updated.

―Image credits: CS188, UC Berkeley. 22 / 72


General case

Evidence variables: E₁, ..., Eₖ = e₁, ..., eₖ

Query variables: Q

Hidden variables: H₁, ..., Hᵣ

(Q ∪ {E₁, ..., Eₖ} ∪ {H₁, ..., Hᵣ}) = all variables X₁, ..., Xₙ

Inference is the problem of computing P(Q ∣ e₁, ..., eₖ).

23 / 72


Inference by enumeration

Start from the joint distribution P(Q, E₁, ..., Eₖ, H₁, ..., Hᵣ).

1. Select the entries consistent with the evidence E₁, ..., Eₖ = e₁, ..., eₖ.

2. Marginalize out the hidden variables to obtain the joint of the query and the evidence variables:

P(Q, e₁, ..., eₖ) = ∑_{h₁, ..., hᵣ} P(Q, h₁, ..., hᵣ, e₁, ..., eₖ).

3. Normalize:

Z = ∑_q P(q, e₁, ..., eₖ)

P(Q ∣ e₁, ..., eₖ) = (1/Z) P(Q, e₁, ..., eₖ)
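The three steps above as a sketch over the dict representation (evidence given as positions mapped to observed values; hidden variables are summed out implicitly when aggregating on the query position):

    from collections import defaultdict

    def infer(joint, query_axis, evidence):
        selected = {o: p for o, p in joint.items()
                    if all(o[a] == v for a, v in evidence.items())}  # 1. select
        unnorm = defaultdict(float)
        for o, p in selected.items():
            unnorm[o[query_axis]] += p                               # 2. marginalize
        z = sum(unnorm.values())
        return {q: p / z for q, p in unnorm.items()}                 # 3. normalize

    print(infer(P_TW, 1, {0: 'hot'}))  # {'sun': 0.8, 'rain': 0.2}, as on the earlier slide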

24 / 72


Example

P(W) = ?

P(W ∣ winter) = ?

P(W ∣ winter, hot) = ?

S       T     W     P
summer  hot   sun   0.3
summer  hot   rain  0.05
summer  cold  sun   0.1
summer  cold  rain  0.05
winter  hot   sun   0.1
winter  hot   rain  0.05
winter  cold  sun   0.15
winter  cold  rain  0.2

25 / 72


Complexity

Inference by enumeration can be used to answer probabilistic queries for discrete variables (i.e., with a finite number of values).

However, enumeration does not scale!

Assume a domain described by n variables taking at most d values.

Space complexity: O(dⁿ)

Time complexity: O(dⁿ)

Exercise

Can we reduce the size of the representation of the joint distribution?

26 / 72


Product rule

P(a, b) = P(b) P(a∣b)

Example

P(W):

W     P
sun   0.8
rain  0.2

P(D ∣ W):

D     W     P
wet   sun   0.1
dry   sun   0.9
wet   rain  0.7
dry   rain  0.3

P(D, W):

D     W     P
wet   sun   ?
dry   sun   ?
wet   rain  ?
dry   rain  ?
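The missing column is one comprehension away (a sketch; the dict names are hypothetical):

    P_W = {'sun': 0.8, 'rain': 0.2}  # the P(W) of this slide
    P_D_given_W = {('wet', 'sun'): 0.1, ('dry', 'sun'): 0.9,
                   ('wet', 'rain'): 0.7, ('dry', 'rain'): 0.3}

    # Product rule: P(d, w) = P(w) * P(d | w)
    P_DW = {(d, w): P_W[w] * p for (d, w), p in P_D_given_W.items()}
    print(P_DW)  # wet/sun 0.08, dry/sun 0.72, wet/rain 0.14, dry/rain 0.06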

27 / 72


Chain rule

More generally, any joint distribution can always be written as an incremental product of conditional distributions:

P(x₁, x₂, x₃) = P(x₁) P(x₂∣x₁) P(x₃∣x₁, x₂)

P(x₁, ..., xₙ) = ∏ᵢ₌₁ⁿ P(xᵢ ∣ x₁, ..., xᵢ₋₁)

28 / 72


Independence

A and B are independent iff, for all a ∈ D_A and b ∈ D_B:

P(a∣b) = P(a), or

P(b∣a) = P(b), or

P(a, b) = P(a) P(b)

Independence is denoted as A ⊥ B.

29 / 72


Example 1

P(toothache, catch, cavity, weather) = P(toothache, catch, cavity) P(weather)

The original 32-entry table reduces to one 8-entry and one 4-entry table (assuming 4 values for Weather and boolean values otherwise).

―Image credits: CS188, UC Berkeley. 30 / 72


Example 2

For n independent coin flips, the joint distribution can be fully factored and represented as the product of n 1-entry tables:

2ⁿ → n

31 / 72


Conditional independence

A and B are conditionally independent given C iff, for all a ∈ D_A, b ∈ D_B and c ∈ D_C:

P(a∣b, c) = P(a∣c), or

P(b∣a, c) = P(b∣c), or

P(a, b∣c) = P(a∣c) P(b∣c)

Conditional independence is denoted as A ⊥ B ∣ C.

32 / 72


Using the chain rule, the joint distribution can be factored as a product of conditional distributions.

Each conditional distribution may potentially be simplified by conditional independence.

Conditional independence assertions allow probabilistic models to scale up.

33 / 72


Example 1

Assume three random variables Toothache, Catch and Cavity.

Catch is conditionally independent of Toothache, given Cavity. Therefore, we can write:

P(toothache, catch, cavity)
= P(toothache ∣ catch, cavity) P(catch ∣ cavity) P(cavity)
= P(toothache ∣ cavity) P(catch ∣ cavity) P(cavity)

In this case, the representation of the joint distribution reduces to 2 + 2 + 1 independent numbers (instead of 2ⁿ − 1).

―Image credits: CS188, UC Berkeley. 34 / 72


Example 2 (Naive Bayes)

More generally, from the product rule, we have

P(cause, effect₁, ..., effectₙ) = P(effect₁, ..., effectₙ ∣ cause) P(cause).

Assuming pairwise conditional independence between the effects given the cause, it follows that

P(cause, effect₁, ..., effectₙ) = P(cause) ∏ᵢ P(effectᵢ ∣ cause).

This probabilistic model is called a naive Bayes model.

The complexity of this model is O(n) instead of O(2ⁿ) without the conditional independence assumptions.

Naive Bayes can work surprisingly well in practice, even when the assumptions are wrong.
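A minimal naive Bayes sketch under these assumptions (all variables boolean; the numbers are made up for illustration):

    # One prior P(cause) and one table P(effect_i = true | cause) per effect.
    p_cavity = {True: 0.2, False: 0.8}
    p_toothache = {True: 0.6, False: 0.1}  # P(toothache = true | cavity)
    p_catch = {True: 0.9, False: 0.2}      # P(catch = true | cavity)

    def bern(p, v):
        # Probability of the boolean realization v, given P(true) = p.
        return p if v else 1 - p

    def joint(cavity, toothache, catch):
        # P(cause) * prod_i P(effect_i | cause)
        return (p_cavity[cavity] * bern(p_toothache[cavity], toothache)
                * bern(p_catch[cavity], catch))

    # Posterior over the cause given observed effects, by normalization:
    unnorm = {c: joint(c, True, True) for c in (True, False)}
    z = sum(unnorm.values())
    print({c: p / z for c, p in unnorm.items()})  # cavity is now far more likely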

35 / 72


Study the next slide. Twice.

35 / 72


The Bayes' rule

The product rule defines two ways to factor the joint distribution of two random variables:

P(a, b) = P(a∣b) P(b) = P(b∣a) P(a)

Therefore,

P(a∣b) = P(b∣a) P(a) / P(b).

P(a) is the prior belief on a.

P(b) is the probability of the evidence b.

P(a∣b) is the posterior belief on a, given the evidence b.

P(b∣a) is the conditional probability of b given a. Depending on the context, this term is called the likelihood.

36 / 72


The Bayes' rule is the foundation of many AI systems.

P(a∣b) = P(b∣a) P(a) / P(b)

37 / 72


Example 1: diagnostic probability from causal probability.

P(cause ∣ effect) = P(effect ∣ cause) P(cause) / P(effect)

where

P(effect ∣ cause) quantifies the relationship in the causal direction,

P(cause ∣ effect) describes the diagnostic direction.

Let S = stiff neck and M = meningitis. Given P(s∣m) = 0.7, P(m) = 1/50000 and P(s) = 0.01, it follows that

P(m∣s) = P(s∣m) P(m) / P(s) = (0.7 × 1/50000) / 0.01 = 0.0014.

38 / 72


Example 2: Ghostbusters, revisited

Let us assume a random variable G for the ghost location and a set of random variables Rᵢⱼ for the individual readings.

We start with a uniform prior distribution P(G) over ghost locations.

We assume a sensor reading model P(Rᵢⱼ ∣ G). That is, we know what the sensors do.

Rᵢⱼ = reading color measured at [i, j]

e.g., P(R₁,₁ = yellow ∣ G = [1, 1]) = 0.1

Two readings are conditionally independent, given the ghost position.

39 / 72


We can calculate the posterior distribution P(G ∣ Rᵢⱼ) using Bayes' rule:

P(G ∣ Rᵢⱼ) = P(Rᵢⱼ ∣ G) P(G) / P(Rᵢⱼ).

For the next reading Rᵢ′ⱼ′, this posterior distribution becomes the prior distribution over ghost locations, which we update similarly.
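A sketch of this sequential update on a tiny grid (the sensor model numbers below are invented for illustration, not the ones used in the demo):

    # Uniform prior over a 2x2 grid of ghost locations.
    prior = {(i, j): 0.25 for i in range(2) for j in range(2)}

    def sensor_model(color, cell, ghost):
        # Hypothetical P(color read at `cell` | ghost at `ghost`), by distance.
        dist = abs(cell[0] - ghost[0]) + abs(cell[1] - ghost[1])
        table = {0: {'red': 0.7, 'orange': 0.2, 'yellow': 0.1},
                 1: {'red': 0.2, 'orange': 0.5, 'yellow': 0.3},
                 2: {'red': 0.1, 'orange': 0.3, 'yellow': 0.6}}
        return table[dist][color]

    def update(belief, color, cell):
        # Bayes' rule: the posterior over G is proportional to likelihood * prior.
        unnorm = {g: sensor_model(color, cell, g) * p for g, p in belief.items()}
        z = sum(unnorm.values())  # z is the evidence term P(reading)
        return {g: p / z for g, p in unnorm.items()}

    belief = update(prior, 'red', (0, 0))      # first reading
    belief = update(belief, 'orange', (1, 1))  # the posterior becomes the prior
    print(max(belief, key=belief.get))         # most likely ghost location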

40 / 72


(Video demo.)

―Image credits: CS188, UC Berkeley. 41 / 72


Example 3: AI for Science

42 / 72


(Diagram: the SM with parameters θ produces simulated observables x; these are compared with real observations x_obs.)

43 / 72


Given some observation x and prior beliefs p(θ), science is about updating one's knowledge, which may be framed as computing

p(θ∣x) = p(x∣θ) p(θ) / p(x).

44 / 72


Probabilistic reasoning

45 / 72


Representing knowledge

The joint probability distribution can answer any question about the domain.

However, its representation can become intractably large as the number of variables grows.

Independence and conditional independence reduce the number of probabilities that need to be specified in order to define the full joint distribution.

These relationships can be represented explicitly in the form of a Bayesian network.

46 / 72


Bayesian networks

A Bayesian network is a directed acyclic graph (DAG) in which:

Each node corresponds to a random variable.

Can be observed or unobserved.

Can be discrete or continuous.

Each edge indicates a dependency relationship.

If there is an arrow from node X to node Y, X is said to be a parent of Y.

Each node Xᵢ is annotated with a conditional probability distribution P(Xᵢ ∣ parents(Xᵢ)) that quantifies the effect of the parents on the node.

47 / 72


Example 1

I am at work, neighbor John calls to say my alarm is ringing, but neighbor Mary does not call. Sometimes it's set off by minor earthquakes. Is there a burglar?

Variables: Burglar, Earthquake, Alarm, JohnCalls, MaryCalls.

Network topology from "causal" knowledge:

A burglar can set the alarm off.

An earthquake can set the alarm off.

The alarm can cause Mary to call.

The alarm can cause John to call.

―Image credits: CS188, UC Berkeley. 48 / 72


49 / 72


Semantics

A Bayesian network implicitly encodes the full joint distribution as the product of the local distributions:

P(x₁, ..., xₙ) = ∏ᵢ₌₁ⁿ P(xᵢ ∣ parents(Xᵢ))

Example

P(j, m, a, ¬b, ¬e) = P(j∣a) P(m∣a) P(a∣¬b, ¬e) P(¬b) P(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
≈ 0.00063
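As a sketch, the same product can be computed from the CPTs. The example pins down P(¬b) = 0.999 and P(¬e) = 0.998; the remaining entries below are the standard values of this classic burglary network and should be read as assumptions:

    p_b, p_e = 0.001, 0.002                             # P(b), P(e)
    p_a = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}  # P(a = true | b, e)
    p_j = {True: 0.90, False: 0.05}                     # P(j = true | a)
    p_m = {True: 0.70, False: 0.01}                     # P(m = true | a)

    def bern(p, v):
        return p if v else 1 - p

    def joint(b, e, a, j, m):
        # Product of the local conditional distributions along the network.
        return (bern(p_b, b) * bern(p_e, e) * bern(p_a[(b, e)], a)
                * bern(p_j[a], j) * bern(p_m[a], m))

    print(joint(b=False, e=False, a=True, j=True, m=True))  # ≈ 0.00063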

50 / 72


Why does ∏ᵢ₌₁ⁿ P(xᵢ ∣ parents(Xᵢ)) result in the proper joint probability?

By the chain rule, P(x₁, ..., xₙ) = ∏ᵢ₌₁ⁿ P(xᵢ ∣ x₁, ..., xᵢ₋₁).

Provided that we assume conditional independence of Xᵢ with its predecessors in the ordering given the parents, and provided parents(Xᵢ) ⊆ {X₁, ..., Xᵢ₋₁}:

P(xᵢ ∣ x₁, ..., xᵢ₋₁) = P(xᵢ ∣ parents(Xᵢ)).

Therefore, P(x₁, ..., xₙ) = ∏ᵢ₌₁ⁿ P(xᵢ ∣ parents(Xᵢ)).

51 / 72


Example 2

The topology of the network encodes conditional independence assertions:

Weather is independent of the other variables.

Toothache and Catch are conditionally independent given Cavity.

―Image credits: CS188, UC Berkeley. 52 / 72


Example 3

P(R):

R    P
r    0.25
¬r   0.75

P(T ∣ R):

R    T    P
r    t    0.75
r    ¬t   0.25
¬r   t    0.5
¬r   ¬t   0.5

―Image credits: CS188, UC Berkeley.

53 / 72


Example 3 (bis)

P(T):

T    P
t    9/16
¬t   7/16

P(R ∣ T):

T    R    P
t    r    1/3
t    ¬r   2/3
¬t   r    1/7
¬t   ¬r   6/7

―Image credits: CS188, UC Berkeley. 54 / 72


Construction

Bayesian networks are correct representations of the domain only if each node is conditionally independent of its other predecessors in the node ordering, given its parents.

Construction algorithm (see the sketch after this list):

1. Choose some ordering of the variables X₁, ..., Xₙ.

2. For i = 1 to n:

   1. Add Xᵢ to the network.

   2. Select a minimal set of parents from X₁, ..., Xᵢ₋₁ such that P(xᵢ ∣ x₁, ..., xᵢ₋₁) = P(xᵢ ∣ parents(Xᵢ)).

   3. For each parent, insert a link from the parent to Xᵢ.

   4. Write down the CPT.
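A brute-force sketch of step 2.2, testing the conditional independence condition directly against a full joint given as a dict (exponential, assumes a strictly positive joint; for intuition only):

    from itertools import combinations

    def cond(joint, axis, value, given):
        # P(X_axis = value | given), with `given` mapping positions to values.
        sel = {o: p for o, p in joint.items()
               if all(o[a] == v for a, v in given.items())}
        z = sum(sel.values())
        return sum(p for o, p in sel.items() if o[axis] == value) / z

    def minimal_parents(joint, i):
        # Smallest S ⊆ {0, ..., i-1} with P(x_i | x_0..x_{i-1}) = P(x_i | S)
        # for every outcome; tries candidate parent sets by increasing size.
        for k in range(i + 1):
            for S in combinations(range(i), k):
                if all(abs(cond(joint, i, o[i], {a: o[a] for a in range(i)})
                           - cond(joint, i, o[i], {a: o[a] for a in S})) < 1e-9
                       for o in joint):
                    return S

e.g., on a three-variable joint, minimal_parents(joint, 2) returns the smallest set of predecessor positions that screen the third variable off from the rest.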

55 / 72


Exercise

Do these networks represent the same distribution?

56 / 72


Compactness

A CPT for boolean Xᵢ with k boolean parents has 2ᵏ rows for the combinations of parent values.

Each row requires one number p for Xᵢ = true. The number for Xᵢ = false is just 1 − p.

If each variable has no more than k parents, the complete network requires O(n × 2ᵏ) numbers.

i.e., the representation grows linearly with n, vs. O(2ⁿ) for the full joint distribution.

For the burglary net, we need 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2⁵ − 1 = 31).

Compactness depends on the node ordering.

57 / 72


Independence

Important question: Are two nodes independent given certain evidence?

If yes, this can be proved using algebra (tedious).

If no, this can be proved with a counter-example.

Example: Are X and Z necessarily independent?

58 / 72


Cascades

P(x, y, z) = P(x) P(y∣x) P(z∣y)

X: low pressure, Y: rain, Z: traffic.

Is X independent of Z? No.

Counter-example:

Low pressure causes rain, which causes traffic; high pressure causes no rain, hence no traffic.

In numbers:

P(y∣x) = 1, P(¬y∣¬x) = 1,

P(z∣y) = 1, P(¬z∣¬y) = 1.

―Image credits: CS188, UC Berkeley. 59 / 72


Is X independent of Z, given Y? Yes:

P(z∣x, y) = P(x, y, z) / P(x, y)
          = P(x) P(y∣x) P(z∣y) / (P(x) P(y∣x))
          = P(z∣y)

We say that the evidence along the cascade "blocks" the influence.

(X: low pressure, Y: rain, Z: traffic; P(x, y, z) = P(x) P(y∣x) P(z∣y).)

―Image credits: CS188, UC Berkeley. 60 / 72


Common parent

P(x, y, z) = P(y) P(x∣y) P(z∣y)

X: forum busy, Y: project due, Z: lab full.

Is X independent of Z? No.

Counter-example:

Project due causes both forum busy and lab full.

In numbers:

P(x∣y) = 1, P(¬x∣¬y) = 1,

P(z∣y) = 1, P(¬z∣¬y) = 1.

―Image credits: CS188, UC Berkeley. 61 / 72


Is X independent of Z, given Y? Yes:

P(z∣x, y) = P(x, y, z) / P(x, y)
          = P(y) P(x∣y) P(z∣y) / (P(y) P(x∣y))
          = P(z∣y)

Observing the parent blocks the influence between the children.

(X: forum busy, Y: project due, Z: lab full; P(x, y, z) = P(y) P(x∣y) P(z∣y).)

―Image credits: CS188, UC Berkeley. 62 / 72


v-structures

P(x, y, z) = P(x) P(y) P(z∣x, y)

X: rain, Y: ballgame, Z: traffic.

Are X and Y independent? Yes.

The ballgame and the rain cause traffic, but they are not correlated. (Prove it!)

Are X and Y independent given Z? No!

Seeing traffic puts the rain and the ballgame in competition as explanations.

This is backwards from the previous cases. Observing a child node activates influence between its parents.

―Image credits: CS188, UC Berkeley. 63 / 72


d-separation

Let us assume a complete Bayesian network. Are Xᵢ and Xⱼ conditionally independent given evidence Z₁ = z₁, ..., Zₘ = zₘ?

Consider all (undirected) paths from Xᵢ to Xⱼ:

If there is one or more active path, then independence is not guaranteed.

Otherwise (i.e., all paths are inactive), independence is guaranteed.

64 / 72


A path is active if each triple along the path is active:

Cascade A → B → C where B is unobserved (either direction).

Common parent A ← B → C where B is unobserved.

v-structure A → B ← C where B or one of its descendants is observed.
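The per-triple rule as code (a full d-separation check would enumerate all undirected paths and apply this test to each consecutive triple; the helper names are hypothetical):

    def triple_active(kind, b_observed, b_descendant_observed=False):
        # kind: 'cascade' (A->B->C, either direction), 'common_parent' (A<-B->C),
        # or 'v_structure' (A->B<-C); B is the middle node of the triple.
        if kind in ('cascade', 'common_parent'):
            return not b_observed
        if kind == 'v_structure':
            return b_observed or b_descendant_observed
        raise ValueError(kind)

    # A path is active iff every triple along it is active:
    triples = [('cascade', False), ('v_structure', True)]
    print(all(triple_active(k, obs) for k, obs in triples))  # True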

―Image credits: CS188, UC Berkeley. 65 / 72


Example

L ⊥ T′ ∣ T ?

L ⊥ B ?

L ⊥ B ∣ T ?

L ⊥ B ∣ T′ ?

L ⊥ B ∣ T, R ?

66 / 72


Causality?

When the network reflects the true causal patterns:

Often more compact (nodes have fewer parents).

Often easier to think about.

Often easier to elicit from experts.

But, Bayesian networks need not be causal.

Sometimes no causal network exists over the domain (e.g., if variables are missing).

Edges reflect correlation, not causation.

What do the edges really mean then?

Topology may happen to encode causal structure.

Topology really encodes conditional independence.

―Image credits: CS188, UC Berkeley. 67 / 72


Correlation does not imply causation.

Causes cannot be expressed in the language of probability theory.

Judea Pearl

68 / 72


Philosophers have tried to define causation in terms of probability:

X = x causes Y = y if X = x raises the probability of Y = y.

However, the inequality

P(y∣x) > P(y)

fails to capture the intuition behind "probability raising", which is fundamentally a causal concept connoting a causal influence of X = x over Y = y.

69 / 72


The correct formulation should read

P(y ∣ do(X = x)) > P(y),

where do(X = x) stands for an external intervention in which X is set to the value x instead of being observed.

70 / 72


Observing vs. intervening

The reading on a barometer is useful to predict rain:

P(rain ∣ Barometer = high) > P(rain ∣ Barometer = low)

But hacking a barometer will not cause rain!

P(rain ∣ Barometer hacked to high) = P(rain ∣ Barometer hacked to low)

71 / 72


Summary

Uncertainty arises because of laziness and ignorance. It is inescapable in complex non-deterministic or partially observable environments.

Probabilistic reasoning provides a framework for managing our knowledge and beliefs, with the Bayes' rule acting as the workhorse for inference.

A Bayesian network specifies a full joint distribution. Bayesian networks are often exponentially smaller than an explicitly enumerated joint distribution.

The structure of a Bayesian network encodes conditional independence assumptions between random variables.

72 / 72


The end.

72 / 72