CS 541: Artificial Intelligence
Lecture VI: Modeling Uncertainty and Bayesian Network
Schedule
CA: Yizhou Lin Email Address: [email protected]
Week | Date | Topic | Reading | Homework**
1 | 08/29/2012 | Introduction & Intelligent Agent | Ch 1 & 2 | N/A
2 | 09/05/2012 | Search: search strategy and heuristic search | Ch 3 & 4s | HW1 (Search)
3 | 09/12/2012 | Search: Constraint Satisfaction & Adversarial Search | Ch 4s & 5 & 6 | Teaming Due
4 | 09/19/2012 | Logic: Logic Agent & First Order Logic | Ch 7 & 8s | HW1 due, Midterm Project (Game)
5 | 09/26/2012 | Logic: Inference on First Order Logic | Ch 8s & 9 |
6 | 10/03/2012 | No class | |
7 | 10/10/2012 | Uncertainty and Bayesian Network | Ch 13 & 14s | HW2 (Logic)
8 | 10/17/2012 | Midterm Presentation | | Midterm Project Due
9 | 10/24/2012 | Inference in Bayesian Network | Ch 14s | HW2 Due, HW3 (Probabilistic Reasoning)
10 | 10/31/2012 | Probabilistic Reasoning over Time | Ch 15 |
11 | 11/07/2012 | Machine Learning | | HW3 due
12 | 11/14/2012 | Markov Decision Process | Ch 18 & 20 | HW4 (Probabilistic Reasoning Over Time)
13 | 11/21/2012 | No class | Ch 16 |
14 | 11/29/2012 | Reinforcement learning | Ch 21 | HW4 due
15 | 12/05/2012 | Final Project Presentation | | Final Project Due
Re-cap of Lecture IV – Part I
Inference in first-order logic:
Reducing first-order inference to propositional inference; Unification; Generalized Modus Ponens; Forward and backward chaining; Logic programming; Resolution: rules to transform to CNF
Modeling Uncertainty
Lecture VI—Part I
Outline Uncertainty Probability Syntax and Semantics Inference Independence and Bayes’ Rule
Uncertainty
Uncertainty
Let action A_t = leave for airport t minutes before the flight. Will A_t get me there on time? Problems:
1) Partial observability (road state, other drivers' plans, etc.)
2) Noisy sensors (KCBS traffic reports)
3) Uncertainty in action outcomes (flat tire, etc.)
4) Immense complexity of modeling and predicting traffic
Hence a purely logical approach either
1) Risks falsehood: "A_25 will get me there on time", or
2) Leads to conclusions that are too weak for decision making: "A_25 will get me there on time if there's no accident on the bridge, and it doesn't rain, and my tires remain intact, etc."
A_1440 might reasonably be said to get me there on time, but I'd have to stay overnight in the airport …
Methods for handling uncertainty
Default or nonmonotonic logic:
Assume my car does not have a flat tire
Assume A_25 works unless contradicted by evidence
Issues: What assumptions are reasonable? How to handle contradiction?
Rules with fudge factors:
Issues: Problems with combination, e.g., does Sprinkler cause Rain??
Probability
Given the available evidence, A_25 will get me there on time with probability 0.04
Mahaviracarya (9th century), Cardano (1565): theory of gambling
Fuzzy logic handles degree of truth, NOT uncertainty, e.g.,
WetGrass is true to degree 0.2
Probability
Probability
Probabilistic assertions summarize effects of
Laziness: failure to enumerate exceptions, qualifications, etc.
Ignorance: lack of relevant facts, initial conditions, etc.
Subjective or Bayesian probability: probabilities relate propositions to one's own state of knowledge
E.g., P(A_25 | no reported accidents) = 0.06
These are not claims of a "probabilistic tendency" in the current situation (but might be learned from past experience of similar situations)
Probabilities of propositions change with new evidence:
E.g., P(A_25 | no reported accidents, 5 a.m.) = 0.15
(Analogous to logical entailment status, not truth.)
Making decisions under uncertainty
Suppose I believe the following:
P(A_25 gets me there on time | …) = 0.04
P(A_90 gets me there on time | …) = 0.70
P(A_120 gets me there on time | …) = 0.95
P(A_1440 gets me there on time | …) = 0.9999
Which action to choose? Depends on my preferences for missing the flight vs. airport cuisine, etc.
Utility theory is used to represent and infer preferences
Decision theory = utility theory + probability theory
Probability basics
Begin with a set Ω, the sample space
E.g., 6 possible rolls of a die. ω ∈ Ω is a sample point/possible world/atomic event
A probability space or probability model is a sample space with an assignment P(ω) for every ω ∈ Ω s.t. 0 ≤ P(ω) ≤ 1 and Σ_ω P(ω) = 1, e.g., P(1) = P(2) = … = P(6) = 1/6
An event A is any subset of Ω: P(A) = Σ_{ω ∈ A} P(ω)
E.g., P(die roll < 4) = P(1) + P(2) + P(3) = 1/6 + 1/6 + 1/6 = 1/2
Random variables
A random variable is a function from sample points to some range, e.g., the reals or Booleans
E.g., Odd(1) = true
P induces a probability distribution for any r.v. X:
P(X = x_i) = Σ_{ω: X(ω) = x_i} P(ω)
e.g., P(Odd = true) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2
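To make the induced-distribution formula concrete, here is a minimal Python sketch (not from the slides) using the die example; the function names are ours:

```python
# Minimal sketch: the distribution P induces for a random variable,
# using the six-sided die and the Odd r.v. from the slide.
P = {w: 1/6 for w in range(1, 7)}   # sample space: six equiprobable rolls

def odd(w):
    """Random variable Odd: maps a sample point to a Boolean."""
    return w % 2 == 1

def induced(P, X, x):
    """P(X = x) = sum of P(w) over sample points w with X(w) = x."""
    return sum(p for w, p in P.items() if X(w) == x)

print(induced(P, odd, True))   # 0.5, i.e., P(1) + P(3) + P(5)
```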
Propositions
Think of a proposition as the event (set of sample points) where the proposition is true
Given Boolean random variables A and B:
event a = set of sample points where A(ω) = true
event ¬a = set of sample points where A(ω) = false
event a ∧ b = points where A(ω) = true and B(ω) = true
Often in AI applications, the sample points are defined by the values of a set of random variables, i.e., the sample space is the Cartesian product of the ranges of the variables
With Boolean variables, sample point = propositional logic model
E.g., A = true, B = false, or a ∧ ¬b
Proposition = disjunction of atomic events in which it is true
E.g., (a ∨ b) ≡ (¬a ∧ b) ∨ (a ∧ ¬b) ∨ (a ∧ b), so P(a ∨ b) = P(¬a ∧ b) + P(a ∧ ¬b) + P(a ∧ b)
Why use probability
The definitions imply that certain logically related events must have related probabilities
E.g., P(a ∨ b) = P(a) + P(b) − P(a ∧ b)
de Finetti (1931): an agent who bets according to probabilities that violate these axioms can be forced to bet so as to lose money regardless of outcome.
Syntax for propositions
Propositional or Boolean random variables
E.g., Cavity (do I have a cavity?). Cavity = true is a proposition, also written cavity
Discrete random variables (finite or infinite)
E.g., Weather is one of ⟨sunny, rain, cloudy, snow⟩; Weather = rain is a proposition
Values must be exhaustive and mutually exclusive
Continuous random variables (bounded or unbounded)
E.g., Temp = 21.6; also allow, e.g., Temp < 22.0
Arbitrary Boolean combinations of basic propositions
Prior probability
Prior or unconditional probabilities of propositions
E.g., P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72 correspond to belief prior to arrival of any (new) evidence
Probability distribution gives values for all possible assignments:
P(Weather) = ⟨0.72, 0.1, 0.08, 0.1⟩ (normalized, i.e., sums to 1)
Joint probability distribution for a set of r.v.s gives the probability of every atomic event on those r.v.s (i.e., every sample point)
P(Weather, Cavity) = a 4 × 2 matrix of values
Every question about a domain can be answered by the joint distribution because every event is a sum of sample points
Probability for continuous variables
Express distribution as a parameterized function of value:
P(X = x) = U[18, 26](x) = uniform density between 18 and 26
Here P is a density; it integrates to 1
I.e., P(X = 20.5) = 0.125 really means lim_{dx→0} P(20.5 ≤ X ≤ 20.5 + dx) / dx = 0.125
Gaussian density
P(x) = (1 / (√(2π) σ)) e^{−(x−μ)² / (2σ²)}
Conditional probability
Conditional or posterior probabilities
E.g., P(cavity | toothache) = 0.8, i.e., given that toothache is all I know
NOT "if toothache then 80% chance of cavity"
Notation for conditional distributions:
P(Cavity | Toothache) = 2-element vector of 2-element vectors
If we know more, e.g., cavity is also given, then we have P(cavity | toothache, cavity) = 1
Note: the less specific belief remains valid after more evidence arrives, but is not always useful
New evidence may be irrelevant, allowing simplification, e.g., P(cavity | toothache, 49ersWin) = P(cavity | toothache) = 0.8
This kind of inference, sanctioned by domain knowledge, is crucial
Conditional probability
Definition of conditional probability:
P(a | b) = P(a ∧ b) / P(b) if P(b) ≠ 0
Product rule gives an alternative formulation:
P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
A general version holds for whole distributions, e.g.,
P(Weather, Cavity) = P(Weather | Cavity) P(Cavity)
(View as a set of 4 × 2 equations, not matrix mult.)
Chain rule is derived by successive application of product rule:
P(X_1, …, X_n) = P(X_1, …, X_{n−1}) P(X_n | X_1, …, X_{n−1}) = Π_{i=1}^{n} P(X_i | X_1, …, X_{i−1})
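As a quick numeric sanity check (not part of the original slides), the product rule can be verified on any small joint; the four entries below are illustrative values only:

```python
# Verify the product rule P(a ^ b) = P(a | b) P(b) on a tiny Boolean joint.
# The joint entries are made-up numbers that sum to 1.
joint = {(True, True): 0.12, (True, False): 0.08,
         (False, True): 0.28, (False, False): 0.52}

p_b = sum(p for (a, b), p in joint.items() if b)   # marginal P(b)
p_ab = joint[(True, True)]                          # joint P(a ^ b)
p_a_given_b = p_ab / p_b                            # definition of conditional
assert abs(p_a_given_b * p_b - p_ab) < 1e-12        # product rule holds
print(p_a_given_b)                                  # 0.3
```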
Inference
Inference by enumeration
Start with the joint distribution (the Toothache, Catch, Cavity example)
For any proposition φ, sum the atomic events where it is true:
P(φ) = Σ_{ω: ω ⊨ φ} P(ω)
E.g., P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
E.g., P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
Normalization
P(Cavity | toothache) = α P(Cavity, toothache) = α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)] = α [⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩] = α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩
General idea: compute distribution on query variable by fixing evidence variables and summing over hidden variables
Inference by enumeration
Let X be all the variables. Typically, we want
the posterior joint distribution of the query variables Y, given specific values e for the evidence variables E
Let the hidden variables be H = X − Y − E
Then the required summation of joint entries is done by summing out the hidden variables:
P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)
The terms in the summation are joint entries because Y, E, and H together exhaust the set of random variables
Obvious problems:
1) Worst-case time complexity O(d^n), where d is the largest arity
2) Space complexity O(d^n) to store the joint distribution
3) How to find the numbers for O(d^n) entries???
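The enumerate-and-normalize procedure is easy to express directly. A sketch below answers P(Cavity | toothache) over the dentistry joint; the eight entries are the standard textbook (AIMA) values:

```python
# Sketch of inference by enumeration over the dentistry joint distribution.
joint = {  # (toothache, catch, cavity) -> probability
    (True,  True,  True):  0.108, (True,  True,  False): 0.016,
    (True,  False, True):  0.012, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, True,  False): 0.144,
    (False, False, True):  0.008, (False, False, False): 0.576,
}

def query_cavity(toothache):
    """P(Cavity | Toothache = toothache): fix evidence, sum out hidden Catch."""
    dist = {}
    for cavity in (True, False):
        dist[cavity] = sum(joint[(toothache, catch, cavity)]
                           for catch in (True, False))  # sum over hidden var
    z = sum(dist.values())                               # normalization (alpha)
    return {v: p / z for v, p in dist.items()}

print(query_cavity(True))   # {True: 0.6, False: 0.4}
```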
Independence
Independence
A and B are independent iff P(A | B) = P(A), or P(B | A) = P(B), or P(A, B) = P(A) P(B)
E.g., P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
32 entries reduced to 12; for n independent biased coins, 2^n → n
Absolute independence is powerful but rare
Dentistry is a large field with hundreds of variables, none of which are independent. What to do?
Conditional independence
P(Toothache, Cavity, Catch) has 2³ − 1 = 7 independent entries
If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
(1) P(catch | toothache, cavity) = P(catch | cavity)
The same independence holds if I haven't got a cavity:
(2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
Catch is conditionally independent of Toothache given Cavity:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
Conditional independence
Write out full joint distribution using chain rule:
P(Toothache, Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
I.e., 2 + 2 + 1 = 5 independent numbers (equations 1 and 2 remove 2)
In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n.
Conditional independence is our most basic and robust form of knowledge about uncertain environments.
Bayes’ Rule
Bayes' Rule
Product rule P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
⇒ Bayes' rule P(a | b) = P(b | a) P(a) / P(b)
Or in distribution form: P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)
Useful for assessing diagnostic probability from causal probability:
P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
E.g., let M be meningitis, S be stiff neck:
P(m | s) = P(s | m) P(m) / P(s) = (0.8 × 0.0001) / 0.1 = 0.0008
Note: posterior probability of meningitis is still very small!
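The diagnostic computation is one line of code; the sketch below uses the slide's meningitis example (textbook illustrative values):

```python
# Diagnostic inference from causal numbers via Bayes' rule.
p_s_given_m = 0.8     # causal: P(stiff neck | meningitis)
p_m = 0.0001          # prior:  P(meningitis)
p_s = 0.1             # evidence prior: P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s   # Bayes' rule
print(p_m_given_s)                       # 0.0008, still very small
```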
Bayes' Rule and conditional independence
P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity) = α P(toothache | Cavity) P(catch | Cavity) P(Cavity)
This is an example of a naive Bayes model:
P(Cause, Effect_1, …, Effect_n) = P(Cause) Π_i P(Effect_i | Cause)
Total number of parameters is linear in n
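A minimal naive Bayes sketch follows; the prior and CPT numbers are illustrative assumptions, not values from the slides:

```python
# Naive Bayes: P(Cause | e1..en) is proportional to P(Cause) * prod_i P(ei | Cause).
priors = {"cavity": 0.2, "no_cavity": 0.8}
cpt = {  # P(effect | cause), one entry per observed effect (illustrative)
    "cavity":    {"toothache": 0.6, "catch": 0.9},
    "no_cavity": {"toothache": 0.1, "catch": 0.2},
}

def posterior(evidence):
    scores = {}
    for cause, prior in priors.items():
        score = prior
        for e in evidence:            # effects conditionally independent given cause
            score *= cpt[cause][e]
        scores[cause] = score
    z = sum(scores.values())          # normalize (alpha)
    return {c: s / z for c, s in scores.items()}

print(posterior(["toothache", "catch"]))   # cavity ~0.87
```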
Wumpus world
P_{ij} = true iff [i, j] contains a pit
B_{ij} = true iff [i, j] is breezy
Include only B_{1,1}, B_{1,2}, B_{2,1} in the probability model
The full joint distribution is P(P_{1,1}, …, P_{4,4}, B_{1,1}, B_{1,2}, B_{2,1})
Apply product rule: P(B_{1,1}, B_{1,2}, B_{2,1} | P_{1,1}, …, P_{4,4}) P(P_{1,1}, …, P_{4,4})
(Do it this way to get P(Effect | Cause).)
First term: 1 if pits are adjacent to breezes, 0 otherwise
Second term: pits are placed randomly, probability 0.2 per square:
P(P_{1,1}, …, P_{4,4}) = Π_{i,j} P(P_{ij}) = 0.2^n × 0.8^{16−n} for n pits
Specifying the probability model
We know the following facts:
b = ¬b_{1,1} ∧ b_{1,2} ∧ b_{2,1}
known = ¬p_{1,1} ∧ ¬p_{1,2} ∧ ¬p_{2,1}
Query is P(P_{1,3} | known, b)
Define Unknown = the P_{ij}s other than P_{1,3} and Known
For inference by enumeration, we have
P(P_{1,3} | known, b) = α Σ_{unknown} P(P_{1,3}, unknown, known, b)
Grows exponentially with number of squares!
Observation and query
Using conditional independence
Basic insight: observations are conditionally independent of other hidden squares given neighboring hidden squares
Define Unknown = Fringe ∪ Other; then P(b | P_{1,3}, Known, Unknown) = P(b | P_{1,3}, Known, Fringe)
Manipulate query into a form where we can use this: sum over fringe configurations only, weighting each by its prior
This yields P(P_{1,3} | known, b) ≈ ⟨0.31, 0.69⟩ and P(P_{2,2} | known, b) ≈ ⟨0.86, 0.14⟩, as the sketch below reproduces
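A brute-force sketch, assuming the standard setup (pit prior 0.2; breezes observed in [1,2] and [2,1], none in [1,1]; fringe squares [1,3], [2,2], [3,1]); square names follow the slides:

```python
# Wumpus query using the fringe insight: breezes depend only on pits
# adjacent to visited squares, so enumerating the 3 fringe squares suffices.
from itertools import product

def prior(*pits):
    """Prior of a pit configuration: 0.2 per pit, 0.8 per pit-free square."""
    return 0.2 ** sum(pits) * 0.8 ** (len(pits) - sum(pits))

def consistent(p13, p22, p31):
    b12 = p13 or p22   # breeze at [1,2] iff a pit in [1,3] or [2,2] (others known clear)
    b21 = p22 or p31   # breeze at [2,1] iff a pit in [2,2] or [3,1]
    return b12 and b21 # both observed breezes must be explained

weights = {True: 0.0, False: 0.0}
for p13, p22, p31 in product((True, False), repeat=3):
    if consistent(p13, p22, p31):
        weights[p13] += prior(p13, p22, p31)

z = weights[True] + weights[False]
print(weights[True] / z)   # ~0.31, the classic result for P(p13 | known, b)
```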
Summary
Probability is a rigorous formalism for uncertain knowledge
Joint probability distribution specifies probability of every atomic event
Queries can be answered by summing over atomic events
For nontrivial domains, we must find a way to reduce the joint size
Independence and conditional independence provide the tools
Bayesian Network
Lecture VI—Part II
Outline Syntax Semantics Parameterized distributions
Bayesian networks
A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions
Syntax:
A set of nodes, one per variable
A directed, acyclic graph (link ≈ "directly influences")
A conditional distribution for each node given its parents: P(X_i | Parents(X_i))
In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over X_i for each combination of parent values
Examples
Topology of network encodes conditional independence assertions:
Weather is independent of the other variables
Toothache and Catch are conditionally independent given Cavity
Example
I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects "causal" knowledge:
A burglar can set the alarm off
An earthquake can set the alarm off
The alarm can cause Mary to call
The alarm can cause John to call
Example
Compactness
A CPT for Boolean X_i with k Boolean parents has 2^k rows for the combinations of parent values
Each row requires one number p for P(X_i = true | parent values)
(The number for X_i = false is just 1 − p)
If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers
I.e., grows linearly with n, vs. O(2^n) for the full joint distribution
For burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)
Global semantics
Global semantics define the full joint distribution as the product of the local conditional distributions:
P(x_1, …, x_n) = Π_{i=1}^{n} P(x_i | parents(X_i))
e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e) ≈ 0.00063
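The product of local CPT entries is directly computable; the sketch below uses the familiar textbook burglary-network numbers:

```python
# Global semantics as code: multiply local CPT entries along the network.
P_b, P_e = 0.001, 0.002                  # P(Burglary), P(Earthquake)
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_j = {True: 0.90, False: 0.05}          # P(JohnCalls | Alarm)
P_m = {True: 0.70, False: 0.01}          # P(MaryCalls | Alarm)

# P(j, m, a, not-b, not-e) = P(j|a) P(m|a) P(a|~b,~e) P(~b) P(~e)
p = P_j[True] * P_m[True] * P_a[(False, False)] * (1 - P_b) * (1 - P_e)
print(p)   # ~0.000628
```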
Local semantics
Local semantics: each node is conditionally independent of its nondescendants given its parents
Theorem: local semantics ⇔ global semantics
Markov blanket Each node is conditionally independent of all
others given its Markov blanket: parents + children + children's
parents
Constructing Bayesian networks
Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics
1. Choose an ordering of variables X_1, …, X_n
2. For i = 1 to n:
add X_i to the network
select parents from X_1, …, X_{i−1} such that P(X_i | Parents(X_i)) = P(X_i | X_1, …, X_{i−1})
This choice of parents guarantees the global semantics:
P(X_1, …, X_n) = Π_{i=1}^{n} P(X_i | X_1, …, X_{i−1}) (chain rule)
= Π_{i=1}^{n} P(X_i | Parents(X_i)) (by construction)
Example
Suppose we choose the ordering M, J, A, B, E
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes
P(B | A, J, M) = P(B)? No
P(E | B, A, J, M) = P(E | A)? No
P(E | B, A, J, M) = P(E | A, B)? Yes
Deciding conditional independence is hard in noncausal directions
(Causal models and conditional independence seem hardwired for humans!)
Assessing conditional probabilities is hard in noncausal directions
Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed
Example: Car diagnosis Initial evidence: car won't start Testable variables (green), "broken, so fix it" variables (orange) Hidden variables (gray) ensure sparse structure, reduce parameters
Example: Car insurance
Compact conditional distributions
CPT grows exponentially with number of parents
CPT becomes infinite with continuous-valued parent or child
Solution: canonical distributions that are defined compactly
Deterministic nodes are the simplest case:
X = f(Parents(X)) for some function f
E.g., Boolean functions: NorthAmerican ⇔ Canadian ∨ US ∨ Mexican
E.g., numerical relationships among continuous variables: ∂Level/∂t = inflow + precipitation − outflow − evaporation
Compact conditional distributions
Noisy-OR distributions model multiple non-interacting causes
1) Parents U_1, …, U_k include all causes (can add leak node)
2) Independent failure probability q_i for each cause alone:
P(X | U_1 … U_j, ¬U_{j+1} … ¬U_k) = 1 − Π_{i=1}^{j} q_i
Number of parameters linear in number of parents
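A short noisy-OR sketch, using the textbook fever example's illustrative failure probabilities:

```python
# Noisy-OR: each present cause independently fails to produce the effect
# with probability q_i, so P(not-effect | present causes) = product of q_i.
q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}   # illustrative textbook values

def p_effect(present_causes):
    p_not = 1.0
    for cause in present_causes:
        p_not *= q[cause]          # independent failure per present cause
    return 1.0 - p_not

print(p_effect(["cold", "flu"]))             # 1 - 0.6*0.2 = 0.88
print(p_effect(["cold", "flu", "malaria"]))  # 1 - 0.012  = 0.988
```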
Hybrid (discrete + continuous) networks
Discrete (Subsidy? and Buys?); continuous (Harvest and Cost)
Option 1: discretization (possibly large errors, large CPTs)
Option 2: finitely parameterized canonical families:
1) Continuous variable, discrete + continuous parents (e.g., Cost)
2) Discrete variable, continuous parents (e.g., Buys?)
Continuous child variables
Need one conditional density function for the child variable given continuous parents, for each possible assignment to discrete parents
Most common is the linear Gaussian model, e.g.:
P(Cost = c | Harvest = h, Subsidy? = true) = N(a_t h + b_t, σ_t)(c)
Mean Cost varies linearly with h, variance is fixed
Linear variation is unreasonable over the full range but works OK if the likely range of Harvest is narrow
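A linear Gaussian density is just a Gaussian whose mean is a linear function of the parent; the parameters a, b, sigma below are made up for illustration:

```python
# Linear Gaussian sketch: P(Cost = c | Harvest = h) with mean linear in h.
import math

def linear_gaussian(c, h, a=-1.0, b=10.0, sigma=1.0):
    mu = a * h + b                         # mean is linear in the parent
    return (1.0 / (sigma * math.sqrt(2 * math.pi))) * \
           math.exp(-((c - mu) ** 2) / (2 * sigma ** 2))

print(linear_gaussian(c=7.0, h=3.0))       # density at the mean: ~0.399
```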
Continuous child variables
All-continuous network with LG distributions ⇒ full joint distribution is a multivariate Gaussian
Discrete + continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values
Discrete variable w/ continuous parents
Probability of Buys? given Cost should be a "soft" threshold:
Probit distribution uses integral of Gaussian:
Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt
P(Buys? = true | Cost = c) = Φ((−c + μ)/σ)
Why the probit 1. It's sort of the right shape 2. Can view as hard threshold whose location is subject to noise
Discrete variable
Sigmoid (or logit) distribution also used in neural networks:
P(Buys? = true | Cost = c) = 1 / (1 + exp(−2(−c + μ)/σ))
Sigmoid has similar shape to probit but much longer tails
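The two soft thresholds are easy to compare numerically; mu and sigma below are illustrative assumptions:

```python
# Compare probit (Gaussian CDF) and logistic soft thresholds for
# P(Buys? = true | Cost = c).
import math

mu, sigma = 10.0, 2.0   # illustrative threshold location and width

def probit(c):
    """Integral of the standard Gaussian up to (mu - c) / sigma."""
    return 0.5 * (1.0 + math.erf((mu - c) / (sigma * math.sqrt(2))))

def sigmoid(c):
    """Logistic alternative: similar shape, longer tails."""
    return 1.0 / (1.0 + math.exp(-(mu - c) / sigma))

for c in (6.0, 10.0, 14.0):
    print(c, round(probit(c), 3), round(sigmoid(c), 3))
```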
Summary
Bayes nets provide a natural representation for (causally induced) conditional independence
Topology + CPTs = compact representation of joint distribution
Generally easy for (non)experts to construct
Canonical distributions (e.g., noisy-OR) = compact representation of CPTs
Continuous variables ⇒ parameterized distributions (e.g., linear Gaussian)