CS 541: Artificial Intelligence Lecture VI: Modeling Uncertainty and Bayesian Network


Page 1: CS 541: Artificial Intelligence

CS 541: Artificial Intelligence

Lecture VI: Modeling Uncertainty and Bayesian Network

Page 2: CS 541: Artificial Intelligence

Schedule

CA: Yizhou Lin Email Address: [email protected]

Week | Date | Topic | Reading | Homework

1 | 08/29/2012 | Introduction & Intelligent Agent | Ch 1 & 2 | N/A

2 | 09/05/2012 | Search: search strategy and heuristic search | Ch 3 & 4s | HW1 (Search)

3 | 09/12/2012 | Search: Constraint Satisfaction & Adversarial Search | Ch 4s & 5 & 6 | Teaming Due

4 | 09/19/2012 | Logic: Logic Agent & First Order Logic | Ch 7 & 8s | HW1 due, Midterm Project (Game)

5 | 09/26/2012 | Logic: Inference on First Order Logic | Ch 8s & 9 |

6 | 10/03/2012 | No class | |

7 | 10/10/2012 | Uncertainty and Bayesian Network | Ch 13 & 14s | HW2 (Logic)

8 | 10/17/2012 | Midterm Presentation | | Midterm Project Due

9 | 10/24/2012 | Inference in Bayesian Network | Ch 14s | HW2 Due, HW3 (Probabilistic Reasoning)

10 | 10/31/2012 | Probabilistic Reasoning over Time | Ch 15 |

11 | 11/07/2012 | Machine Learning | | HW3 due

12 | 11/14/2012 | Markov Decision Process | Ch 18 & 20 | HW4 (Probabilistic Reasoning over Time)

13 | 11/21/2012 | No class | Ch 16 |

14 | 11/29/2012 | Reinforcement learning | Ch 21 | HW4 due

15 | 12/05/2012 | Final Project Presentation | | Final Project Due

Page 3: CS 541: Artificial Intelligence

Re-cap of Lecture IV – Part I

Inference in first-order logic:
  - Reducing first-order inference to propositional inference
  - Unification
  - Generalized Modus Ponens
  - Forward and backward chaining
  - Logic programming
  - Resolution: rules to transform to CNF

Page 4: CS 541: Artificial Intelligence

Modeling Uncertainty

Lecture VI—Part I

Page 5: CS 541: Artificial Intelligence

Outline
  - Uncertainty
  - Probability
  - Syntax and Semantics
  - Inference
  - Independence and Bayes’ Rule

Page 6: CS 541: Artificial Intelligence

Uncertainty

Page 7: CS 541: Artificial Intelligence

Uncertainty

Let action A_t = leave for airport t minutes before the flight. Will A_t get me there on time?

Problems:
  1) Partial observability (road state, other drivers' plans, etc.)
  2) Noisy sensors (KCBS traffic reports)
  3) Uncertainty in action outcomes (flat tire, etc.)
  4) Immense complexity of modelling and predicting traffic

Hence a purely logical approach either
  1) Risks falsehood: "A_25 will get me there on time", or
  2) Leads to conclusions that are too weak for decision making: "A_25 will get me there on time if there's no accident on the bridge, and it doesn't rain, and my tires remain intact, etc."

(A_1440 might reasonably be said to get me there on time, but I'd have to stay overnight in the airport...)

Page 8: CS 541: Artificial Intelligence

Methods for handling uncertainty

Default or nonmonotonic logic:
  - Assume my car does not have a flat tire
  - Assume A_25 works unless contradicted by evidence
  - Issues: What assumptions are reasonable? How to handle contradiction?

Rules with fudge factors:
  - Issues: problems with combination, e.g., does Sprinkler cause Rain?

Probability:
  - Given the available evidence, A_25 will get me there on time with probability 0.04
  - Mahaviracarya (9th century), Cardano (1565): theory of gambling

Fuzzy logic handles degree of truth, NOT uncertainty; e.g., WetGrass is true to degree 0.2

Page 9: CS 541: Artificial Intelligence

Probability

Page 10: CS 541: Artificial Intelligence

Probability

Probabilistic assertions summarize the effects of:
  - Laziness: failure to enumerate exceptions, qualifications, etc.
  - Ignorance: lack of relevant facts, initial conditions, etc.

Subjective or Bayesian probability: probabilities relate propositions to one's own state of knowledge.

These are not claims of a "probabilistic tendency" in the current situation (but might be learned from past experience of similar situations).

Probabilities of propositions change with new evidence.

(Analogous to logical entailment status, not truth.)

Page 11: CS 541: Artificial Intelligence

Making decisions under uncertainty

Suppose I believe the following:

Which action to choose?
  - Depends on my preferences for missing the flight vs. airport cuisine, etc.
  - Utility theory is used to represent and infer preferences
  - Decision theory = utility theory + probability theory
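To make the last point concrete, here is a minimal Python sketch of choosing the action with the highest expected utility. The action set, probabilities, and utilities are illustrative assumptions, not values from the lecture.

```python
# Minimal expected-utility sketch (illustrative numbers only, not from the lecture).
# Each departure action A_t has an assumed probability of catching the flight and a
# cost for time wasted waiting at the airport.

actions = {
    "A30":  {"p_on_time": 0.60, "wait_cost": 0},
    "A90":  {"p_on_time": 0.95, "wait_cost": 30},
    "A240": {"p_on_time": 0.99, "wait_cost": 150},
}

U_CATCH = 1000   # utility of catching the flight (assumed)
U_MISS = 0       # utility of missing it (assumed)

def expected_utility(a):
    p = actions[a]["p_on_time"]
    return p * U_CATCH + (1 - p) * U_MISS - actions[a]["wait_cost"]

best = max(actions, key=expected_utility)
for a in actions:
    print(a, round(expected_utility(a), 1))
print("Best action:", best)
```

With these placeholder numbers the moderately early departure wins; changing the utilities (e.g., a much higher cost for missing the flight) changes the decision, which is exactly the probability-plus-utility trade-off the slide describes.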

Page 12: CS 541: Artificial Intelligence

Probability basics

Begin with a set Ω, the sample space.
  E.g., the 6 possible rolls of a die; ω ∈ Ω is a sample point / possible world / atomic event.

A probability space or probability model is a sample space with an assignment P(ω) for every ω ∈ Ω such that 0 ≤ P(ω) ≤ 1 and Σ_ω P(ω) = 1, e.g., P(1) = P(2) = ... = P(6) = 1/6 for a fair die.

An event A is any subset of Ω, with P(A) = Σ_{ω ∈ A} P(ω).
  E.g., P(die roll < 4) = P(1) + P(2) + P(3) = 1/2.
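A minimal Python sketch of these definitions, assuming the fair six-sided die mentioned above: a sample space, an assignment P(ω) satisfying the axioms, and the probability of an event as a sum over its sample points.

```python
from fractions import Fraction

# Sample space: the 6 possible rolls of a die (from the slide).
omega = [1, 2, 3, 4, 5, 6]

# Probability model: an assignment P(w) for every sample point,
# here the uniform assignment 1/6 (assumed fair die).
P = {w: Fraction(1, 6) for w in omega}

assert sum(P.values()) == 1                    # axiom: probabilities sum to 1
assert all(0 <= p <= 1 for p in P.values())    # axiom: each P(w) lies in [0, 1]

def prob(event):
    """P(A) = sum of P(w) over the sample points in event A."""
    return sum(P[w] for w in event)

# An event is any subset of the sample space, e.g. "roll less than 4".
print(prob({w for w in omega if w < 4}))   # -> 1/2
```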

Page 13: CS 541: Artificial Intelligence

Random variables

A random variable is a function from sample points to some range, e.g., the reals or Booleans.
  E.g., Odd(1) = true.

P induces a probability distribution for any random variable X:
  P(X = x_i) = Σ_{ω : X(ω) = x_i} P(ω)
  E.g., P(Odd = true) = P(1) + P(3) + P(5) = 1/2.

Page 14: CS 541: Artificial Intelligence

Propositions

Think of a proposition as the event (the set of sample points) where the proposition is true.

Given Boolean random variables A and B:
  - event a = set of sample points where A(ω) = true
  - event ¬a = set of sample points where A(ω) = false
  - event a ∧ b = points where A(ω) = true and B(ω) = true

Often in AI applications, the sample points are defined by the values of a set of random variables, i.e., the sample space is the Cartesian product of the ranges of the variables.

With Boolean variables, sample point = propositional logic model.
  E.g., A = true, B = false, or a ∧ ¬b.

Proposition = disjunction of the atomic events in which it is true.
  E.g., (a ∨ b) ≡ (¬a ∧ b) ∨ (a ∧ ¬b) ∨ (a ∧ b).

Page 15: CS 541: Artificial Intelligence

Why use probability

The definitions imply that certain logically related events must have related probabilities.
  E.g., P(a ∨ b) = P(a) + P(b) − P(a ∧ b).

de Finetti (1931): an agent who bets according to probabilities that violate these axioms can be forced to bet so as to lose money regardless of the outcome.

Page 16: CS 541: Artificial Intelligence

Syntax for propositions

Page 17: CS 541: Artificial Intelligence

Syntax for propositions

Propositional or Boolean random variables
  E.g., Cavity (do I have a cavity?); Cavity = true is a proposition, also written cavity.

Discrete random variables (finite or infinite)
  E.g., Weather is one of {sunny, rain, cloudy, snow}; Weather = rain is a proposition.
  Values must be exhaustive and mutually exclusive.

Continuous random variables (bounded or unbounded)
  E.g., Temp = 21.6; we also allow, e.g., Temp < 22.0.

Arbitrary Boolean combinations of basic propositions

Page 18: CS 541: Artificial Intelligence

Prior probability

Prior or unconditional probabilities of propositions correspond to belief prior to the arrival of any (new) evidence.

A probability distribution gives values for all possible assignments (normalized, i.e., it sums to 1).

A joint probability distribution for a set of random variables gives the probability of every atomic event on those variables (i.e., every sample point), laid out as a matrix of values.

Every question about a domain can be answered by the joint distribution, because every event is a sum of sample points.

Page 19: CS 541: Artificial Intelligence

Probability for continuous variables

Express the distribution as a parameterized function of value, e.g., P(X = x) = U[a, b](x), the uniform density between a and b.

Here P is a density; it integrates to 1. I.e., a density value P(X = x) really means lim_{dx→0} P(x ≤ X ≤ x + dx) / dx.

Page 20: CS 541: Artificial Intelligence

Gaussian density

P(x) = 1 / (√(2π) σ) · e^(−(x − μ)² / (2σ²))
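A quick numerical check of this density in Python, using assumed parameters μ = 0 and σ = 1; the crude integration should come out close to 1.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """P(x) = 1/(sqrt(2*pi)*sigma) * exp(-(x - mu)**2 / (2*sigma**2))"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Crude numerical integration over [-10, 10]; should be close to 1.
dx = 0.001
total = sum(gaussian_pdf(-10 + i * dx) * dx for i in range(int(20 / dx)))
print(gaussian_pdf(0.0))   # peak density, about 0.3989
print(round(total, 4))     # about 1.0
```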

Page 21: CS 541: Artificial Intelligence

Conditional probability

Conditional or posterior probabilities
  E.g., P(cavity | toothache) = 0.8, i.e., given that toothache is all I know.
  NOT "if toothache then 80% chance of cavity".

Notation for conditional distributions:
  P(Cavity | Toothache) = a 2-element vector of 2-element vectors.

If we know more, e.g., cavity is also given, then we have P(cavity | toothache, cavity) = 1.

Note: the less specific belief remains valid after more evidence arrives, but it is not always useful.

New evidence may be irrelevant, allowing simplification.

This kind of inference, sanctioned by domain knowledge, is crucial.

Page 22: CS 541: Artificial Intelligence

Conditional probability

Definition of conditional probability:
  P(a | b) = P(a ∧ b) / P(b)   if P(b) ≠ 0

Product rule gives an alternative formulation:
  P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)

A general version holds for whole distributions, e.g., P(X, Y) = P(X | Y) P(Y).
  (View this as a set of equations, not matrix multiplication.)

Chain rule is derived by successive application of the product rule:
  P(X_1, ..., X_n) = P(X_1, ..., X_{n−1}) P(X_n | X_1, ..., X_{n−1}) = ∏_{i=1}^{n} P(X_i | X_1, ..., X_{i−1})
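The sketch below checks the definition, the product rule, and the two-variable chain rule numerically on a tiny joint distribution; the four joint values are arbitrary assumptions chosen only to sum to 1.

```python
from itertools import product

# Arbitrary joint distribution P(A, B) over two Booleans (assumed values, sum to 1).
joint = {
    (True, True): 0.20, (True, False): 0.30,
    (False, True): 0.10, (False, False): 0.40,
}

def P(a=None, b=None):
    """Joint or marginal probability, summing out any unspecified variable."""
    return sum(p for (av, bv), p in joint.items()
               if (a is None or av == a) and (b is None or bv == b))

def P_a_given_b(a, b):
    """Definition: P(a | b) = P(a, b) / P(b)."""
    return P(a, b) / P(b=b)

def P_b_given_a(b, a):
    """Definition: P(b | a) = P(a, b) / P(a)."""
    return P(a, b) / P(a=a)

for a, b in product([True, False], repeat=2):
    # Product rule: P(a, b) = P(a | b) P(b) = P(b | a) P(a)
    assert abs(P(a, b) - P_a_given_b(a, b) * P(b=b)) < 1e-12
    assert abs(P(a, b) - P_b_given_a(b, a) * P(a=a)) < 1e-12

print("product rule and two-variable chain rule hold on this joint")
```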

Page 23: CS 541: Artificial Intelligence

Inference

Page 24: CS 541: Artificial Intelligence

Inference by enumeration Start with the joint distribution:

For any proposition , sum the atomic events where it is true:

Page 25: CS 541: Artificial Intelligence

Inference by enumeration Start with the joint distribution:

For any proposition , sum the atomic events where it is true:

Page 26: CS 541: Artificial Intelligence

Inference by enumeration

Start with the joint distribution:

For any proposition, sum the atomic events where it is true.

Page 27: CS 541: Artificial Intelligence

Normalization

Start with the joint distribution:

General idea: compute the distribution on the query variable by fixing the evidence variables and summing over the hidden variables.

Page 28: CS 541: Artificial Intelligence

Inference by enumeration

Let X be all the variables. Typically, we want the posterior joint distribution of the query variables Y, given specific values e for the evidence variables E.

Let the hidden variables be H = X − Y − E.

Then the required summation of joint entries is done by summing out the hidden variables:
  P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)

The terms in the summation are joint entries because Y, E, and H together exhaust the set of random variables.

Obvious problems:
  1) Worst-case time complexity O(d^n), where d is the largest arity
  2) Space complexity O(d^n) to store the joint distribution
  3) How do we find the numbers for O(d^n) entries?
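A minimal sketch of inference by enumeration over the dentistry variables (Toothache, Cavity, Catch). The eight joint values are the textbook's standard numbers, assumed here because the slide's table is not reproduced; the query follows the pattern P(Y | E = e) = α Σ_h P(Y, E = e, H = h).

```python
from itertools import product

# Full joint distribution over (toothache, cavity, catch).
# Values are the standard AIMA dentistry numbers (assumed; the slide's table is not shown here).
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.016, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, True,  False): 0.008,
    (False, False, True):  0.144, (False, False, False): 0.576,
}
VARS = ("toothache", "cavity", "catch")

def enumerate_query(query_var, evidence):
    """P(query_var | evidence) by summing joint entries over the hidden variables."""
    dist = {}
    for qval in (True, False):
        total = 0.0
        for point in product((True, False), repeat=len(VARS)):
            world = dict(zip(VARS, point))
            if world[query_var] == qval and all(world[e] == v for e, v in evidence.items()):
                total += joint[point]
        dist[qval] = total
    alpha = 1.0 / sum(dist.values())          # normalization constant
    return {v: alpha * p for v, p in dist.items()}

print(enumerate_query("cavity", {"toothache": True}))
# -> roughly {True: 0.6, False: 0.4}
```

The nested loop over all sample points is exactly the source of the O(d^n) time and space cost listed above.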

Page 29: CS 541: Artificial Intelligence

Independence

Page 30: CS 541: Artificial Intelligence

Independence

A and B are independent iff P(A | B) = P(A), or P(B | A) = P(B), or P(A, B) = P(A) P(B).
  E.g., P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather):
  32 entries reduced to 12; for n independent biased coins, 2^n entries reduce to n.

Absolute independence is powerful but rare.

Dentistry is a large field with hundreds of variables, none of which are independent. What to do?

Page 31: CS 541: Artificial Intelligence

Conditional independence

P(Toothache, Cavity, Catch) has 2³ − 1 = 7 independent entries.

If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
  (1) P(catch | toothache, cavity) = P(catch | cavity)

The same independence holds if I haven't got a cavity:
  (2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)

Catch is conditionally independent of Toothache given Cavity:
  P(Catch | Toothache, Cavity) = P(Catch | Cavity)

Equivalent statements:
  P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
  P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)

Page 32: CS 541: Artificial Intelligence

Conditional independence

Write out the full joint distribution using the chain rule:
  P(Toothache, Catch, Cavity)
    = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
    = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)   [by (1) and (2)]

I.e., 2 + 2 + 1 = 5 independent numbers (equations 1 and 2 remove 2).

In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n.

Conditional independence is our most basic and robust form of knowledge about uncertain environments.

Page 33: CS 541: Artificial Intelligence

Bayes’ Rule

Page 34: CS 541: Artificial Intelligence

Bayes’ Rule

Product rule P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a) gives Bayes' rule:
  P(a | b) = P(b | a) P(a) / P(b)

Or in distribution form:
  P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)

Useful for assessing diagnostic probability from causal probability:
  P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)

E.g., let m be meningitis and s be stiff neck:
  P(m | s) = P(s | m) P(m) / P(s)

Note: the posterior probability of meningitis is still very small!
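A one-line numerical illustration of the diagnostic direction. The figures (P(s | m) = 0.7, P(m) = 1/50000, P(s) = 0.01) are the textbook's standard meningitis numbers and are assumed here, since the slide's values are not shown.

```python
# Diagnostic probability from causal probability via Bayes' rule.
# m = meningitis, s = stiff neck. Numbers are AIMA's standard example values (assumed).
p_s_given_m = 0.7        # causal: P(stiff neck | meningitis)
p_m = 1 / 50000          # prior:  P(meningitis)
p_s = 0.01               # prior:  P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s   # Bayes' rule
print(p_m_given_s)   # 0.0014 -- the posterior is still very small
```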

Page 35: CS 541: Artificial Intelligence

Bayes' Rule and conditional independence

P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity)
                              = α P(toothache | Cavity) P(catch | Cavity) P(Cavity)

This is an example of a naive Bayes model:
  P(Cause, Effect_1, ..., Effect_n) = P(Cause) ∏_i P(Effect_i | Cause)

The total number of parameters is linear in n.
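A minimal naive Bayes sketch combining Bayes' rule with the conditional independence established earlier. The prior and the per-effect CPT values are illustrative assumptions, not numbers from the lecture.

```python
# Naive Bayes: P(Cause | e1, ..., en) = alpha * P(Cause) * prod_i P(e_i | Cause).
# The CPT values below are illustrative assumptions only.
p_cavity = 0.2
p_effect_given_cause = {
    # P(effect = true | Cavity = value)
    "toothache": {True: 0.6, False: 0.1},
    "catch":     {True: 0.9, False: 0.2},
}

def naive_bayes(observed):
    """observed: dict effect -> bool. Returns the posterior P(Cavity | observed)."""
    scores = {}
    for cavity in (True, False):
        prior = p_cavity if cavity else 1 - p_cavity
        likelihood = 1.0
        for effect, value in observed.items():
            p_true = p_effect_given_cause[effect][cavity]
            likelihood *= p_true if value else 1 - p_true
        scores[cavity] = prior * likelihood
    alpha = 1.0 / sum(scores.values())
    return {c: alpha * s for c, s in scores.items()}

print(naive_bayes({"toothache": True, "catch": True}))
```

The model needs one prior plus one number per effect and cause value, which is the linear parameter count the slide refers to.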

Page 36: CS 541: Artificial Intelligence

Wumpus world

P_{i,j} = true iff square [i, j] contains a pit.
B_{i,j} = true iff square [i, j] is breezy.
Include only B_{1,1}, B_{1,2}, B_{2,1} (the observed squares) in the probability model.

Page 37: CS 541: Artificial Intelligence

Specifying the probability model

The full joint distribution is P(P_{1,1}, ..., P_{4,4}, B_{1,1}, B_{1,2}, B_{2,1}).

Apply the product rule: P(B_{1,1}, B_{1,2}, B_{2,1} | P_{1,1}, ..., P_{4,4}) P(P_{1,1}, ..., P_{4,4}).
(Do it this way to get P(Effect | Cause).)

First term: 1 if the pits are adjacent to the breezes, 0 otherwise.

Second term: pits are placed randomly with probability 0.2 per square, giving 0.2^n × 0.8^(16−n) for n pits.

Page 38: CS 541: Artificial Intelligence

Observation and query

We know the following facts: the observed breezes b and the known pit-free squares (known).

The query is P(P_{1,3} | known, b).

Define Unknown = the pit variables other than P_{1,3} and the known squares. For inference by enumeration, we have
  P(P_{1,3} | known, b) = α Σ_{unknown} P(P_{1,3}, unknown, known, b)

This grows exponentially with the number of squares!

Page 39: CS 541: Artificial Intelligence

Using conditional independence

Basic insight: observations are conditionally independent of other hidden squares given the neighboring hidden squares.

Define Unknown = Fringe ∪ Other, where Fringe contains the unknown squares adjacent to the observed ones and Other contains the rest.

Manipulate the query into a form where we can use this.

Page 40: CS 541: Artificial Intelligence

Using conditional independence

Page 41: CS 541: Artificial Intelligence

Using conditional independence

Page 42: CS 541: Artificial Intelligence

Summary

Probability is a rigorous formalism for uncertain knowledge.

The joint probability distribution specifies the probability of every atomic event.

Queries can be answered by summing over atomic events.

For nontrivial domains, we must find a way to reduce the joint size.

Independence and conditional independence provide the tools.

Page 43: CS 541: Artificial Intelligence

Bayesian Network

Lecture VI—Part II

Page 44: CS 541: Artificial Intelligence

Outline
  - Syntax
  - Semantics
  - Parameterized distributions

Page 45: CS 541: Artificial Intelligence

Bayesian networks

A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions.

Syntax:
  - A set of nodes, one per variable
  - A directed, acyclic graph (a link means "directly influences")
  - A conditional distribution for each node given its parents: P(X_i | Parents(X_i))

In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over X_i for each combination of parent values.

Page 46: CS 541: Artificial Intelligence

Examples

Topology of the network encodes conditional independence assertions:
  - Weather is independent of the other variables
  - Toothache and Catch are conditionally independent given Cavity

Page 47: CS 541: Artificial Intelligence

Example

I'm at work; neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

Network topology reflects "causal" knowledge:
  - A burglar can set the alarm off
  - An earthquake can set the alarm off
  - The alarm can cause Mary to call
  - The alarm can cause John to call

Page 48: CS 541: Artificial Intelligence

Example

Page 49: CS 541: Artificial Intelligence

Compactness

A CPT for a Boolean variable X_i with k Boolean parents has 2^k rows for the combinations of parent values.

Each row requires one number p for P(X_i = true | parent values); the number for P(X_i = false | ...) is just 1 − p.

If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers, i.e., it grows linearly with n, vs. O(2^n) for the full joint distribution.

For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31).
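The parameter counts can be checked with a few lines of arithmetic; the burglary parent counts (0, 0, 2, 1, 1) follow from the network topology described above.

```python
def bn_parameters(n, k):
    """Upper bound on CPT numbers for n Boolean variables with at most k parents each."""
    return n * 2 ** k

def full_joint_parameters(n):
    """Independent numbers needed for the full joint over n Boolean variables."""
    return 2 ** n - 1

# Burglary network: 5 variables, at most 2 parents each.
print(bn_parameters(5, 2), "vs", full_joint_parameters(5))   # 20 (upper bound) vs 31

# With the actual parent counts (0, 0, 2, 1, 1) the exact count is 1 + 1 + 4 + 2 + 2 = 10.
print(sum(2 ** k for k in (0, 0, 2, 1, 1)))                   # 10
```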

Page 50: CS 541: Artificial Intelligence

Compactness

Global semantics define the full joint distribution as the product of the local conditional distributions:
  P(x_1, ..., x_n) = ∏_{i=1}^{n} P(x_i | parents(X_i))

E.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)

Page 51: CS 541: Artificial Intelligence

Compactness

Global semantics define the full joint distribution as the product of the local conditional distributions:
  P(x_1, ..., x_n) = ∏_{i=1}^{n} P(x_i | parents(X_i))

E.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e), evaluated numerically in the sketch below.
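A small sketch that evaluates one full-joint entry for the burglary network as a product of local conditional probabilities. The CPT values are the textbook's standard burglary numbers and are assumed here, since the slide's tables are not reproduced.

```python
# CPTs for the burglary network (standard AIMA values, assumed).
P_B = 0.001                        # P(Burglary)
P_E = 0.002                        # P(Earthquake)
P_A = {                            # P(Alarm | Burglary, Earthquake)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
P_J = {True: 0.90, False: 0.05}    # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}    # P(MaryCalls | Alarm)

def joint(j, m, a, b, e):
    """P(J=j, M=m, A=a, B=b, E=e) as the product of the local conditional probabilities."""
    def pick(p_true, val):
        return p_true if val else 1 - p_true
    return (pick(P_J[a], j) * pick(P_M[a], m) *
            pick(P_A[(b, e)], a) * pick(P_B, b) * pick(P_E, e))

# e.g. P(j, m, a, not b, not e)
print(joint(True, True, True, False, False))   # about 0.000628
```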

Page 52: CS 541: Artificial Intelligence

Local semantics

Local semantics: each node is conditionally independent of its nondescendants given its parents.

Theorem: local semantics ⇔ global semantics

Page 53: CS 541: Artificial Intelligence

Markov blanket

Each node is conditionally independent of all others given its Markov blanket: parents + children + children's parents.

Page 54: CS 541: Artificial Intelligence

Constructing Bayesian networks

We need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics.

  1. Choose an ordering of variables X_1, ..., X_n
  2. For i = 1 to n:
       add X_i to the network
       select parents from X_1, ..., X_{i−1} such that P(X_i | Parents(X_i)) = P(X_i | X_1, ..., X_{i−1})

This choice of parents guarantees the global semantics:
  P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | X_1, ..., X_{i−1})   (chain rule)
                   = ∏_{i=1}^{n} P(X_i | Parents(X_i))        (by construction)

Page 55: CS 541: Artificial Intelligence

Example

Suppose we choose the ordering MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.

?

Page 56: CS 541: Artificial Intelligence

Example

Suppose we choose the ordering MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.

? No? ?

Page 57: CS 541: Artificial Intelligence

Example

Suppose we choose the ordering MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.

? No? ? No? ?

Page 58: CS 541: Artificial Intelligence

Example

Suppose we choose the ordering MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.

? No? ? No? Yes? No??

Page 59: CS 541: Artificial Intelligence

Example

Suppose we choose the ordering MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.

? No? ? No? No? No? No? Yes

Page 60: CS 541: Artificial Intelligence

Example

Deciding conditional independence is hard in noncausal directions.
(Causal models and conditional independence seem hardwired for humans!)

Assessing conditional probabilities is hard in noncausal directions.

The network is less compact: more numbers are needed than with the causal ordering.

Page 61: CS 541: Artificial Intelligence

Example: Car diagnosis

Initial evidence: car won't start.
Testable variables (green), "broken, so fix it" variables (orange).
Hidden variables (gray) ensure sparse structure and reduce parameters.

Page 62: CS 541: Artificial Intelligence

Example: Car insurance

Page 63: CS 541: Artificial Intelligence

Compact conditional distributions

A CPT grows exponentially with the number of parents, and becomes infinite with a continuous-valued parent or child.

Solution: canonical distributions that are defined compactly.

Deterministic nodes are the simplest case: X = f(Parents(X)) for some function f.
  E.g., Boolean functions.
  E.g., numerical relationships among continuous variables.

Page 64: CS 541: Artificial Intelligence

Compact conditional distributions

Noisy-OR distributions model multiple noninteracting causes:
  1) The parents U_1, ..., U_k include all causes (a leak node can be added).
  2) Each cause alone has an independent failure (inhibition) probability q_i, so P(X = true | the active causes) = 1 − ∏_{i active} q_i.

The number of parameters is linear in the number of parents.
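A minimal noisy-OR sketch: with an independent inhibition probability q_i per cause, the effect fails only if every active cause is inhibited, so k parameters determine the whole CPT. The cold/flu/malaria inhibition values are the textbook's standard fever example and are assumed here.

```python
from itertools import combinations

# Noisy-OR: the effect is absent only if every active cause is independently inhibited.
# Inhibition probabilities q_i (AIMA's fever example values, assumed).
q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}

def noisy_or(active_causes):
    """P(effect = true | the given causes active, all other causes inactive)."""
    p_all_inhibited = 1.0
    for cause in active_causes:
        p_all_inhibited *= q[cause]
    return 1.0 - p_all_inhibited

# The full CPT (2**len(q) rows) is determined by only len(q) parameters.
for r in range(len(q) + 1):
    for causes in combinations(sorted(q), r):
        print(causes, round(noisy_or(causes), 3))
```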

Page 65: CS 541: Artificial Intelligence

Hybrid (discrete + continuous) networks

Discrete variables (Subsidy? and Buys?); continuous variables (Harvest and Cost).

Option 1: discretization; possibly large errors and large CPTs.
Option 2: finitely parameterized canonical families:
  1) Continuous variable with discrete + continuous parents (e.g., Cost)
  2) Discrete variable with continuous parents (e.g., Buys?)

Page 66: CS 541: Artificial Intelligence

Continuous child variables

We need one conditional density function for the child variable given its continuous parents, for each possible assignment to the discrete parents.

The most common choice is the linear Gaussian model, e.g., P(Cost = c | Harvest = h) is a Gaussian whose mean varies linearly with h and whose variance is fixed.

Linear variation is unreasonable over the full range, but works OK if the likely range of Harvest is narrow.
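A sketch of a linear Gaussian conditional density: the child's mean is a linear function of the continuous parent with a fixed variance. The coefficients a, b and the standard deviation are placeholder assumptions.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def p_cost_given_harvest(c, h, a=-1.0, b=10.0, sigma=1.0):
    """Linear Gaussian: P(Cost = c | Harvest = h) = N(a*h + b, sigma^2)(c).
    a, b, sigma are placeholder parameters (assumed)."""
    return gaussian_pdf(c, mu=a * h + b, sigma=sigma)

# The mean cost shifts linearly with the harvest; the spread stays the same.
for h in (2.0, 5.0, 8.0):
    print(h, round(p_cost_given_harvest(c=8.0, h=h), 4))
```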

Page 67: CS 541: Artificial Intelligence

Continuous child variables

An all-continuous network with linear Gaussian (LG) distributions has a full joint distribution that is a multivariate Gaussian.

A discrete + continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values.

Page 68: CS 541: Artificial Intelligence

Discrete variable w/ continuous parents

The probability of Buys? given Cost should be a "soft" threshold.

The probit distribution uses the integral of the Gaussian: Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt.

Page 69: CS 541: Artificial Intelligence

Why the probit

1. It's sort of the right shape.
2. It can be viewed as a hard threshold whose location is subject to noise.

Page 70: CS 541: Artificial Intelligence

Discrete variable

The sigmoid (or logit) distribution is also used in neural networks.

The sigmoid has a similar shape to the probit but much longer tails.
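A small comparison of the two soft-threshold forms for a discrete child with a continuous parent: the probit (integral of a standard Gaussian) and the logistic sigmoid. The threshold and scale parameters, and the exact parameterization, are assumptions; the slide's own formula is not reproduced here.

```python
import math

def probit(x):
    """Phi(x): integral of the standard Gaussian density up to x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_buys_probit(cost, mu=5.0, sigma=1.0):
    # Soft threshold around mu: buying becomes unlikely as cost rises (parameters assumed).
    return probit((mu - cost) / sigma)

def p_buys_sigmoid(cost, mu=5.0, sigma=1.0):
    # Logistic alternative; similar shape to the probit but with longer tails.
    return 1.0 / (1.0 + math.exp(-(mu - cost) / sigma))

for cost in (3.0, 5.0, 7.0, 9.0):
    print(cost, round(p_buys_probit(cost), 4), round(p_buys_sigmoid(cost), 4))
```

Printing a few values shows both curves dropping from near 1 to near 0 around the threshold, with the sigmoid decaying more slowly in the tails.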

Page 71: CS 541: Artificial Intelligence

Summary

Bayes nets provide a natural representation for (causally induced) conditional independence.

Topology + CPTs = compact representation of the joint distribution.

Generally easy for (non)experts to construct.

Canonical distributions (e.g., noisy-OR) = compact representation of CPTs.

Continuous variables ⇒ parameterized distributions (e.g., linear Gaussian).