CS 541: Artificial Intelligence
Lecture VI: Modeling Uncertainty and Bayesian Network
Schedule
CA: Yizhou Lin Email Address: [email protected]
Week | Date | Topic | Reading | Homework**
1 | 08/29/2012 | Introduction & Intelligent Agent | Ch 1 & 2 | N/A
2 | 09/05/2012 | Search: search strategy and heuristic search | Ch 3 & 4s | HW1 (Search)
3 | 09/12/2012 | Search: Constraint Satisfaction & Adversarial Search | Ch 4s & 5 & 6 | Teaming Due
4 | 09/19/2012 | Logic: Logic Agent & First Order Logic | Ch 7 & 8s | HW1 due, Midterm Project (Game)
5 | 09/26/2012 | Logic: Inference on First Order Logic | Ch 8s & 9 |
6 | 10/03/2012 | No class | |
7 | 10/10/2012 | Uncertainty and Bayesian Network | Ch 13 & 14s | HW2 (Logic)
8 | 10/17/2012 | Midterm Presentation | | Midterm Project Due
9 | 10/24/2012 | Inference in Bayesian Network | Ch 14s | HW2 Due, HW3 (Probabilistic Reasoning)
10 | 10/31/2012 | Probabilistic Reasoning over Time | Ch 15 |
11 | 11/07/2012 | Machine Learning | | HW3 due
12 | 11/14/2012 | Markov Decision Process | Ch 18 & 20 | HW4 (Probabilistic Reasoning Over Time)
13 | 11/21/2012 | No class | Ch 16 |
14 | 11/29/2012 | Reinforcement learning | Ch 21 | HW4 due
15 | 12/05/2012 | Final Project Presentation | | Final Project Due
Re-cap of Lecture IV – Part I
Inference in first-order logic:
Reducing first-order inference to propositional inference; Unification; Generalized Modus Ponens; Forward and backward chaining; Logic programming; Resolution: rules to transform to CNF
Modeling Uncertainty
Lecture VI—Part I
Outline Uncertainty Probability Syntax and Semantics Inference Independence and Bayes’ Rule
Uncertainty
Uncertainty
Let action A_t = leave for airport t minutes before the flight. Will A_t get me there on time? Problems:
1) Partial observability (road state, other drivers' plans, etc.)
2) Noisy sensors (KCBS traffic reports)
3) Uncertainty in action outcomes (flat tire, etc.)
4) Immense complexity of modeling and predicting traffic
Hence a purely logical approach either
1) Risks falsehood: "A_25 will get me there on time", or
2) Leads to conclusions that are too weak for decision making: "A_25 will get me there on time if there's no accident on the bridge, and it doesn't rain, and my tires remain intact, etc."
A_1440 might reasonably be said to get me there on time, but I'd have to stay overnight in the airport …
Methods for handling uncertainty
Default or nonmonotonic logic:
Assume my car does not have a flat tire
Assume A_25 works unless contradicted by evidence
Issues: What assumptions are reasonable? How to handle contradiction?
Rules with fudge factors:
Issues: Problems with combination, e.g., does Sprinkler cause Rain??
Probability
Given the available evidence, A_25 will get me there on time with probability 0.04
Mahaviracarya (9th century), Cardano (1565): theory of gambling
Fuzzy logic handles degree of truth, NOT uncertainty, e.g.,
WetGrass is true to degree 0.2
Probability
Probability
Probabilistic assertions summarize effects of
Laziness: failure to enumerate exceptions, qualifications, etc.
Ignorance: lack of relevant facts, initial conditions, etc.
Subjective or Bayesian probability: probabilities relate propositions to one's own state of knowledge
E.g., P(A_25 | no reported accidents) = 0.06
These are not claims of a "probabilistic tendency" in the current situation (but might be learned from past experience of similar situations)
Probabilities of propositions change with new evidence:
E.g., P(A_25 | no reported accidents, 5 a.m.) = 0.15
(Analogous to logical entailment status, not truth.)
Making decisions under uncertainty
Suppose I believe the following:
P(A_25 gets me there on time | …) = 0.04
P(A_90 gets me there on time | …) = 0.70
P(A_120 gets me there on time | …) = 0.95
P(A_1440 gets me there on time | …) = 0.9999
Which action to choose? Depends on my preferences for missing the flight vs. airport cuisine, etc.
Utility theory is used to represent and infer preferences
Decision theory = utility theory + probability theory
Probability basics
Begin with a set Ω, the sample space
E.g., 6 possible rolls of a die. ω ∈ Ω is a sample point/possible world/atomic event
A probability space or probability model is a sample space with an assignment P(ω) for every ω ∈ Ω s.t. 0 ≤ P(ω) ≤ 1 and Σ_ω P(ω) = 1, e.g., P(1) = P(2) = … = P(6) = 1/6
An event A is any subset of Ω: P(A) = Σ_{ω ∈ A} P(ω)
E.g., P(die roll < 4) = P(1) + P(2) + P(3) = 1/6 + 1/6 + 1/6 = 1/2
Random variables
A random variable is a function from sample points to some range, e.g., the reals or Booleans
E.g., Odd(1) = true
P induces a probability distribution for any r.v. X:
P(X = x_i) = Σ_{ω: X(ω) = x_i} P(ω)
e.g., P(Odd = true) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2
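To make the induced-distribution formula concrete, here is a minimal Python sketch (not from the slides) using the die example; the function names are ours:

```python
# Minimal sketch: the distribution P induces for a random variable,
# using the six-sided die and the Odd r.v. from the slide.
P = {w: 1/6 for w in range(1, 7)}   # sample space: six equiprobable rolls

def odd(w):
    """Random variable Odd: maps a sample point to a Boolean."""
    return w % 2 == 1

def induced(P, X, x):
    """P(X = x) = sum of P(w) over sample points w with X(w) = x."""
    return sum(p for w, p in P.items() if X(w) == x)

print(induced(P, odd, True))   # 0.5, i.e., P(1) + P(3) + P(5)
```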
Propositions
Think of a proposition as the event (set of sample points) where the proposition is true
Given Boolean random variables A and B:
event a = set of sample points where A(ω) = true
event ¬a = set of sample points where A(ω) = false
event a ∧ b = points where A(ω) = true and B(ω) = true
Often in AI applications, the sample points are defined by the values of a set of random variables, i.e., the sample space is the Cartesian product of the ranges of the variables
With Boolean variables, sample point = propositional logic model
E.g., A = true, B = false, or a ∧ ¬b
Proposition = disjunction of atomic events in which it is true
E.g., (a ∨ b) ≡ (¬a ∧ b) ∨ (a ∧ ¬b) ∨ (a ∧ b), so P(a ∨ b) = P(¬a ∧ b) + P(a ∧ ¬b) + P(a ∧ b)
Why use probability
The definitions imply that certain logically related events must have related probabilities
E.g., P(a ∨ b) = P(a) + P(b) − P(a ∧ b)
de Finetti (1931): an agent who bets according to probabilities that violate these axioms can be forced to bet so as to lose money regardless of outcome.
Syntax for propositions
Propositional or Boolean random variables
E.g., Cavity (do I have a cavity?). Cavity = true is a proposition, also written cavity
Discrete random variables (finite or infinite)
E.g., Weather is one of ⟨sunny, rain, cloudy, snow⟩; Weather = rain is a proposition
Values must be exhaustive and mutually exclusive
Continuous random variables (bounded or unbounded)
E.g., Temp = 21.6; also allow, e.g., Temp < 22.0
Arbitrary Boolean combinations of basic propositions
Prior probability
Prior or unconditional probabilities of propositions
E.g., P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72 correspond to belief prior to arrival of any (new) evidence
Probability distribution gives values for all possible assignments:
P(Weather) = ⟨0.72, 0.1, 0.08, 0.1⟩ (normalized, i.e., sums to 1)
Joint probability distribution for a set of r.v.s gives the probability of every atomic event on those r.v.s (i.e., every sample point)
P(Weather, Cavity) = a 4 × 2 matrix of values
Every question about a domain can be answered by the joint distribution because every event is a sum of sample points
Probability for continuous variables
Express distribution as a parameterized function of value:
P(X = x) = U[18, 26](x) = uniform density between 18 and 26
Here P is a density; it integrates to 1
I.e., P(X = 20.5) = 0.125 really means lim_{dx→0} P(20.5 ≤ X ≤ 20.5 + dx) / dx = 0.125
Gaussian density
P(x) = (1 / (√(2π) σ)) e^{−(x−μ)² / (2σ²)}
Conditional probability
Conditional or posterior probabilities
E.g., P(cavity | toothache) = 0.8, i.e., given that toothache is all I know
NOT "if toothache then 80% chance of cavity"
Notation for conditional distributions:
P(Cavity | Toothache) = 2-element vector of 2-element vectors
If we know more, e.g., cavity is also given, then we have P(cavity | toothache, cavity) = 1
Note: the less specific belief remains valid after more evidence arrives, but is not always useful
New evidence may be irrelevant, allowing simplification, e.g., P(cavity | toothache, 49ersWin) = P(cavity | toothache) = 0.8
This kind of inference, sanctioned by domain knowledge, is crucial
Conditional probability
Definition of conditional probability:
P(a | b) = P(a ∧ b) / P(b) if P(b) ≠ 0
Product rule gives an alternative formulation:
P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
A general version holds for whole distributions, e.g.,
P(Weather, Cavity) = P(Weather | Cavity) P(Cavity)
(View as a set of 4 × 2 equations, not matrix mult.)
Chain rule is derived by successive application of product rule:
P(X_1, …, X_n) = P(X_1, …, X_{n−1}) P(X_n | X_1, …, X_{n−1}) = Π_{i=1}^{n} P(X_i | X_1, …, X_{i−1})
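As a quick numeric sanity check (not part of the original slides), the product rule can be verified on any small joint; the four entries below are illustrative values only:

```python
# Verify the product rule P(a ^ b) = P(a | b) P(b) on a tiny Boolean joint.
# The joint entries are made-up numbers that sum to 1.
joint = {(True, True): 0.12, (True, False): 0.08,
         (False, True): 0.28, (False, False): 0.52}

p_b = sum(p for (a, b), p in joint.items() if b)   # marginal P(b)
p_ab = joint[(True, True)]                          # joint P(a ^ b)
p_a_given_b = p_ab / p_b                            # definition of conditional
assert abs(p_a_given_b * p_b - p_ab) < 1e-12        # product rule holds
print(p_a_given_b)                                  # 0.3
```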
Inference
Inference by enumeration
Start with the joint distribution (the Toothache, Catch, Cavity example)
For any proposition φ, sum the atomic events where it is true:
P(φ) = Σ_{ω: ω ⊨ φ} P(ω)
E.g., P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
E.g., P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
Normalization
P(Cavity | toothache) = α P(Cavity, toothache) = α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)] = α [⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩] = α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩
General idea: compute distribution on query variable by fixing evidence variables and summing over hidden variables
Inference by enumeration
Let X be all the variables. Typically, we want
the posterior joint distribution of the query variables Y, given specific values e for the evidence variables E
Let the hidden variables be H = X − Y − E
Then the required summation of joint entries is done by summing out the hidden variables:
P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)
The terms in the summation are joint entries because Y, E, and H together exhaust the set of random variables
Obvious problems:
1) Worst-case time complexity O(d^n), where d is the largest arity
2) Space complexity O(d^n) to store the joint distribution
3) How to find the numbers for O(d^n) entries???
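The enumerate-and-normalize procedure is easy to express directly. A sketch below answers P(Cavity | toothache) over the dentistry joint; the eight entries are the standard textbook (AIMA) values:

```python
# Sketch of inference by enumeration over the dentistry joint distribution.
joint = {  # (toothache, catch, cavity) -> probability
    (True,  True,  True):  0.108, (True,  True,  False): 0.016,
    (True,  False, True):  0.012, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, True,  False): 0.144,
    (False, False, True):  0.008, (False, False, False): 0.576,
}

def query_cavity(toothache):
    """P(Cavity | Toothache = toothache): fix evidence, sum out hidden Catch."""
    dist = {}
    for cavity in (True, False):
        dist[cavity] = sum(joint[(toothache, catch, cavity)]
                           for catch in (True, False))  # sum over hidden var
    z = sum(dist.values())                               # normalization (alpha)
    return {v: p / z for v, p in dist.items()}

print(query_cavity(True))   # {True: 0.6, False: 0.4}
```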
Independence
Independence
A and B are independent iff P(A | B) = P(A), or P(B | A) = P(B), or P(A, B) = P(A) P(B)
E.g., P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
32 entries reduced to 12; for n independent biased coins, 2^n → n
Absolute independence is powerful but rare
Dentistry is a large field with hundreds of variables, none of which are independent. What to do?
Conditional independence
P(Toothache, Cavity, Catch) has 2³ − 1 = 7 independent entries
If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
(1) P(catch | toothache, cavity) = P(catch | cavity)
The same independence holds if I haven't got a cavity:
(2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
Catch is conditionally independent of Toothache given Cavity:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
Conditional independence
Write out full joint distribution using chain rule:
P(Toothache, Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
I.e., 2 + 2 + 1 = 5 independent numbers (equations 1 and 2 remove 2)
In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n.
Conditional independence is our most basic and robust form of knowledge about uncertain environments.
Bayes’ Rule
Bayes' Rule
Product rule P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
⇒ Bayes' rule P(a | b) = P(b | a) P(a) / P(b)
Or in distribution form: P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)
Useful for assessing diagnostic probability from causal probability:
P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
E.g., let M be meningitis, S be stiff neck:
P(m | s) = P(s | m) P(m) / P(s) = (0.8 × 0.0001) / 0.1 = 0.0008
Note: posterior probability of meningitis is still very small!
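The diagnostic computation is one line of code; the sketch below uses the slide's meningitis example (textbook illustrative values):

```python
# Diagnostic inference from causal numbers via Bayes' rule.
p_s_given_m = 0.8     # causal: P(stiff neck | meningitis)
p_m = 0.0001          # prior:  P(meningitis)
p_s = 0.1             # evidence prior: P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s   # Bayes' rule
print(p_m_given_s)                       # 0.0008, still very small
```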
Bayes' Rule and conditional independence
P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity) = α P(toothache | Cavity) P(catch | Cavity) P(Cavity)
This is an example of a naive Bayes model:
P(Cause, Effect_1, …, Effect_n) = P(Cause) Π_i P(Effect_i | Cause)
Total number of parameters is linear in n
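A minimal naive Bayes sketch follows; the prior and CPT numbers are illustrative assumptions, not values from the slides:

```python
# Naive Bayes: P(Cause | e1..en) is proportional to P(Cause) * prod_i P(ei | Cause).
priors = {"cavity": 0.2, "no_cavity": 0.8}
cpt = {  # P(effect | cause), one entry per observed effect (illustrative)
    "cavity":    {"toothache": 0.6, "catch": 0.9},
    "no_cavity": {"toothache": 0.1, "catch": 0.2},
}

def posterior(evidence):
    scores = {}
    for cause, prior in priors.items():
        score = prior
        for e in evidence:            # effects conditionally independent given cause
            score *= cpt[cause][e]
        scores[cause] = score
    z = sum(scores.values())          # normalize (alpha)
    return {c: s / z for c, s in scores.items()}

print(posterior(["toothache", "catch"]))   # cavity ~0.87
```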
Wumpus world
P_{ij} = true iff [i, j] contains a pit
B_{ij} = true iff [i, j] is breezy
Include only B_{1,1}, B_{1,2}, B_{2,1} in the probability model
The full joint distribution is P(P_{1,1}, …, P_{4,4}, B_{1,1}, B_{1,2}, B_{2,1})
Apply product rule: P(B_{1,1}, B_{1,2}, B_{2,1} | P_{1,1}, …, P_{4,4}) P(P_{1,1}, …, P_{4,4})
(Do it this way to get P(Effect | Cause).)
First term: 1 if pits are adjacent to breezes, 0 otherwise
Second term: pits are placed randomly, probability 0.2 per square:
P(P_{1,1}, …, P_{4,4}) = Π_{i,j} P(P_{ij}) = 0.2^n × 0.8^{16−n} for n pits
Specifying the probability model
We know the following facts:
b = ¬b_{1,1} ∧ b_{1,2} ∧ b_{2,1}
known = ¬p_{1,1} ∧ ¬p_{1,2} ∧ ¬p_{2,1}
Query is P(P_{1,3} | known, b)
Define Unknown = the P_{ij}s other than P_{1,3} and Known
For inference by enumeration, we have
P(P_{1,3} | known, b) = α Σ_{unknown} P(P_{1,3}, unknown, known, b)
Grows exponentially with number of squares!
Observation and query
Using conditional independence
Basic insight: observations are conditionally independent of other hidden squares given neighboring hidden squares
Define Unknown = Fringe ∪ Other; then P(b | P_{1,3}, Known, Unknown) = P(b | P_{1,3}, Known, Fringe)
Manipulate query into a form where we can use this: sum over fringe configurations only, weighting each by its prior
This yields P(P_{1,3} | known, b) ≈ ⟨0.31, 0.69⟩ and P(P_{2,2} | known, b) ≈ ⟨0.86, 0.14⟩, as the sketch below reproduces
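A brute-force sketch, assuming the standard setup (pit prior 0.2; breezes observed in [1,2] and [2,1], none in [1,1]; fringe squares [1,3], [2,2], [3,1]); square names follow the slides:

```python
# Wumpus query using the fringe insight: breezes depend only on pits
# adjacent to visited squares, so enumerating the 3 fringe squares suffices.
from itertools import product

def prior(*pits):
    """Prior of a pit configuration: 0.2 per pit, 0.8 per pit-free square."""
    return 0.2 ** sum(pits) * 0.8 ** (len(pits) - sum(pits))

def consistent(p13, p22, p31):
    b12 = p13 or p22   # breeze at [1,2] iff a pit in [1,3] or [2,2] (others known clear)
    b21 = p22 or p31   # breeze at [2,1] iff a pit in [2,2] or [3,1]
    return b12 and b21 # both observed breezes must be explained

weights = {True: 0.0, False: 0.0}
for p13, p22, p31 in product((True, False), repeat=3):
    if consistent(p13, p22, p31):
        weights[p13] += prior(p13, p22, p31)

z = weights[True] + weights[False]
print(weights[True] / z)   # ~0.31, the classic result for P(p13 | known, b)
```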
Summary
Probability is a rigorous formalism for uncertain knowledge
Joint probability distribution specifies probability of every atomic event
Queries can be answered by summing over atomic events
For nontrivial domains, we must find a way to reduce the joint size
Independence and conditional independence provide the tools
Bayesian Network
Lecture VI—Part II
Outline Syntax Semantics Parameterized distributions
Bayesian networks
A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions
Syntax:
A set of nodes, one per variable
A directed, acyclic graph (link ≈ "directly influences")
A conditional distribution for each node given its parents: P(X_i | Parents(X_i))
In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over X_i for each combination of parent values
Examples
Topology of network encodes conditional independence assertions:
Weather is independent of the other variables
Toothache and Catch are conditionally independent given Cavity
Example
I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects "causal" knowledge:
A burglar can set the alarm off
An earthquake can set the alarm off
The alarm can cause Mary to call
The alarm can cause John to call
Example
Compactness
A CPT for Boolean X_i with k Boolean parents has 2^k rows for the combinations of parent values
Each row requires one number p for P(X_i = true | parent values)
(The number for X_i = false is just 1 − p)
If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers
I.e., grows linearly with n, vs. O(2^n) for the full joint distribution
For burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)
Global semantics
Global semantics define the full joint distribution as the product of the local conditional distributions:
P(x_1, …, x_n) = Π_{i=1}^{n} P(x_i | parents(X_i))
e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e) ≈ 0.00063
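The product of local CPT entries is directly computable; the sketch below uses the familiar textbook burglary-network numbers:

```python
# Global semantics as code: multiply local CPT entries along the network.
P_b, P_e = 0.001, 0.002                  # P(Burglary), P(Earthquake)
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_j = {True: 0.90, False: 0.05}          # P(JohnCalls | Alarm)
P_m = {True: 0.70, False: 0.01}          # P(MaryCalls | Alarm)

# P(j, m, a, not-b, not-e) = P(j|a) P(m|a) P(a|~b,~e) P(~b) P(~e)
p = P_j[True] * P_m[True] * P_a[(False, False)] * (1 - P_b) * (1 - P_e)
print(p)   # ~0.000628
```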
Local semantics
Local semantics: each node is conditionally independent of its nondescendants given its parents
Theorem: local semantics ⇔ global semantics
Markov blanket Each node is conditionally independent of all
others given its Markov blanket: parents + children + children's
parents
Constructing Bayesian networks
Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics
1. Choose an ordering of variables X_1, …, X_n
2. For i = 1 to n:
add X_i to the network
select parents from X_1, …, X_{i−1} such that P(X_i | Parents(X_i)) = P(X_i | X_1, …, X_{i−1})
This choice of parents guarantees the global semantics:
P(X_1, …, X_n) = Π_{i=1}^{n} P(X_i | X_1, …, X_{i−1}) (chain rule)
= Π_{i=1}^{n} P(X_i | Parents(X_i)) (by construction)
Example
Suppose we choose the ordering M, J, A, B, E
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes
P(B | A, J, M) = P(B)? No
P(E | B, A, J, M) = P(E | A)? No
P(E | B, A, J, M) = P(E | A, B)? Yes
Deciding conditional independence is hard in noncausal directions
(Causal models and conditional independence seem hardwired for humans!)
Assessing conditional probabilities is hard in noncausal directions
Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed
Example: Car diagnosis Initial evidence: car won't start Testable variables (green), "broken, so fix it" variables (orange) Hidden variables (gray) ensure sparse structure, reduce parameters
Example: Car insurance
Compact conditional distributions
CPT grows exponentially with number of parents
CPT becomes infinite with continuous-valued parent or child
Solution: canonical distributions that are defined compactly
Deterministic nodes are the simplest case:
X = f(Parents(X)) for some function f
E.g., Boolean functions: NorthAmerican ⇔ Canadian ∨ US ∨ Mexican
E.g., numerical relationships among continuous variables: ∂Level/∂t = inflow + precipitation − outflow − evaporation
Compact conditional distributions
Noisy-OR distributions model multiple non-interacting causes
1) Parents U_1, …, U_k include all causes (can add leak node)
2) Independent failure probability q_i for each cause alone:
P(X | U_1 … U_j, ¬U_{j+1} … ¬U_k) = 1 − Π_{i=1}^{j} q_i
Number of parameters linear in number of parents
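A short noisy-OR sketch, using the textbook fever example's illustrative failure probabilities:

```python
# Noisy-OR: each present cause independently fails to produce the effect
# with probability q_i, so P(not-effect | present causes) = product of q_i.
q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}   # illustrative textbook values

def p_effect(present_causes):
    p_not = 1.0
    for cause in present_causes:
        p_not *= q[cause]          # independent failure per present cause
    return 1.0 - p_not

print(p_effect(["cold", "flu"]))             # 1 - 0.6*0.2 = 0.88
print(p_effect(["cold", "flu", "malaria"]))  # 1 - 0.012  = 0.988
```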
Hybrid (discrete + continuous) networks
Discrete (Subsidy? and Buys?); continuous (Harvest and Cost)
Option 1: discretization (possibly large errors, large CPTs)
Option 2: finitely parameterized canonical families:
1) Continuous variable, discrete + continuous parents (e.g., Cost)
2) Discrete variable, continuous parents (e.g., Buys?)
Continuous child variables
Need one conditional density function for the child variable given continuous parents, for each possible assignment to discrete parents
Most common is the linear Gaussian model, e.g.:
P(Cost = c | Harvest = h, Subsidy? = true) = N(a_t h + b_t, σ_t)(c)
Mean Cost varies linearly with h, variance is fixed
Linear variation is unreasonable over the full range but works OK if the likely range of Harvest is narrow
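A linear Gaussian density is just a Gaussian whose mean is a linear function of the parent; the parameters a, b, sigma below are made up for illustration:

```python
# Linear Gaussian sketch: P(Cost = c | Harvest = h) with mean linear in h.
import math

def linear_gaussian(c, h, a=-1.0, b=10.0, sigma=1.0):
    mu = a * h + b                         # mean is linear in the parent
    return (1.0 / (sigma * math.sqrt(2 * math.pi))) * \
           math.exp(-((c - mu) ** 2) / (2 * sigma ** 2))

print(linear_gaussian(c=7.0, h=3.0))       # density at the mean: ~0.399
```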
Continuous child variables
All-continuous network with LG distributions ⇒ full joint distribution is a multivariate Gaussian
Discrete + continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values
Discrete variable w/ continuous parents
Probability of Buys? given Cost should be a "soft" threshold:
Probit distribution uses integral of Gaussian:
Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt
P(Buys? = true | Cost = c) = Φ((−c + μ)/σ)
Why the probit 1. It's sort of the right shape 2. Can view as hard threshold whose location is subject to noise
Discrete variable
Sigmoid (or logit) distribution also used in neural networks:
P(Buys? = true | Cost = c) = 1 / (1 + exp(−2(−c + μ)/σ))
Sigmoid has similar shape to probit but much longer tails
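The two soft thresholds are easy to compare numerically; mu and sigma below are illustrative assumptions:

```python
# Compare probit (Gaussian CDF) and logistic soft thresholds for
# P(Buys? = true | Cost = c).
import math

mu, sigma = 10.0, 2.0   # illustrative threshold location and width

def probit(c):
    """Integral of the standard Gaussian up to (mu - c) / sigma."""
    return 0.5 * (1.0 + math.erf((mu - c) / (sigma * math.sqrt(2))))

def sigmoid(c):
    """Logistic alternative: similar shape, longer tails."""
    return 1.0 / (1.0 + math.exp(-(mu - c) / sigma))

for c in (6.0, 10.0, 14.0):
    print(c, round(probit(c), 3), round(sigmoid(c), 3))
```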
Summary
Bayes nets provide a natural representation for (causally induced) conditional independence
Topology + CPTs = compact representation of joint distribution
Generally easy for (non)experts to construct
Canonical distributions (e.g., noisy-OR) = compact representation of CPTs
Continuous variables ⇒ parameterized distributions (e.g., linear Gaussian)