AI99 Tutorial 4



Bayesian AI Tutorial
AI99, Sydney, 6 December 1999
Ann E. Nicholson and Kevin B. Korb
School of Computer Science and Software Engineering
Monash University, Clayton, VIC 3168, AUSTRALIA

Overview
1. Introduction to Bayesian AI (20 min)
2. Bayesian networks (50 min)
   Break (10 min)
3. Applications (50 min)
   Break (10 min)
4. Learning Bayesian networks (50 min)
5. Current research issues (10 min)
6. Bayesian Net Lab (60 min: optional)
7. Dinner (optional)

fannn,[email protected]


Introduction to Bayesian AI
Reasoning under uncertainty
Probabilities
Alternative formalisms:
  Fuzzy logic
  MYCIN's certainty factors
  Default Logic
Bayesian philosophy:
  Dutch book arguments
  Bayes' Theorem
  Conditionalization
  Confirmation theory
Bayesian decision theory
Towards a Bayesian AI

Reasoning under uncertainty
Uncertainty: the quality or state of being not clearly known. This encompasses most of what we understand about the world, and most of what we would like our AI systems to understand. It distinguishes deductive knowledge (e.g., mathematics) from inductive belief (e.g., science).
Sources of uncertainty:
  Ignorance (which side of this coin is up?)
  Physical randomness (which side of this coin will land up?)
  Vagueness (which tribe am I closest to genetically? Picts? Angles? Saxons? Celts?)


Probabilities
The classic approach to reasoning under uncertainty (Blaise Pascal and Fermat).
Kolmogorov's Axioms:
1. $P(U) = 1$
2. $\forall X \subseteq U,\; P(X) \geq 0$
3. $\forall X, Y \subseteq U$, if $X \cap Y = \emptyset$ then $P(X \vee Y) = P(X) + P(Y)$
Conditional probability: $P(X \mid Y) = \frac{P(X \wedge Y)}{P(Y)}$
Independence: $X \perp Y$ iff $P(X \mid Y) = P(X)$

Fuzzy Logic
Designed to cope with vagueness: Is Fido a Labrador or a Shepherd?
Fuzzy set theory: $m(\mathrm{Fido} \in \mathrm{Labrador}) = m(\mathrm{Fido} \in \mathrm{Shepherd}) = 0.5$
Extended to fuzzy logic, which takes intermediate truth values: $T(\mathrm{Labrador}(\mathrm{Fido})) = 0.5$.
Combination rules:
$T(p \wedge q) = \min(T(p), T(q))$
$T(p \vee q) = \max(T(p), T(q))$
$T(\neg p) = 1 - T(p)$
Not suitable for coping with randomness or ignorance. Obviously not:
Uncertainty(inclement weather) = max(Uncertainty(rain), Uncertainty(hail), ...)


MYCIN's Certainty Factors
Uncertainty formalism developed for the early expert system MYCIN (Buchanan and Shortliffe, 1984). Elicit for (h, e):
measure of belief: $MB(h, e) \in [0, 1]$
measure of disbelief: $MD(h, e) \in [0, 1]$
$CF(h, e) = MB(h, e) - MD(h, e) \in [-1, 1]$
Special functions are provided for combining evidence.
Problems:
No semantics were ever given for belief/disbelief.
Heckerman (1986) proved that the restrictions required for a probabilistic semantics imply absurd independence assumptions.

Default Logic
Intended to reflect stereotypical reasoning under uncertainty (Reiter 1980). Example:
$\frac{Bird(Tweety) : Bird(x) \rightarrow Flies(x)}{Flies(Tweety)}$
Problems:
The best semantics for default rules are probabilistic (Pearl 1988, Korb 1995).
Mishandles combinations of low-probability events. E.g.,
$\frac{ApplyForJob(me) : ApplyForJob(x) \rightarrow Reject(x)}{Reject(me)}$
I.e., the dole always looks better than applying for a job!


Probability Theory
So, why not use probability theory to represent uncertainty? That's what it was invented for... dealing with physical randomness and degrees of ignorance. Furthermore, if you make bets which violate probability theory, you are subject to Dutch books:
A Dutch book is a sequence of fair bets which collectively guarantee a loss.
Fair bets are bets based upon the standard odds-probability relation:
$O(h) = \frac{P(h)}{1 - P(h)} \qquad P(h) = \frac{O(h)}{1 + O(h)}$

A Dutch Book
Payoff table on a bet for h (Odds = p/(1-p); S = betting unit):

h | Payoff
T | $(1-p)S
F | -$pS

Given a fair bet, the expected value from such a payoff is always $0.
Now, let's violate the probability axioms. Example: say P(A) = -0.1 (violating A2).
Payoff table against A (the inverse of a bet for A), with S = 1:

¬A | Payoff
T  | $pS = -$0.10
F  | -$(1-p)S = -$1.10
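The arithmetic behind the Dutch book can be checked directly. The short Python sketch below (the helper names are mine, not from the tutorial) computes the payoffs of "fair" bets at the stated odds and shows that a bettor who assigns P(A) = -0.1 loses money however A turns out.

```python
# Sketch: payoffs of a "fair" bet at probability p with stake unit S.
# A bet *for* h pays (1 - p) * S if h is true and -p * S if h is false,
# so its expected value p*(1-p)*S + (1-p)*(-p*S) is 0 for a genuine probability p.

def payoff_for(p, S=1.0):
    """Payoffs (h true, h false) of a bet for h at odds p/(1-p)."""
    return (1 - p) * S, -p * S

def payoff_against(p, S=1.0):
    """Payoffs (h true, h false) of the inverse bet, against h."""
    win, lose = payoff_for(p, S)
    return -win, -lose

p = -0.1                                   # incoherent "probability", violating axiom A2
true_case, false_case = payoff_against(p, S=1.0)
print(true_case, false_case)               # -1.1 and -0.1: a guaranteed loss either way
```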


Bayes' Theorem; Conditionalization
Due to Reverend Thomas Bayes (1764):
$P(h \mid e) = \frac{P(e \mid h)\, P(h)}{P(e)}$
Conditionalization: $P'(h) = P(h \mid e)$
Or, read Bayes' theorem as:
$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Prob of evidence}}$
Assumptions:
1. Joint priors over $\{h_i\}$ and $e$ exist.
2. Total evidence: $e$, and only $e$, is learned.

Bayesian Decision Theory
Frank Ramsey (1931). Decision making under uncertainty: what action to take (plan to adopt) when the future state of the world is not known.
Bayesian answer: find the utility of each possible outcome (action-state pair) and take the action that maximizes expected utility.
Example:

Action          | Rain (p = .4) | Shine (1 - p = .6)
Take umbrella   | 30            | 10
Leave umbrella  | -100          | 50

Expected utilities:
E(Take umbrella) = (30)(.4) + (10)(.6) = 18
E(Leave umbrella) = (-100)(.4) + (50)(.6) = -10
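As a concrete illustration of both slides, the sketch below (plain Python, names are my own) applies Bayes' theorem to a finite hypothesis space and recomputes the umbrella expected utilities from the table above.

```python
# Bayes' theorem for a finite hypothesis space: posterior ∝ likelihood × prior.
def posterior(priors, likelihoods):
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())          # P(e) = Σ_h P(e|h) P(h)
    return {h: joint[h] / evidence for h in joint}

# Expected utility of an action: Σ_s P(s) · U(action, s).
def expected_utility(utilities, p_state):
    return sum(p_state[s] * u for s, u in utilities.items())

# Umbrella example from the table above.
p_state = {"rain": 0.4, "shine": 0.6}
print(expected_utility({"rain": 30, "shine": 10}, p_state))    # 18.0  (take umbrella)
print(expected_utility({"rain": -100, "shine": 50}, p_state))  # -10.0 (leave umbrella)
```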


Bayesian AI
A Bayesian conception of an AI is: an autonomous agent which
has a utility structure (preferences);
can learn about its world and the relation between its actions and future states (probabilities);
maximizes its expected utility.
The techniques used in learning about the world are (primarily) statistical... hence Bayesian data mining.

Bayesian Networks: Overview
Syntax
Semantics
Evaluation methods
Influence diagrams (Decision Networks)
Dynamic Bayesian Networks


Bayesian Networks
A data structure which represents the dependence between variables and gives a concise specification of the joint probability distribution.
A Bayesian network is a graph in which the following holds:
1. A set of random variables makes up the nodes in the network.
2. A set of directed links or arrows connects pairs of nodes.
3. Each node has a conditional probability table that quantifies the effects the parents have on the node.
4. The graph is directed and acyclic (a DAG), i.e. it has no directed cycles.

Example: Earthquake (Pearl; R&N)
You have a new burglar alarm installed. It is reliable at detecting burglary, but also responds to minor earthquakes. Two neighbours (John, Mary) promise to call you at work when they hear the alarm. John always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm (and calls then also). Mary likes loud music and sometimes misses the alarm. Given evidence about who has and hasn't called, estimate the probability of a burglary.


Earthquake Example: Notes
Assumptions: John and Mary don't perceive burglary directly, and they do not feel minor earthquakes.
Note: there is no information about loud music, or about the telephone ringing and confusing John. This is summarised in the uncertainty in the links from Alarm to JohnCalls and MaryCalls.
Once the topology is specified, we need to specify a conditional probability table (CPT) for each node:
Each row contains the conditional probability of each node value for a conditioning case.
Each row must sum to 1.
A table for a Boolean variable with n Boolean parents contains $2^{n+1}$ probabilities.
A node with no parents has one row (the prior probabilities).

Earthquake Example: Network Structure
[Figure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls.]
P(B) = 0.001    P(E) = 0.002

B E | P(A|B,E)
T T | 0.95
T F | 0.94
F T | 0.29
F F | 0.001

A | P(J|A)
T | 0.90
F | 0.05

A | P(M|A)
T | 0.70
F | 0.01


Semantics of Bayesian Networks
A (more compact) representation of the joint probability distribution:
  helpful in understanding how to construct the network.
An encoding of a collection of conditional independence statements:
  helpful in understanding how to design inference procedures.

Representing the joint probability distribution
$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(x_1, x_2, \ldots, x_n)$
$= P(x_1)\, P(x_2 \mid x_1) \cdots P(x_n \mid x_1 \wedge \ldots \wedge x_{n-1})$
$= \prod_i P(x_i \mid x_1 \wedge \ldots \wedge x_{i-1})$
$= \prod_i P(x_i \mid \pi(X_i))$

Example:
$P(J \wedge M \wedge A \wedge \neg B \wedge \neg E)$
$= P(J \mid A)\, P(M \mid A)\, P(A \mid \neg B \wedge \neg E)\, P(\neg B)\, P(\neg E)$
$= 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \approx 0.00063$
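A minimal sketch of this factorized computation in Python (the CPT encoding and variable names are mine, taken from the earthquake tables above):

```python
# CPTs of the earthquake network, indexed by parent values (True/False).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J | A)
P_M = {True: 0.70, False: 0.01}                      # P(M | A)

def joint(j, m, a, b, e):
    """P(J=j, M=m, A=a, B=b, E=e) via the chain-rule factorization."""
    def val(p, x):                 # P(X=x) given P(X=True) = p
        return p if x else 1 - p
    return (val(P_J[a], j) * val(P_M[a], m) * val(P_A[(b, e)], a)
            * val(P_B, b) * val(P_E, e))

# John and Mary call, the alarm sounds, but there is no burglary or earthquake:
print(joint(True, True, True, False, False))   # ≈ 0.00063
```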


Network Construction
1. Choose the set of relevant variables $X_i$ that describe the domain.
2. Choose an ordering for the variables.
3. While there are variables left:
   (a) Pick a variable $X_i$ and add a node to the network for it.
   (b) Set $\pi(X_i)$ to some minimal set of nodes already in the net such that the conditional independence property is satisfied: $P(X_i \mid X_{i-1}, \ldots, X_1) = P(X_i \mid \pi(X_i))$
   (c) Define the CPT for $X_i$.

Compactness and Node Ordering
The compactness of a BN is an example of a locally structured (or sparse) system. The correct order in which to add nodes is to add the root causes first, then the variables they influence, and so on until the leaves are reached.
Examples of wrong orderings (which still represent the same joint distribution):
1. MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.
[Figure: the network built with this ordering over MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.]


Compactness and Node Ordering (cont.)
2. MaryCalls, JohnCalls, Earthquake, Burglary, Alarm.
[Figure: the network built with this ordering over MaryCalls, JohnCalls, Earthquake, Burglary, Alarm.]
More probabilities than the full joint! See below for why.

Conditional Independence: Causal Chains
Causal chains give rise to conditional independence:
A → B → C
$P(C \mid A \wedge B) = P(C \mid B)$
Example: A = Jack's flu, B = severe cough, C = Jill's flu.


Conditional Independence: Common Causes
Common causes (or ancestors) also give rise to conditional independence:
A ← B → C
$P(C \mid A \wedge B) = P(C \mid B)$
Example: A = Jack's flu, B = Joe's flu, C = Jill's flu.

Conditional Dependence: Common Effects
Common effects (or their descendants) give rise to conditional dependence:
A → B ← C
$P(A \mid C \wedge B) \neq P(A \mid B)$
Example: A = flu, B = severe cough, C = tuberculosis.
Given a severe cough, flu explains away tuberculosis.


D-separation
A graph-theoretic criterion of conditional independence. We can determine whether a set of nodes X is independent of another set Y, given a set of evidence nodes E, i.e., $X \perp Y \mid E$.
Earthquake example: [Figure: the Burglary, Earthquake, Alarm, JohnCalls, MaryCalls network.]

Causal Ordering
Why does variable order affect network density? Because:
Using the causal order allows direct representation of conditional independencies.
Violating the causal order requires new arcs to re-establish conditional independencies.


Causal Ordering (cont'd)
[Figure: Flu → Cough ← TB.] Flu and TB are marginally independent.
Given the ordering Cough, Flu, TB: [Figure: Cough → Flu, Cough → TB.] The marginal independence of Flu and TB must be re-established by adding Flu → TB or Flu ← TB.

Inference in Bayesian Networks
Basic task for any probabilistic inference system: compute the posterior probability distribution for a set of query variables, given values for some evidence variables. Also called Belief Updating.
Types of inference: [Figure: diagnostic, causal, intercausal (explaining away) and mixed inference, showing query (Q) and evidence (E) nodes.]


Kinds of Inference
Diagnostic inferences: from effects to causes. P(Burglary | JohnCalls)
Causal inferences: from causes to effects. P(JohnCalls | Burglary), P(MaryCalls | Burglary)
Intercausal inferences: between causes of a common effect. P(Burglary | Alarm), P(Burglary | Alarm ∧ Earthquake)
Mixed inference: combining two or more of the above. P(Alarm | JohnCalls ∧ ¬Earthquake), P(Burglary | JohnCalls ∧ ¬Earthquake)

Inference Algorithms: Overview
Exact inference:
  Trees and polytrees: message-passing algorithm.
  Multiply-connected networks: clustering.
Approximate inference:
  Large, complex networks: stochastic simulation.
  Other approximation methods.
In the general case, both sorts of inference are computationally complex (NP-hard).
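To make the inference task concrete, here is a brute-force enumeration sketch (not the message-passing or clustering algorithms discussed in the tutorial) that computes a diagnostic query such as P(Burglary | JohnCalls, MaryCalls) for the earthquake network; all Python names are mine.

```python
from itertools import product

# Earthquake-network CPTs (as in the tables above).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J | A)
P_M = {True: 0.70, False: 0.01}                      # P(M | A)
VARS = ["B", "E", "A", "J", "M"]

def joint(a):
    """P(B, E, A, J, M) for a full truth assignment (dict of bools)."""
    v = lambda p, x: p if x else 1 - p
    return (v(P_B, a["B"]) * v(P_E, a["E"]) * v(P_A[(a["B"], a["E"])], a["A"])
            * v(P_J[a["A"]], a["J"]) * v(P_M[a["A"]], a["M"]))

def query(var, evidence):
    """P(var=True | evidence) by summing the joint over the hidden variables."""
    hidden = [x for x in VARS if x != var and x not in evidence]
    totals = {}
    for value in (True, False):
        totals[value] = sum(
            joint({**evidence, **dict(zip(hidden, combo)), var: value})
            for combo in product([True, False], repeat=len(hidden)))
    return totals[True] / (totals[True] + totals[False])

print(query("B", {"J": True, "M": True}))   # ≈ 0.284 (diagnostic inference)
```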


Message Passing Example
[Figure: the earthquake network extended with a PhoneRings parent of JohnCalls (P(B) = 0.001, P(E) = 0.002, P(Ph) = 0.05; P(J | Ph, A) = 0.95, 0.5, 0.90, 0.01 for the four parent cases; P(M | A) = 0.70, 0.01), annotated with the π and λ messages, e.g. bel(B) = (.001, .999), bel(E) = (.002, .998), bel(Ph) = (.05, .95), with evidence λ(J) = (1, 1), λ(M) = (1, 0).]

Inference in multiply connected networks
Networks where two nodes are connected by more than one path:
Two or more possible causes share a common ancestor, or
one variable can influence another through more than one causal mechanism.
Example: the Cancer network.
[Figure: A = Metastatic cancer, B = Increased total serum calcium, C = Brain tumour, D = Coma, E = Severe headaches; arcs A → B, A → C, B → D, C → D, C → E.]
Message passing doesn't work here: evidence gets counted twice.


Clustering methods
Transform the network into a probabilistically equivalent polytree by merging (clustering) the offending nodes.
Cancer example: new node Z combining B and C.
[Figure: A → Z = B,C; Z → D; Z → E.]
$P(z \mid a) = P(b, c \mid a) = P(b \mid a)\, P(c \mid a)$
$P(e \mid z) = P(e \mid b, c) = P(e \mid c)$
$P(d \mid z) = P(d \mid b, c)$

Clustering methods (cont.)
The Jensen join-tree version (Jensen, 1996) is currently the most efficient algorithm in this class (e.g. used in Hugin, Netica). Network evaluation is done in two stages:
Compile into a join-tree: may be slow; may require too much memory if the original network is highly connected.
Do belief updating in the join-tree (usually fast).
Caveat: clustered nodes have increased complexity; updates may be computationally complex.


Approximate inference with stochastic simulation
Use the network to generate a large number of cases that are consistent with the network distribution.
Evaluation may not converge to exact values (in reasonable time).
Usually converges to close to the exact solution quickly if the evidence is not too unlikely.
Performs better when evidence is nearer to root nodes; however, in real domains, evidence tends to be near leaves (Nicholson & Jitnah, 1998).
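A minimal illustration of the idea, using simple rejection sampling on the earthquake network (likelihood weighting, as used in several of the systems cited later, weights samples instead of discarding them); all names are my own.

```python
import random

# Earthquake-network CPTs (as in the tables above).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def sample_case(rng):
    """Draw one full case by sampling each node given its sampled parents."""
    b = rng.random() < P_B
    e = rng.random() < P_E
    a = rng.random() < P_A[(b, e)]
    j = rng.random() < P_J[a]
    m = rng.random() < P_M[a]
    return {"B": b, "E": e, "A": a, "J": j, "M": m}

def estimate(query, evidence, n=200_000, seed=0):
    """Estimate P(query=True | evidence) by keeping only matching samples."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n):
        case = sample_case(rng)
        if all(case[v] == val for v, val in evidence.items()):
            kept += 1
            hits += case[query]
    return hits / kept if kept else float("nan")

# Diagnostic query; converges slowly because J=M=True is very unlikely evidence.
print(estimate("B", {"J": True, "M": True}))
```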

Making Decisions
Bayesian networks can be extended to support decision making:
Preferences between different outcomes of various plans: utility theory.
Decision theory = utility theory + probability theory.


Decision Networks
A decision network represents information about:
the agent's current state;
its possible actions;
the state that will result from the agent's action;
the utility of that state.
Also called Influence Diagrams (Howard & Matheson, 1981).

Types of Nodes
Chance nodes (ovals): represent random variables (as in Bayesian networks). Each has an associated CPT; parents can be decision nodes and other chance nodes.
Decision nodes (rectangles): represent points where the decision maker has a choice of actions.
Utility nodes (diamonds): represent the agent's utility function (also called value nodes in the literature). Parents are the variables describing the outcome state that directly affect utility. Each has an associated table representing a multi-attribute utility function.


Example: Umbrella
[Figure: decision network with chance nodes Weather and Forecast (Weather → Forecast), decision node Take Umbrella, and utility node U with parents Weather and Take Umbrella.]
P(Weather = Rain) = 0.3
P(Forecast = Rainy | Weather = Rain) = 0.60
P(Forecast = Cloudy | Weather = Rain) = 0.25
P(Forecast = Sunny | Weather = Rain) = 0.15
P(Forecast = Rainy | Weather = NoRain) = 0.1
P(Forecast = Cloudy | Weather = NoRain) = 0.2
P(Forecast = Sunny | Weather = NoRain) = 0.7
U(NoRain, TakeUmbrella) = 20
U(NoRain, LeaveAtHome) = 100
U(Rain, TakeUmbrella) = 70
U(Rain, LeaveAtHome) = 0

Evaluating Decision Networks: Algorithm
1. Set the evidence variables for the current state.
2. For each possible value of the decision node:
   (a) Set the decision node to that value.
   (b) Calculate the posterior probabilities for the parent nodes of the utility node (as for BNs).
   (c) Calculate the resulting (expected) utility for the action.
3. Return the action with the highest expected utility.
This is simple for a single decision, less so when executing several actions in sequence (i.e. a plan). A sketch of this evaluation for the umbrella network follows.
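A minimal sketch of the evaluation algorithm applied to the umbrella network (Python, my own encoding of the tables above):

```python
# Umbrella decision network: prior, sensor model, and utility table from above.
P_RAIN = 0.3
P_FORECAST = {"Rain":   {"Rainy": 0.60, "Cloudy": 0.25, "Sunny": 0.15},
              "NoRain": {"Rainy": 0.10, "Cloudy": 0.20, "Sunny": 0.70}}
UTILITY = {("NoRain", "TakeUmbrella"): 20, ("NoRain", "LeaveAtHome"): 100,
           ("Rain",   "TakeUmbrella"): 70, ("Rain",   "LeaveAtHome"): 0}

def posterior_weather(forecast):
    """P(Weather | Forecast) by Bayes' theorem (step 2b of the algorithm)."""
    joint = {"Rain":   P_RAIN * P_FORECAST["Rain"][forecast],
             "NoRain": (1 - P_RAIN) * P_FORECAST["NoRain"][forecast]}
    z = sum(joint.values())
    return {w: p / z for w, p in joint.items()}

def best_action(forecast):
    """Steps 2-3: expected utility of each decision; return the maximizer."""
    post = posterior_weather(forecast)
    eu = {d: sum(post[w] * UTILITY[(w, d)] for w in post)
          for d in ("TakeUmbrella", "LeaveAtHome")}
    return max(eu, key=eu.get), eu

print(best_action("Rainy"))   # taking the umbrella wins when rain is forecast
print(best_action("Sunny"))   # leaving it at home wins for a sunny forecast
```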


Dynamic Belief Networks
[Figure: a DBN unrolled over time slices t-2 ... t+2, with a state evolution model (State_{t-1} → State_t) and a sensor model (State_t → Obs_t).]
The values of the state variables at time t depend only on the values at t-1. We can calculate distributions for S_{t+1} and further: probabilistic projection. This can be done using standard BN updating algorithms. This type of DBN gets very large very quickly; usually only two time slices of the network are kept.

Dynamic Decision Networks
Similarly, decision networks can be extended to include temporal aspects. The sequence of decisions taken = a plan.
[Figure: a dynamic decision network with decision nodes D_t ... D_{t+3}, state nodes State_t ... State_{t+3}, observation nodes Obs_t ... Obs_{t+3}, and a utility node U_{t+3}.]
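A minimal sketch of the two-slice idea: roll the belief state forward one step with the state evolution model, then condition on the new observation via the sensor model. The numbers and names below are illustrative only, not from the tutorial.

```python
# One step of DBN filtering for a single Boolean state variable.
TRANSITION = {True: 0.7, False: 0.3}    # P(X_{t+1}=True | X_t)   (illustrative)
SENSOR     = {True: 0.9, False: 0.2}    # P(obs=True | X_{t+1})   (illustrative)

def filter_step(belief_true, obs):
    """belief_true = P(X_t=True | obs_{1:t}); returns P(X_{t+1}=True | obs_{1:t+1})."""
    predicted = (belief_true * TRANSITION[True]
                 + (1 - belief_true) * TRANSITION[False])
    w_true = (SENSOR[True] if obs else 1 - SENSOR[True]) * predicted
    w_false = (SENSOR[False] if obs else 1 - SENSOR[False]) * (1 - predicted)
    return w_true / (w_true + w_false)

belief = 0.5
for obs in [True, True, False]:
    belief = filter_step(belief, obs)    # only two time slices ever kept in memory
    print(round(belief, 3))
```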


Uses of Bayesian Networks
1. Calculating the belief in query variables given values for evidence variables (above).
2. Predicting values of dependent variables given values for independent variables.
3. Decision making based on probabilities in the network and on the agent's utilities (Influence Diagrams [Howard and Matheson 1981]).
4. Deciding which additional evidence should be observed in order to gain useful information.
5. Sensitivity analysis to test the impact of changes in probabilities or utilities on decisions.

Bayesian Networks: Summary
Bayes' rule allows unknown probabilities to be computed from known ones.
Conditional independence (due to causal relationships) allows efficient updating.
Bayesian networks are a natural way to represent conditional independence information: links between nodes capture the qualitative aspects; conditional probability tables capture the quantitative aspects.
Inference means computing the probability distribution for a set of query variables, given a set of evidence variables.
Inference in Bayesian networks is very flexible: evidence can be entered about any node while beliefs in any other nodes are updated.
The speed of inference in practice depends on the structure of the network: how many loops; numbers of parents; location of evidence and query nodes.
Bayesian networks can be extended with decision nodes and utility nodes to support decision making: Decision Networks or Influence Diagrams.
Bayesian and decision networks can be extended to allow explicit reasoning about changes over time.

Applications: Overview
(Simple) Example Networks
Applications:
  Medical Decision Making: survey of applications
  Planning and Plan Recognition
  Natural Language Generation (NAG)
  Bayesian poker
Deployed Bayesian Networks (see handout for details)
BN Software
Web Resources


Example: Cancer
Metastatic cancer is a possible cause of a brain tumour and is also an explanation for increased total serum calcium. In turn, either of these could explain a patient falling into a coma. Severe headache is also possibly associated with a brain tumour. (Example from Pearl, 1988.)
[Figure: A = Metastatic cancer, B = Increased total serum calcium, C = Brain tumour, D = Coma, E = Severe headaches; arcs A → B, A → C, B → D, C → D, C → E.]
P(a) = 0.2
P(b | a) = 0.80      P(b | ¬a) = 0.20
P(c | a) = 0.20      P(c | ¬a) = 0.05
P(d | b, c) = 0.80   P(d | ¬b, c) = 0.80
P(d | b, ¬c) = 0.80  P(d | ¬b, ¬c) = 0.05
P(e | c) = 0.80      P(e | ¬c) = 0.60

Example: Asia
A patient presents to a doctor with shortness of breath. The doctor considers that the possible causes are tuberculosis, lung cancer and bronchitis. Additional relevant information includes whether the patient has recently visited Asia (where tuberculosis is more prevalent) and whether or not the patient is a smoker (which increases the chances of cancer and bronchitis). A positive X-ray would indicate either TB or lung cancer. (Example from Lauritzen, 1988.)
[Figure: visit to Asia → tuberculosis; smoking → lung cancer, bronchitis; tuberculosis, lung cancer → "either tub or lung cancer" → positive X-ray, dyspnoea; bronchitis → dyspnoea.]


Example: A Lecturer's Life
Dr. Ann Nicholson spends 60% of her work time in her office. The rest of her work time is spent elsewhere. When Ann is in her office, half the time her light is off (when she is trying to hide from students and get some real work done). When she is not in her office, she leaves her light on only 5% of the time. 80% of the time she is in her office, Ann is logged onto the computer. Because she sometimes logs onto the computer from home, 10% of the time she is not in her office, she is still logged onto the computer. Suppose a student checks Dr. Nicholson's login status and sees that she is logged on. What effect does this have on the student's belief that Dr. Nicholson's light is on? (Example from Nicholson, 1999.)
[Figure: in-office → lights-on, in-office → logged-on.]

Probabilistic reasoning in medicine
See handout from (Dean et al., 1993).
Simplest tree-structured network for diagnostic reasoning: H = disease hypothesis; F = findings (symptoms, test results).
Multiply-connected network (QMR structure): B = background information (e.g. age, sex of patient).
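The lecturer's-life question can be answered with two applications of the rules above; a small sketch (my own encoding of the numbers in the story):

```python
# Lecturer's-life network: Office -> LightsOn, Office -> LoggedOn.
P_OFFICE = 0.6
P_LIGHT  = {True: 0.5, False: 0.05}   # P(lights-on | in-office)
P_LOGGED = {True: 0.8, False: 0.10}   # P(logged-on | in-office)

# P(in-office | logged-on) by Bayes' theorem.
num = P_OFFICE * P_LOGGED[True]
p_office_given_logged = num / (num + (1 - P_OFFICE) * P_LOGGED[False])

# P(lights-on | logged-on): sum out the office variable.
p_light_given_logged = (p_office_given_logged * P_LIGHT[True]
                        + (1 - p_office_given_logged) * P_LIGHT[False])

print(round(p_office_given_logged, 3))  # ≈ 0.923
print(round(p_light_given_logged, 3))   # ≈ 0.465, up from the prior 0.32
```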


Medical Applications
Pathfinder case study: see handout using material from (Russell & Norvig, 1995, pp. 457-458).
QMR (Quick Medical Reference): 600 diseases, 4,000 findings, 40,000 arcs (Dean & Wellman, 1991).
MUNIN (Andreassen et al., 1989): neuromuscular disorders, about 1000 nodes; exact computation < 5 seconds.
Glucose prediction and insulin dose adjustment (DBN application) (Andreassen et al., 1991).
CPCS project (Pradham et al., 1994): 448 nodes, 906 links, 8254 conditional probability values; the LW algorithm gave answers in 35 minutes (1994).


Application of LW to medical diagnosis (Shwe & Cooper, 1990).
Forecasting sleep apnea (Dagum et al., 1993).
ALARM (Beinlich et al., 1989): 37 nodes, 42 arcs. (See Netica examples.)
[Figure: the ALARM monitoring network, showing nodes such as MinVolSet, VentMach, Disconnect, Intubation, VentLung, ArtCO2, SaO2, Catechol, HR, CO and BP.]

Robot Navigation and Tracking
Example of a Dynamic Decision Network (Dean & Wellman, 1991).


Plan Recognition Applications
Keyhole plan recognition in an Adventure game (Albrecht et al., 1998).
[Figure: four DBN structures over action (A), quest (Q) and location (L) variables: (a) mainModel, (b) indepModel, (c) actionModel, (d) locationModel.]
Traffic plan recognition (Pynadath & Wellman, 1995).

Traffic Monitoring: BATmobile
(Forbes et al., 1995.) Example of a DBN.


Natural Language Generation
NAG (McConachy et al., 1999), A Nice Argument Generator, uses two Bayesian networks to generate and assess natural language arguments:
Normative Model: represents our best understanding of the domain; proper (constrained) Bayesian updating, given premises.
User Model: represents our best understanding of the human; Bayesian updating modified to reflect human biases (e.g., overconfidence; Korb, McConachy, Zukerman, 1997).
The BNs are embedded in a semantic hierarchy.
[Figure: a two-layer semantic network (higher-level concepts like motivation or ability; lower-level concepts like Grade Point Average) sitting above a Bayesian network of propositions, e.g. "publications authored by person X cited >5 times"; the hierarchy supports attentional modeling and constrained updating.]


Bayesian Poker
(Korb et al., 1999.) Poker is ideal for testing automated reasoning under uncertainty:
Physical randomisation.
Incomplete hand information.
Incomplete opponent information (strategies, bluffing, etc.).
Bayesian networks are a good representation for complex game playing. Our Bayesian Poker Player (BPP) plays 5-card stud poker at the level of a good amateur human player.
To play: telnet indy13.cs.monash.edu.au, login: poker, password: maverick.

Bayesian Poker BN
The Bayesian network provides an estimate of winning at any point in the hand. Betting curves based on pot-odds are used to determine the action (bet/call, pass or fold).
[Figure: the BPP network with nodes BPP Win, OPP Final, BPP Final, OPP Current, BPP Current, OPP Action and OPP Upcards, linked by the conditional probability matrices M_{C|F}, M_{A|C} and M_{U|C}.]


Bayesian Poker BN (cont.)
Different networks (matrices) are used for each round.
OPP Current, BPP Current: (partial) hand types with the cards dealt so far.
OPP Final, BPP Final: hand types after all 5 cards are dealt.
Observation nodes:
  OPP Upcards: all of the opponent's cards except the first are visible to BPP.
  OPP Action: BPP knows the opponent's action.

Bayesian Poker BN (cont.)
Hand Types
The initial 9 hand types are too coarse. We use a finer granularity for the most common hands (busted and a pair): low, medium, Q-high, K-high, A-high. This results in 17 hand types.
Conditional Probability Matrices
$M_{A|C}$: probability of the opponent's action given current hand type, learned from observed showdown data.
$M_{U|C}$ and $M_{C|F}$: estimated by dealing out $10^7$ poker hands.
Belief Updating: since the network is a polytree, the simple fast propagation updating algorithm is used.


Current Status, Possible Extensions
BPP outperforms automated opponents, is fairly even with average amateur humans, and loses to experienced humans. Learning the OPP Action CPTs does not (yet) appear to improve performance.
BN improvements:
  Refine action nodes
  Further refinement of hand types
  Improve network structure
Adding bluffing to the opponent model
Improved learning of the opponent model
More complex poker: multi-opponent games, table-stake games
DBN model to represent changes over time

Deployed BNs
From the Web Site database (see handout for details):
TRACS: predicting reliability of military vehicles.
Andes: intelligent tutoring system for physics.
Distributed Virtual Agents advising online users on web sites.
Information extraction from natural language text.
DXPLAIN: decision support for medical diagnosis.
Illiad: teaching tool for medical students.
Microsoft Health Product: "find by symptom" feature.


Weapons scheduling.
Monitoring power generation.
Processor fault diagnosis.
Knowledge Industries applications: (a) in medicine: sleep disorders, pathology, trauma care, hand and wrist evaluations, dermatology, and home-based health evaluations; (b) in capital equipment: locomotives, gas-turbine engines for aircraft and land-based power production, the space shuttle, and office equipment.
Software debugging.
Vista: decision support system used at NASA Mission Control Center.
MS: (a) Answer Wizard (Office 95), information retrieval; (b) Print Troubleshooter; (c) Aladdin, troubleshooting customer support.

BN Software: Issues
Functionality: especially application vs API.
Price: many free for demo versions or educational use; commercial licence costs.
Availability (platforms).
Quality: GUI; documentation and help; leading edge; robustness; software company.


BN Software
Analytica: www.lumina.com
Hugin: www.hugin.com
Netica: www.norsys.com
The above 3 are available during the tutorial lab session.
JavaBayes: http://www.cs.cmu.edu/~javabayes/Home/
Many other packages (see next slide).

Web Resources
Bayesian Belief Network site (Russell Greiner): www.cs.ualberta.ca/~greiner/bn.html
Bayesian Network Repository (Nir Friedman): www-nt.cs.berkeley.edu/home/nir/public_html/Repository/index.htm
Summary of BN software and links to software sites (Kevin Murphy): HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bnsoft.html


Applications: Summary
Various BN structures are available to compactly and accurately represent certain types of domain features.
Bayesian networks have been used for a wide range of AI applications.
Robust and easy-to-use Bayesian network software is now readily available.

Learning Bayesian Networks
Linear and Discrete Models
Learning Network Parameters: Linear Coefficients; Learning Probability Tables
Learning Causal Structure: Conditional Independence Learning; Statistical Equivalence; TETRAD II
Bayesian Learning of Bayesian Networks: Cooper & Herskovits: K2; Learning Variable Order; Statistical Equivalence Learners; Full Causal Learners
Minimum Encoding Methods: Lam & Bacchus's MDL learner; MML metrics; MML search algorithms; MML Sampling
Empirical Results


Linear and Discrete Models
Linear models: used in biology & social sciences since Sewall Wright (1921). Linear models represent causal relationships as sets of linear functions of independent variables.
[Figure: X1 → X3 ← X2.]
Equivalently (assuming linear parameters):
$X_3 = a_{13} X_1 + a_{23} X_2 + \epsilon_1$
Discrete models: Bayesian nets replace the vectors of linear coefficients with CPTs.

Learning Linear Parameters
Maximum likelihood methods have been available since Wright's path model analysis (1921). Equivalent methods:
Simon-Blalock method (Simon, 1954; Blalock, 1964)
Ordinary least squares multiple regression (OLS)
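A minimal sketch of parameter learning for this linear model: generate data from known coefficients and recover $a_{13}, a_{23}$ by ordinary least squares (numpy; the "true" coefficients and noise levels are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulate the model X1 -> X3 <- X2 with assumed coefficients 0.8 and -0.5.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.8 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

# OLS estimate of the linear (path) coefficients.
X = np.column_stack([x1, x2])
coef, *_ = np.linalg.lstsq(X, x3, rcond=None)
print(coef)   # ≈ [0.8, -0.5]
```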


Learning Conditional Probability Tables
Spiegelhalter & Lauritzen (1990): assume parameter independence; each CPT cell i is a parameter in a Dirichlet distribution, for K parents:
$D[\alpha_1, \ldots, \alpha_i, \ldots, \alpha_K]$
The probability of outcome i is
$\frac{\alpha_i}{\sum_{k=1}^{K} \alpha_k}$
After observing outcome i, update D to
$D[\alpha_1, \ldots, \alpha_i + 1, \ldots, \alpha_K]$
Others are looking at learning without parameter independence, e.g.:
Decision trees to learn structure within CPTs (Boutillier et al. 1996).
Dual log-linear and full CPT models (Neil, Wallace, Korb 1999).
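A small sketch of this update rule (the class and function names are mine), tracking one Dirichlet per parent configuration:

```python
from collections import defaultdict

class DirichletCPT:
    """One Dirichlet parameter vector per parent configuration (uniform prior)."""
    def __init__(self, n_values, prior=1.0):
        self.n_values = n_values
        self.alpha = defaultdict(lambda: [prior] * n_values)

    def prob(self, parent_config, i):
        a = self.alpha[parent_config]
        return a[i] / sum(a)                 # alpha_i / sum_k alpha_k

    def observe(self, parent_config, i):
        self.alpha[parent_config][i] += 1    # D[..., alpha_i + 1, ...]

# Example: P(Alarm | Burglary, Earthquake) learned from (b, e, a) cases.
cpt = DirichletCPT(n_values=2)
for b, e, a in [(False, False, False)] * 98 + [(True, False, True), (False, True, True)]:
    cpt.observe((b, e), int(a))

print(cpt.prob((False, False), 1))   # estimate of P(A=true | ¬B, ¬E)
```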


Learning Causal Structure
This is the real problem; parameterizing models is essentially numerical computing. There are two basic methods:
Learning from conditional independencies (CI learning)
Learning using a scoring metric (Metric learning)
CI learning (Verma and Pearl, 1991): suppose you have an Oracle who can answer yes or no to any question of the type $X \perp Y \mid S$? Then you can learn the correct causal model, up to statistical equivalence.

Statistical Equivalence
Verma and Pearl's rules identify the set of causal models which are statistically equivalent.
Two causal models H1 and H2 are statistically equivalent iff they contain the same variables and joint samples over them provide no statistical grounds for preferring one over the other.
Examples:
All fully connected models are equivalent.
A → B → C and A ← B ← C and A ← B → C.
A → B → D ← C and A ← B → D ← C.


Statistical Equivalence
Chickering (1995): any two causal models over the same variables which have the same skeleton (undirected arcs) and the same directed v-structures are statistically equivalent.
If H1 and H2 are statistically equivalent, then they have the same maximum likelihoods relative to any joint samples:
$\max_{\theta_1} P(e \mid H_1, \theta_1) = \max_{\theta_2} P(e \mid H_2, \theta_2)$
where $\theta_i$ is a parameterization of $H_i$.

TETRAD II
Spirtes, Glymour and Scheines (1993). Replace the Oracle with statistical tests:
for linear models, a significance test on partial correlation:
$X \perp Y \mid S \iff \rho_{XY \cdot S} = 0$
for discrete models, a $\chi^2$ test on the difference between the CPT counts expected with independence ($E_i$) and observed ($O_i$):
$X \perp Y \mid S \iff \sum_i O_i \ln \frac{O_i}{E_i} \approx 0$
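A rough sketch of the linear-model test: compute the partial correlation via regression residuals and apply Fisher's z transform against a 5% critical value. This is purely illustrative (my own functions and thresholds), not the TETRAD II implementation.

```python
import numpy as np

def partial_corr(x, y, S):
    """Correlation of x and y after regressing both on the columns of S."""
    if S.shape[1] == 0:
        rx, ry = x, y
    else:
        A = np.column_stack([S, np.ones(len(x))])
        rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
        ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

def independent(x, y, S, crit=1.96):
    """Approximate test of X ⊥ Y | S via Fisher's z transform."""
    r = partial_corr(x, y, S)
    n, k = len(x), S.shape[1]
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - k - 3)
    return abs(z) < crit

# Chain A -> B -> C: A and C are dependent, but independent given B.
rng = np.random.default_rng(1)
a = rng.normal(size=2000)
b = 0.9 * a + rng.normal(scale=0.5, size=2000)
c = 0.9 * b + rng.normal(scale=0.5, size=2000)
print(independent(a, c, np.empty((2000, 0))))   # False: marginally dependent
print(independent(a, c, b.reshape(-1, 1)))      # typically True: screened off by B
```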


TETRAD II (cont.)
Asymptotically finds causal structure to within the statistical equivalence class of the true model.
Requires larger sample sizes than MML (Dai, Korb, Wallace & Wu, 1997): statistical tests are not robust given weak causal interactions and/or small samples.
Cheap, and easy to use.

Bayesian LBN: Cooper & Herskovits
Cooper & Herskovits (1991, 1992) compute $P(h_i \mid e)$ by brute force, under the assumptions:
1. All variables are discrete.
2. Samples are i.i.d.
3. No missing values.
4. All values of child variables are uniformly distributed.
5. Priors over hypotheses are uniform.
With these assumptions, Cooper & Herskovits reduce the computation of $P_{CH}(h, e)$ to a polynomial-time counting problem.


Cooper & Herskovits (cont.)
But the hypothesis space is exponential; they go for a dramatic simplification:
6. Assume we know the temporal ordering of the variables.
Now for any pair of variables either they are connected by an arc or they are not; further, cycles are impossible. The new hypothesis space has size only $2^{n(n-1)/2}$ (still exponential).
Algorithm K2 does a greedy search through this reduced space.

Learning Variable Order
Reliance upon a given variable order is a major drawback to K2, and to many other algorithms (Buntine 1991, Bouckaert 1994, Suzuki 1996, Madigan & Raftery 1994).
What's wrong with that?
We want autonomous AI (data mining). If experts can order the variables, they can likely supply models.
Determining variable ordering is half the problem: if we know A comes before B, the only remaining issue is whether there is a link between the two.
The number of orderings consistent with dags is apparently exponential (Brightwell & Winkler 1990). So iterating over all possible orderings will not scale up.


Statistical Equivalence Learners
Heckerman & Geiger (1995) advocate learning only up to statistical equivalence classes (a la TETRAD II). Since observational data cannot distinguish between equivalent models, there's no point trying to go further.
⇒ Geiger and Heckerman (1994) define Bayesian metrics for linear and discrete equivalence classes of models (BGe and BDe).
⇒ Madigan, Andersson, Perlman & Volinsky (1996) follow this advice, using a uniform prior over equivalence classes.

Statistical Equivalence Learners (cont.)
Wallace & Korb (1999): This is not right!
These are causal models; they are distinguishable on experimental data. Failure to collect some data is no reason to change prior probabilities. E.g., if your thermometer topped out at 35°, you wouldn't treat 35° and 34° as equally likely.
Not all equivalence classes are created equal:
{A ← B → C, A → B → C, A ← B ← C} vs. {A → B ← C}
Within classes some dags should have greater priors than others, e.g.
LightsOn → InOffice → LoggedOn vs. LightsOn ← InOffice → LoggedOn.


Full Causal Learners
So... a full causal learner is an algorithm that:
1. Learns causal connectedness.
2. Learns v-structures. Hence, learns equivalence classes.
3. Learns full variable order. Hence, learns full causal structure (order + connectedness).
TETRAD II: 1, 2.
Madigan et al.: 1, 2.
Cooper & Herskovits' K2: 1.
Lam and Bacchus' MDL: 1, 2 (partial), 3 (partial).
Wallace, Neil, Korb MML: 1, 2, 3.

MDL
Minimum Description Length (MDL) inference was invented by Rissanen (1978), based upon Minimum Message Length (MML) invented by Wallace (Wallace and Boulton, 1968).
It trades off model simplicity against model fit to the data by minimizing the length of a joint description of the model and of the data given the model.


Lam & Bacchus (1993)
MDL encoding of causal models:
Network:
$\sum_{i=1}^{n} \left[ k_i \log(n) + d(s_i - 1) \prod_{j \in \pi(i)} s_j \right]$
  $k_i \log(n)$ for specifying the $k_i$ parents of the $i$th node;
  $d(s_i - 1) \prod_{j \in \pi(i)} s_j$ for specifying the CPT: $d$ is the fixed bit-length per probability, $s_i$ is the number of states for node $i$.
Data given network:
$N \sum_{i=1}^{n} H(X_i) - N \sum_{i=1}^{n} M(X_i; \pi(i))$
  $M(X_i; \pi(i))$ is the mutual information between $X_i$ and its parent set;
  $H(X_i)$ is the entropy of variable $X_i$.
(NB: This code is not efficient. E.g., it treats every node as equally likely to be a parent, and it assumes knowledge of all $k_i$.)

Lam & Bacchus (cont.)
Search algorithm:
Initial constraints are taken from a domain expert: partial variable order, direct connections.
Greedy search: every possible arc addition is tested, and the best MDL measure is used to add one. (Note: no arcs are deleted.)
Local arcs are checked for improved MDL via arc reversal.
Iterate until MDL fails to improve.
⇒ Results similar to K2, but without full variable ordering.


MML
Minimum Message Length (Wallace & Boulton 1968) uses Shannon's measure of information:
$I(m) = -\log P(m)$
Applied in reverse, we can compute $P(h, e)$ from $I(h, e)$. Given an efficient joint encoding method for the hypothesis & evidence space (i.e., satisfying Shannon's law), MML searches $\{h_i\}$ for the hypothesis $h$ that minimizes $I(h) + I(e \mid h)$, equivalent to the $h$ that maximizes $P(h)P(e \mid h)$, i.e., $P(h \mid e)$.
The other significant difference from MDL: MML takes parameter estimation seriously.

MML Metric for Linear Models
Network:
$\log n! + \frac{n(n-1)}{2} - \log E$
  $\log n!$ for the variable order;
  $\frac{n(n-1)}{2}$ for connectivity;
  $-\log E$ to restore efficiency by subtracting the cost of selecting a linear extension.
Parameters given dag h:
$\sum_{X_j} -\log \frac{f(\theta_j \mid h)}{\sqrt{F(\theta_j)}}$
where $\theta_j$ are the parameters for $X_j$ and $F(\theta_j)$ is the Fisher information. $f(\theta_j \mid h)$ is assumed to be $N(0, \sigma_j)$. (Cf. MDL's fixed length for parameters.)


MML Metric for Linear Models (cont.)
Sample for $X_j$ given $h$ and $\theta_j$:
$\log P(e \mid h, \theta_j) = \log \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\nu_{jk}^2 / 2\sigma_j^2}$
where $K$ is the number of sample values and $\nu_{jk}$ is the difference between the observed value of $X_j$ and its linear prediction.

MML Metric for Discrete Models
We can use $P_{CH}(h_i, e)$ (from Cooper & Herskovits) to define an MML metric for discrete models. The difference between MML and Bayesian metrics: MML partitions the parameter space and selects optimal parameters. This is equivalent to a penalty of $\frac{1}{2}\log\frac{\pi e}{6}$ per parameter (Wallace & Freeman 1987); hence:
$I(e, h_i) = \sum_j \frac{s_j}{2} \log\frac{\pi e}{6} - \log P_{CH}(h_i, e) \quad (1)$
Applied in the MML Sampling algorithm.


MML search algorithms
MML metrics need to be combined with search. This has been done three ways:
1. Wallace, Korb, Dai (1996): greedy search (linear). Brute-force computation of linear extensions (small models only).
2. Neil and Korb (1999): genetic algorithms (linear). Asymptotic estimator of linear extensions; GA chromosomes = causal models; genetic operators manipulate them; selection pressure is based on MML.
3. Wallace and Korb (1999): MML sampling (linear, discrete). Stochastic sampling through the space of totally ordered causal models; no counting of linear extensions required.

MML Sampling
Search space of totally ordered models (TOMs), sampled via a Metropolis algorithm (Metropolis et al. 1953). From the current model $M$, find the next model $M'$ by:
Randomly select a variable; attempt to swap its order with its predecessor.
Or, randomly select a pair; attempt to add/delete an arc.
Attempts succeed whenever $P(M')/P(M) > U$ (per the MML metric), where $U$ is uniformly random from $[0, 1]$. A sketch of this acceptance rule follows.
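A schematic of the acceptance step only; the MML scoring and the TOM proposal moves are abstracted into placeholder functions, and the toy state space below is an illustrative assumption, not the authors' implementation.

```python
import math
import random

def metropolis_step(model, propose, log_score, rng):
    """One Metropolis move: accept M' whenever P(M')/P(M) > U, U ~ Uniform[0,1]."""
    candidate = propose(model, rng)
    if math.exp(log_score(candidate) - log_score(model)) > rng.random():
        return candidate
    return model

# Toy stand-in for the TOM space: states 0..3 with made-up log-probabilities.
# (A real implementation would propose order swaps / arc changes and use -I(M, e).)
LOG_P = [math.log(p) for p in (0.1, 0.2, 0.3, 0.4)]
propose = lambda m, rng: rng.choice([x for x in range(4) if x != m])
log_score = lambda m: LOG_P[m]

rng, model, counts = random.Random(0), 0, [0, 0, 0, 0]
for _ in range(100_000):
    model = metropolis_step(model, propose, log_score, rng)
    counts[model] += 1
print([c / 100_000 for c in counts])   # ≈ [0.1, 0.2, 0.3, 0.4]: visit frequency ∝ P
```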


MML Sampling (cont.)
Metropolis: this procedure samples TOMs with a frequency proportional to their posterior probability. To find the posterior of a dag h, keep count of the visits to all TOMs consistent with h, estimated by counting visits to all TOMs with maximum likelihoods identical to h's.
Output: probabilities of
  top dags;
  top statistical equivalence classes;
  top MML equivalence classes.

Empirical Results
A weakness in this area, and in AI generally: paper publications are based upon very small models and loose comparisons. The ALARM net is often used, and everything gets it to within 1 or 2 arcs.
Neil and Korb (1999) compared MML and BGe (Heckerman & Geiger's Bayesian metric over equivalence classes), using identical GA search over linear models:
On KL distance and topological distance from the true model, MML and BGe performed nearly the same.
On test prediction accuracy on strict effect nodes (those with no children), MML clearly outperformed BGe.


Current Research Issues
Size and complexity
Difficulties with elicitation
Combinations of discrete and continuous nodes (i.e. mixing node types)
Learning issues:
  Missing data
  Latent variables
  Experimental data
  Learning CPT structure
  Multi-structure models: continuous & discrete CPTs, with & without parameter independence

(Other) Limitations
Inappropriate problems (deterministic systems, legal rules)


References

Introduction to Bayesian AI
T. Bayes (1764) An Essay Towards Solving a Problem in the Doctrine of Chances. Phil Trans of the Royal Soc of London. Reprinted in Biometrika, 45 (1958), 296-315.
B. Buchanan and E. Shortliffe (eds.) (1984) Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley.
B. de Finetti (1964) Foresight: Its Logical Laws, Its Subjective Sources, in Kyburg and Smokler (eds.) Studies in Subjective Probability. NY: Wiley.
D. Heckerman (1986) Probabilistic Interpretations for MYCIN's Certainty Factors, in L.N. Kanal and J.F. Lemmer (eds.) Uncertainty in Artificial Intelligence. North-Holland.
C. Howson and P. Urbach (1993) Scientific Reasoning: The Bayesian Approach. Open Court. A modern review of Bayesian theory.
K.B. Korb (1995) Inductive learning and defeasible inference, Jrn for Experimental and Theoretical AI, 7, 291-324.
R. Neapolitan (1990) Probabilistic Reasoning in Expert Systems. Wiley. Chapters 1, 2 and 4 cover some of the relevant history.
J. Pearl (1988) Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.
F.P. Ramsey (1931) Truth and Probability, in The Foundations of Mathematics and Other Essays. NY: Humanities Press. The origin of modern Bayesianism; includes lottery-based elicitation and Dutch-book arguments for the use of probabilities.
R. Reiter (1980) A logic for default reasoning, Artificial Intelligence, 13, 81-132.
J. von Neumann and O. Morgenstern (1947) Theory of Games and Economic Behavior, 2nd ed. Princeton Univ. Standard reference on eliciting utilities via lotteries.

Bayesian Networks
E. Charniak (1991) Bayesian Networks Without Tears, Artificial Intelligence Magazine, pp. 50-63, Vol 12. An elementary introduction.


D. D'Ambrosio (1999) Inference in Bayesian Networks. Artificial Intelligence Magazine, Vol 20, No. 2.
P. Haddawy (1999) An Overview of Some Recent Developments in Bayesian Problem-Solving Techniques. Artificial Intelligence Magazine, Vol 20, No. 2.
Howard & Matheson (1981) Influence Diagrams, Principles and Applications of Decision Analysis.
F.V. Jensen (1996) An Introduction to Bayesian Networks, Springer.
R. Neapolitan (1990) Probabilistic Reasoning in Expert Systems. Wiley. Similar coverage to that of Pearl; more emphasis on practical algorithms for network updating.
J. Pearl (1988) Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann. This is the classic text introducing Bayesian networks to the AI community.
Poole, D., Mackworth, A., and Goebel, R. (1998) Computational Intelligence: a logical approach. Oxford University Press.
Russell & Norvig (1995) Artificial Intelligence: A Modern Approach, Prentice Hall.

Applications
D.W. Albrecht, I. Zukerman and A.E. Nicholson (1998) Bayesian Models for Keyhole Plan Recognition in an Adventure Game. User Modeling and User-Adapted Interaction, 8(1-2), 5-47, Kluwer Academic Publishers.
S. Andreassen, F.V. Jensen, S.K. Andersen, B. Falck, U. Kjærulff, M. Woldbye, A.R. Sørensen, A. Rosenfalck and F. Jensen (1989) MUNIN: An Expert EMG Assistant, Computer-Aided Electromyography and Expert Systems, Chapter 21, J.E. Desmedt (Ed.), Elsevier.
S.A. Andreassen, J.J. Benn, R. Hovorka, K.G. Olesen and R.E. Carson (1991) A Probabilistic Approach to Glucose Prediction and Insulin Dose Adjustment: Description of Metabolic Model and Pilot Evaluation Study.
I. Beinlich, H. Suermondt, R. Chavez and G. Cooper (1992) The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks, Proc. of the 2nd European Conf. on Artificial Intelligence in Medicine, pp. 689-693.
T.L. Dean and M.P. Wellman (1991) Planning and Control, Morgan Kaufmann.
T.L. Dean, J. Allen and J. Aloimonos (1994) Artificial Intelligence: Theory and Practice, Benjamin/Cummings.


P. Dagum, A. Galper and E. Horvitz (1992) Dynamic Network Models for Forecasting, Proceedings of the 8th Conference on Uncertainty in Artificial Intelligence, pp. 41-48.
J. Forbes, T. Huang, K. Kanazawa and S. Russell (1995) The BATmobile: Towards a Bayesian Automated Taxi, Proceedings of the 14th Int. Joint Conf. on Artificial Intelligence (IJCAI'95), pp. 1878-1885.
S.L. Lauritzen and D.J. Spiegelhalter (1988) Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems, Journal of the Royal Statistical Society, 50(2), pp. 157-224.
McConachy et al. (1999).
A.E. Nicholson (1999) CSE2309/3309 Artificial Intelligence, Monash University, Lecture Notes, http://www.csse.monash.edu.au/nnn/2-3309.html.
M. Pradham, G. Provan, B. Middleton and M. Henrion (1994) Knowledge engineering for large belief networks, Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence.
D. Pynadath and M.P. Wellman (1995) Accounting for Context in Plan Recognition, with Application to Traffic Monitoring, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pp. 472-481.
M. Shwe and G. Cooper (1990) An Empirical Analysis of Likelihood-Weighting Simulation on a Large, Multiply Connected Belief Network, Proceedings of the Sixth Workshop on Uncertainty in Artificial Intelligence, pp. 498-508, 1990.
L.C. van der Gaag, S. Renooij, C.L.M. Witteman, B.M.P. Aleman, B.G. Tall (1999) How to Elicit Many Probabilities, in Laskey & Prade (eds) UAI99, 647-654.
Zukerman, I., McConachy, R., Korb, K. and Pickett, D. (1999) Exploratory Interaction with a Bayesian Argumentation System, in IJCAI-99: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 1294-1299, Stockholm, Sweden, Morgan Kaufmann.

Learning Bayesian Networks
H. Blalock (1964) Causal Inference in Nonexperimental Research. University of North Carolina.
R. Bouckaert (1994) Probabilistic network construction using the minimum description length principle. Technical Report RUU-CS-94-27, Dept of Computer Science, Utrecht University.
C. Boutillier, N. Friedman, M. Goldszmidt, D. Koller (1996) Context-specific independence in Bayesian networks, in Horvitz & Jensen (eds.) UAI 1996, 115-123.


G. Brightwell and P. Winkler (1990) Counting linear extensions is #P-complete. Technical Report DIMACS 90-49, Dept of Computer Science, Rutgers Univ.
W. Buntine (1991) Theory refinement on Bayesian networks, in D'Ambrosio, Smets and Bonissone (eds.) UAI 1991, 52-69.
W. Buntine (1996) A Guide to the Literature on Learning Probabilistic Networks from Data, IEEE Transactions on Knowledge and Data Engineering, 8, 195-210.
D.M. Chickering (1995) A Transformational Characterization of Equivalent Bayesian Network Structures, in P. Besnard and S. Hanks (eds.) Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 87-98). San Francisco: Morgan Kaufmann. Statistical equivalence.
G.F. Cooper and E. Herskovits (1991) A Bayesian Method for Constructing Bayesian Belief Networks from Databases, in D'Ambrosio, Smets and Bonissone (eds.) UAI 1991, 86-94.
G.F. Cooper and E. Herskovits (1992) A Bayesian Method for the Induction of Probabilistic Networks from Data, Machine Learning, 9, 309-347. An early Bayesian causal discovery method.
H. Dai, K.B. Korb, C.S. Wallace and X. Wu (1997) A study of causal discovery with weak links and small samples. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI), pp. 1304-1309. Morgan Kaufmann.
N. Friedman (1997) The Bayesian Structural EM Algorithm, in D. Geiger and P.P. Shenoy (eds.) Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (pp. 129-138). San Francisco: Morgan Kaufmann.
Geiger and Heckerman (1994) Learning Gaussian networks, in Lopes de Mantras and Poole (eds.) UAI 1994, 235-243.
D. Heckerman and D. Geiger (1995) Learning Bayesian networks: A unification for discrete and Gaussian domains, in Besnard and Hanks (eds.) UAI 1995, 274-284.
D. Heckerman, D. Geiger, and D.M. Chickering (1995) Learning Bayesian Networks: The Combination of Knowledge and Statistical Data, Machine Learning, 20, 197-243. Bayesian learning of statistical equivalence classes.
K. Korb (1999) Probabilistic Causal Structure, in H. Sankey (ed.) Causation and Laws of Nature:


Australasian Studies in History and Philosophy of Science 14. Kluwer Academic. Introduction to the relevant philosophy of causation for learning Bayesian networks.
P. Krause (1998) Learning Probabilistic Networks. http://www.auai.org/bayesUSKrause.ps.gz. Basic introduction to BNs, parameterization and learning causal structure.
W. Lam and F. Bacchus (1993) Learning Bayesian belief networks: An approach based on the MDL principle, Jrn Comp Intelligence, 10, 269-293.
D. Madigan, S.A. Andersson, M.D. Perlman & C.T. Volinsky (1996) Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs, Comm in Statistics: Theory and Methods, 25, 2493-2519.
D. Madigan and A.E. Raftery (1994) Model selection and accounting for model uncertainty in graphical models using Occam's window, Jrn Amer Stat Assoc, 89, 1535-1546.
N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller (1953) Equations of state calculations by fast computing machines, Jrn Chemical Physics, 21, 1087-1091.
J.R. Neil and K.B. Korb (1999) The Evolution of Causal Models: A Comparison of Bayesian Metrics and Structure Priors, in N. Zhong and L. Zhou (eds.) Methodologies for Knowledge Discovery and Data Mining: Third Pacific-Asia Conference (pp. 432-437). Springer Verlag. Genetic algorithms for causal discovery; structure priors.
J.R. Neil, C.S. Wallace and K.B. Korb (1999) Learning Bayesian networks with restricted causal interactions, in Laskey and Prade (eds.) UAI 99, 486-493.
J. Rissanen (1978) Modeling by shortest data description, Automatica, 14, 465-471.
H. Simon (1954) Spurious Correlation: A Causal Interpretation, Jrn Amer Stat Assoc, 49, 467-479.
D. Spiegelhalter & S. Lauritzen (1990) Sequential Updating of Conditional Probabilities on Directed Graphical Structures, Networks, 20, 579-605.
P. Spirtes, C. Glymour and R. Scheines (1990) Causality from Probability, in J.E. Tiles, G.T. McKee and G.C. Dean, Evolving Knowledge in Natural Science and Artificial Intelligence. London: Pitman. An elementary introduction to structure learning via conditional independence.
P. Spirtes, C. Glymour and R. Scheines (1993) Causation, Prediction and Search: Lecture Notes in Statistics 81.


Springer Verlag. A thorough presentation of the orthodox statistical approach to learning causal structure.
J. Suzuki (1996) Learning Bayesian Belief Networks Based on the Minimum Description Length Principle, in L. Saitta (ed.) Proceedings of the Thirteenth International Conference on Machine Learning (pp. 462-470). San Francisco: Morgan Kaufmann.
T.S. Verma and J. Pearl (1991) Equivalence and Synthesis of Causal Models, in P. Bonissone, M. Henrion, L. Kanal and J.F. Lemmer (eds) Uncertainty in Artificial Intelligence 6 (pp. 255-268). Elsevier. The graphical criterion for statistical equivalence.
C.S. Wallace and D. Boulton (1968) An information measure for classification, Computer Jrn, 11, 185-194.
C.S. Wallace and P.R. Freeman (1987) Estimation and inference by compact coding, Jrn Royal Stat Soc (Series B), 49, 240-252.
C.S. Wallace and K.B. Korb (1999) Learning Linear Causal Models by MML Sampling, in A. Gammerman (ed.) Causal Models and Intelligent Data Management. Springer Verlag. Sampling approach to learning causal models; discussion of structure priors.
C.S. Wallace, K.B. Korb, and H. Dai (1996) Causal Discovery via MML, in L. Saitta (ed.) Proceedings of the Thirteenth International Conference on Machine Learning (pp. 516-524). San Francisco: Morgan Kaufmann. Introduces an MML metric for causal models.
S. Wright (1921) Correlation and Causation, Jrn Agricultural Research, 20, 557-585.
S. Wright (1934) The Method of Path Coefficients, Annals of Mathematical Statistics, 5, 161-215.

Current Research

Bayesian Network URLs
