2004 University of Pittsburgh Bayesian Biosurveillance Using Multiple Data Streams Greg Cooper, Weng-Keen Wong, Denver Dash*, John Levander, John Dowling, Bill Hogan, Mike Wagner RODS Laboratory, University of Pittsburgh * Intel Research, Santa Clara

Page 1: Bayesian Biosurveillance Using Multiple Data Streams

2004 University of Pittsburgh

Bayesian Biosurveillance Using Multiple Data Streams

Greg Cooper, Weng-Keen Wong, Denver Dash*, John Levander, John Dowling, Bill Hogan, Mike Wagner

RODS Laboratory, University of Pittsburgh

* Intel Research, Santa Clara

Page 2: Bayesian Biosurveillance Using Multiple Data Streams

Outline

1. Introduction
2. Model
3. Inference
4. Conclusions

Page 3: Bayesian Biosurveillance Using Multiple Data Streams

Over-the-Counter (OTC) Data Being Collected by the National Retail Data Monitor (NRDM)

• 19,000 stores
• 50% market share nationally
• >70% market share in large cities

Page 4: Bayesian Biosurveillance Using Multiple Data Streams

ED Chief Complaint Data Being Collected by RODS

Chief Complaint ED Records for Allegheny County

Date / Time Admitted   Age     Gender   Home Zip   Work Zip   Chief Complaint
Nov 1, 2004 3:02       20-30   Male     15213                 Shortness of breath
Nov 1, 2004 3:09       70-80   Female   15132      15213      Fever
:                      :       :        :          :          :

Page 5: Bayesian Biosurveillance Using Multiple Data Streams

Objective

Using the ED and OTC data streams, detect a disease outbreak in a given region as quickly and accurately as possible

Page 6: Bayesian Biosurveillance Using Multiple Data Streams

Our Approach

Population-wide ANomaly Detection and Assessment (PANDA)

• A detection algorithm that models each individual in the population
• Combines ED and OTC data streams
• The current prototype focuses on detecting an outdoor aerosolized release of an anthrax-like agent in Allegheny County

Page 7: Bayesian Biosurveillance Using Multiple Data Streams

PANDA

Uses a causal Bayesian network

Bayesian Network: A graphical model representing the joint probability distribution of a set of random variables

[Figure: Bayesian network with nodes Location of Anthrax Release, Home Location of Person, Anthrax Infection of Person, and Visit of Person to ED]

Page 8: Bayesian Biosurveillance Using Multiple Data Streams

PANDA

Uses a causal Bayesian network

The arrows convey conditional independence relationships among the variables. They also represent causal relationships.

[Figure: Bayesian network with nodes Location of Anthrax Release, Home Location of Person, Anthrax Infection of Person, and Visit of Person to ED]
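
To make the role of the arrows concrete, here is a minimal sketch of inference by enumeration in a four-node network of this kind. It assumes the arrows run from Location of Anthrax Release and Home Location of Person into Anthrax Infection of Person, and from there into Visit of Person to ED; the slide text does not spell out the edges, and every probability below is invented for illustration rather than taken from PANDA.

```python
# All numbers are made up for illustration; they are not PANDA's parameters.
P_RELEASE = {"none": 0.98, "zipA": 0.01, "zipB": 0.01}  # P(Location of Anthrax Release)
P_HOME = {"zipA": 0.5, "zipB": 0.5}                     # P(Home Location of Person)
P_VISIT = {True: 0.90, False: 0.01}                     # P(Visit to ED = True | infected)


def p_infected(release, home):
    """P(Anthrax Infection = True | release location, home location)."""
    if release == "none":
        return 0.0
    return 0.30 if release == home else 0.02


def joint(release, home, infected, visit):
    """Joint probability from the factorisation implied by the assumed arrows."""
    p_inf = p_infected(release, home)
    p = P_RELEASE[release] * P_HOME[home]
    p *= p_inf if infected else 1.0 - p_inf
    p *= P_VISIT[infected] if visit else 1.0 - P_VISIT[infected]
    return p


def posterior_release(home, visit):
    """P(release location | home, visit), summing out the hidden infection node."""
    scores = {
        r: sum(joint(r, home, inf, visit) for inf in (True, False))
        for r in P_RELEASE
    }
    total = sum(scores.values())
    return {r: s / total for r, s in scores.items()}


post = posterior_release(home="zipA", visit=True)
```

With these invented numbers, observing an ED visit by a resident of zipA raises the probability of a release in zipA well above its prior, which is the kind of evidence flow the network is meant to capture.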

Page 9: Bayesian Biosurveillance Using Multiple Data Streams

Outline

1. Introduction
2. Model
3. Inference
4. Conclusions

Page 10: Bayesian Biosurveillance Using Multiple Data Streams

A Schematic of the Generic PANDA Model for Non-Contagious Diseases

[Figure: network with nodes Population Risk Factors, Population Disease Exposure (PDE), Population-Wide Evidence, and several Person Model subnetworks]

Page 11: Bayesian Biosurveillance Using Multiple Data Streams

A Special Case of the Generic Model

[Figure: network with nodes Anthrax Release, Time of Release, Location of Release, OTC Sales for Region, and several Person Model subnetworks]

Each person in the population is represented as a subnetwork in the overall model

Page 12: Bayesian Biosurveillance Using Multiple Data Streams

The Person Model

[Figure: Bayesian network with nodes Location of Release, Time of Release, Anthrax Infection, Home Zip, Gender, Age Decile, Other ED Disease, Respiratory from Anthrax, Respiratory CC from Other, Respiratory CC, Respiratory CC When Admitted, ED Admit from Anthrax, ED Admit from Other, ED Admission, ED Acute Respiratory Infection, Non-ED Acute Respiratory Infection, Acute Respiratory Infection, Daily OTC Purchase, Last 3 Days OTC Purchase, and OTC Sales for Region]

Page 13: Bayesian Biosurveillance Using Multiple Data Streams

Why Use a Population-Based Approach?

1. Representational power
   • Spatial, temporal, demographic, and symptom knowledge of potential diseases can be coherently represented in a single model
   • Spatial, temporal, demographic, and symptom evidence can be combined to derive a posterior probability of a disease outbreak
2. Representational flexibility
   • New types of knowledge and evidence can be readily incorporated into the model

Hypothesis: A population-based approach will achieve better detection performance than non-population-based approaches.

Page 14: Bayesian Biosurveillance Using Multiple Data Streams

The Person Model

[Figure: the Person Model Bayesian network, repeated from the earlier Person Model slide]

Page 15: Bayesian Biosurveillance Using Multiple Data Streams

The Person Model

[Figure: the Person Model Bayesian network, repeated from the earlier Person Model slide]

Equivalence Class Example:

Age Decile   Gender   Home Zip   Respiratory Chief Comp.   Date Admitted
20-30        Male     15213      Yes                       Today

Page 16: Bayesian Biosurveillance Using Multiple Data Streams

Outline

1. Introduction
2. Model
3. Inference
4. Conclusions

Page 17: Bayesian Biosurveillance Using Multiple Data Streams

Inference

[Figure: network with nodes Anthrax Release, Time of Release, Location of Release, OTC Sales for Region, and several Person Model subnetworks]

Derive P (Anthrax Release = true | OTC Sales Data & ED Data)

Comment (gfc): You need to make the point that each person model has variables that indicate whether the person visited the ED, and if so, then other variables contain information about that visit, such as the time and the chief complaint.

Comment (gfc): Say this is the model that is the focus of this talk.

Page 18: Bayesian Biosurveillance Using Multiple Data Streams

Inference

AR = Anthrax Release
ED = ED Data
PDE = Population Disease Exposure
OTC = OTC Counts

Key Term in Deriving P ( AR | OTC, ED ) :

P ( OTC, ED | PDE ) = P ( OTC | ED, PDE ) P ( ED | PDE )

• P ( OTC | ED, PDE ): Contribution of OTC Counts
• P ( ED | PDE ): Contribution of ED Data

Details in: Cooper GF, Dash DH, Levander J, Wong W-K, Hogan W, Wagner M. Bayesian Biosurveillance of Disease Outbreaks. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2004.
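
As a rough illustration of how the key term could feed into P ( AR | OTC, ED ), the sketch below applies Bayes' rule over a small discrete set of PDE hypotheses. The hypothesis names, the prior, and both likelihood tables are invented for this example; the actual derivation is the one in the paper cited above.

```python
# Hypothetical PDE hypotheses: "no release" plus two release scenarios.
# Every number here is invented for illustration.
prior = {"no_release": 0.99, "release_zipA": 0.005, "release_zipB": 0.005}

# P(ED | PDE): likelihood of the observed ED data under each hypothesis.
lik_ed = {"no_release": 1.0e-6, "release_zipA": 4.0e-5, "release_zipB": 2.0e-6}

# P(OTC | ED, PDE): likelihood of the observed OTC counts given the ED data.
lik_otc = {"no_release": 0.004, "release_zipA": 0.019, "release_zipB": 0.006}


def posterior_over_pde(prior, lik_ed, lik_otc):
    """P(PDE | OTC, ED), proportional to P(OTC | ED, PDE) P(ED | PDE) P(PDE)."""
    unnorm = {h: lik_otc[h] * lik_ed[h] * prior[h] for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}


post = posterior_over_pde(prior, lik_ed, lik_otc)

# The release posterior is the total mass on hypotheses that involve a release.
p_release = 1.0 - post["no_release"]
```

The two likelihood tables correspond to the two factors of the key term, so an OTC model and an ED model can be developed separately and multiplied together at this step.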

Page 19: Bayesian Biosurveillance Using Multiple Data Streams

Inference

AR = Anthrax Release
ED = ED Data
PDE = Population Disease Exposure
OTC = OTC Counts

Key Term in Deriving P ( AR | OTC, ED ) :

P ( OTC, ED | PDE ) = P ( OTC | ED, PDE ) P ( ED | PDE )

The focus of the remainder of this talk: P ( OTC | ED, PDE )

Page 20: Bayesian Biosurveillance Using Multiple Data Streams

The Person Model

[Figure: the Person Model Bayesian network, repeated from the earlier Person Model slide]

Page 21: Bayesian Biosurveillance Using Multiple Data Streams

Incorporating the Counts of OTC Purchases

[Figure: OTC count nodes for Zip1: Person1 to Person4 Zip1 OTC counts, Eq Class1 and Eq Class2 Zip1 OTC counts, and the overall Zip1 OTC count]

Approximate binomial distribution with a normal distribution

Page 22: Bayesian Biosurveillance Using Multiple Data Streams

The PANDA OTC Model

$P(\text{OTC sales} = X \mid ED, PDE) = \mathrm{Normal}\left(X;\ \sum_i \mu_{E_i},\ \sum_i \sigma^2_{E_i}\right)$

Recall that:

P ( OTC, ED | PDE ) = P ( OTC | ED, PDE ) P ( ED | PDE )

Page 23: Bayesian Biosurveillance Using Multiple Data Streams

Example

Age Decile   Gender   Home Zip   Respiratory Chief Comp.   Date Admitted
50-60        Male     15213      Yes                       Today

Equivalence Class 1 ~ Normal(100,100)

[Plot: probability density curve; x-axis from 0 to 350, y-axis from 0 to 0.045]

Page 24: Bayesian Biosurveillance Using Multiple Data Streams

Example

Age Decile   Gender   Home Zip   Respiratory Chief Comp.   Date Admitted
50-60        Male     15213      Yes                       Today

Equivalence Class 1 ~ Normal(100,100)

Age Decile   Gender   Home Zip   Respiratory Chief Comp.   Date Admitted
50-60        Female   15213      Yes                       Today

Equivalence Class 2 ~ Normal(150,225)

[Plot: probability density curves; x-axis from 0 to 350, y-axis from 0 to 0.045]

Page 25: Bayesian Biosurveillance Using Multiple Data Streams

Example

Age Decile   Gender   Home Zip   Respiratory Chief Comp.   Date Admitted
50-60        Male     15213      Yes                       Today

Equivalence Class 1 ~ Normal(100,100)

Age Decile   Gender   Home Zip   Respiratory Chief Comp.   Date Admitted
50-60        Female   15213      Yes                       Today

Equivalence Class 2 ~ Normal(150,225)

[Plot: probability density curves; x-axis from 0 to 350, y-axis from 0 to 0.045]

If these were the only 2 Equivalence Classes in the County then

County Cough & Cold OTC ~ Normal(100+150, 100+225) = Normal(250, 325)
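
The county-level distribution follows from the fact that a sum of independent normal variables is normal, with the means and the variances each adding. A minimal sketch of that aggregation, assuming the equivalence classes are independent and that the second parameter of Normal(·,·) is a variance (the function name is ours):

```python
from typing import Iterable, Tuple


def sum_of_normals(classes: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (mean, variance) of a sum of independent normals.

    Each input item is a (mean, variance) pair for one equivalence class.
    """
    mean = 0.0
    variance = 0.0
    for m, v in classes:
        mean += m
        variance += v
    return mean, variance


# The two equivalence classes from the example: Normal(100, 100) and Normal(150, 225).
county = sum_of_normals([(100.0, 100.0), (150.0, 225.0)])
```

Because only two running totals are needed, the cost of this step grows linearly with the number of equivalence classes.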

Page 26: Bayesian Biosurveillance Using Multiple Data Streams

Example

Now suppose 260 units are sold in the county

P( OTC Sales = 260 | ED Data, PDE ) = Normal( 260; 250, 325 ) ≈ 0.01897

[Plot: probability density curve with the observed count of 260 marked; x-axis from 0 to 350, y-axis from 0 to 0.045]
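
As a check on the arithmetic, the following sketch evaluates a normal density directly. It assumes the second parameter of Normal(·; 250, 325) is a variance, which is consistent with the peak heights implied by the plot axes (a Normal(100, 100) density peaks near 0.04); the function name `normal_pdf` is ours. Under that reading the density at 260 comes out near 0.019, whereas reading 325 as a standard deviation would instead give roughly 0.0012.

```python
import math


def normal_pdf(x: float, mean: float, variance: float) -> float:
    """Density of a normal distribution parameterised by mean and variance."""
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)


# County-level distribution from the example: mean 250, variance 325.
density = normal_pdf(260.0, 250.0, 325.0)
print(f"{density:.5f}")  # prints 0.01897
```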

Page 27: Bayesian Biosurveillance Using Multiple Data Streams

Inference Timing

Machine: P4 3 Gigahertz, 2 GB RAM

Model              Initialization Time (seconds)   Each hour of data (seconds)
ED model           55                              5
ED and OTC model   229                             5

Page 28: Bayesian Biosurveillance Using Multiple Data Streams

A Current Limitation

• Problem: Currently we assume unrealistically that a person only makes OTC purchases in his or her home zip code

• Approach 1: Aggregate OTC-counts (e.g., at the county level)

• Approach 2: For each home zip code, model the distribution of zip codes where OTC purchases are made
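
One way Approach 2 might look in code is sketched below: each home zip code gets a distribution over the zip codes where its residents buy OTC products, and expected purchases are spread across store zip codes accordingly. The table `purchase_dist`, the expected counts, and the function name are all hypothetical; the slides do not say how such a distribution would be estimated.

```python
# Hypothetical shopping pattern: outer keys are home zip codes, inner keys are
# purchase zip codes. Each inner dict is P(purchase zip | home zip) and sums to 1.
purchase_dist = {
    "15213": {"15213": 0.7, "15132": 0.3},
    "15132": {"15213": 0.2, "15132": 0.8},
}

# Expected OTC purchases generated by residents of each home zip (also invented).
expected_by_home = {"15213": 100.0, "15132": 150.0}


def expected_sales_by_store_zip(expected_by_home, purchase_dist):
    """Spread each home zip's expected purchases over the zips where they are bought."""
    sales = {}
    for home, expected in expected_by_home.items():
        for store_zip, p in purchase_dist[home].items():
            sales[store_zip] = sales.get(store_zip, 0.0) + expected * p
    return sales


sales = expected_sales_by_store_zip(expected_by_home, purchase_dist)
```

The total expected count is unchanged by the redistribution; only its allocation across store zip codes differs from the home-zip-only assumption.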

Page 29: Bayesian Biosurveillance Using Multiple Data Streams

Outline

1. Introduction
2. Model
3. Inference
4. Conclusions

Page 30: Bayesian Biosurveillance Using Multiple Data Streams

Challenges in Population-Wide Modeling Include …

• Obtaining good parameter estimates to use in modeling (e.g., the probability of an OTC cough medication purchase given an acute respiratory illness)

• Modeling time and space in a way that is both useful and computationally tractable

• Modeling contagious diseases

Page 31: Bayesian Biosurveillance Using Multiple Data Streams

Conclusions

• PANDA is a multivariate algorithm that can combine multiple data streams
• Modeling each individual in the population is computationally feasible (so far)
• An evaluation of the PANDA approach to modeling multiple data streams is in progress using semi-synthetic test data

Comment (gfc): We don't really show this experimentally here. Hopefully by the time you give this talk, we will have such data.

Page 32: Bayesian Biosurveillance Using Multiple Data Streams

Thank you

Current funding:
• National Science Foundation
• Department of Homeland Security

Earlier funding:
• DARPA

http://www.cbmi.pitt.edu/panda/

[email protected]

Page 34: Bayesian Biosurveillance Using Multiple Data Streams

The PANDA OTC Model

Model the OTC purchases for each Equivalence Class $E_i$ as a binomial distribution.

$E_i \sim \mathrm{Binomial}(N_{E_i}, P_{E_i})$

Page 35: Bayesian Biosurveillance Using Multiple Data Streams

The PANDA OTC Model

Model the OTC purchases for each Equivalence Class $E_i$ as a binomial distribution.

$E_i \sim \mathrm{Binomial}(N_{E_i}, P_{E_i})$

$N_{E_i}$: Number of people in Equivalence Class $E_i$

$P_{E_i}$: Probability of an OTC cough medication purchase during the previous 3 days by each person in Equivalence Class $E_i$

Page 36: Bayesian Biosurveillance Using Multiple Data Streams

The PANDA OTC Model

Model the OTC purchases for each Equivalence Class $E_i$ as a binomial distribution.

Approximate the binomial distribution as a normal distribution.

$E_i \sim \mathrm{Binomial}(N_{E_i}, P_{E_i}) \approx \mathrm{Normal}(\mu_{E_i}, \sigma^2_{E_i})$

Page 37: Bayesian Biosurveillance Using Multiple Data Streams

The PANDA OTC Model

Model the OTC purchases for each Equivalence Class $E_i$ as a binomial distribution.

Approximate the binomial distribution as a normal distribution.

$E_i \sim \mathrm{Binomial}(N_{E_i}, P_{E_i}) \approx \mathrm{Normal}(\mu_{E_i}, \sigma^2_{E_i})$

$\mu_{E_i} = N_{E_i} \times P_{E_i}$

$\sigma^2_{E_i} = N_{E_i} \times P_{E_i} \times (1 - P_{E_i})$
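
The moment-matching step above can be sketched as follows. The class size and purchase probability are invented for illustration, and the helper names are ours; the comparison at the end simply shows how close the normal density is to the exact binomial probability near the mean for a class of this size.

```python
import math


def binomial_normal_params(n_people: int, p_purchase: float):
    """Mean and variance of the normal approximation to Binomial(n_people, p_purchase)."""
    mean = n_people * p_purchase
    variance = n_people * p_purchase * (1.0 - p_purchase)
    return mean, variance


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Exact binomial probability of k purchases among n people."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def normal_pdf(x: float, mean: float, variance: float) -> float:
    """Density of a normal distribution parameterised by mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


# Hypothetical equivalence class: 2000 people, each with a 5% purchase probability.
mean, variance = binomial_normal_params(2000, 0.05)

# Compare the exact probability of 100 purchases with the approximating density.
exact = binomial_pmf(100, 2000, 0.05)
approx = normal_pdf(100.0, mean, variance)
```

Note that the variance is always smaller than the mean under this model, since it equals the mean multiplied by (1 - P).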

Page 38: Bayesian Biosurveillance Using Multiple Data Streams

Computational Cost of a Population-Wide Approach?

~1.4 million people in Allegheny County, Pennsylvania

Page 39: Bayesian Biosurveillance Using Multiple Data Streams

Equivalence Classes

The ~1.4M people in the modeled population can be partitioned into approximately 24,240 equivalence classes
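
A minimal sketch of the partitioning idea, assuming an equivalence class is defined by the attribute tuple shown in the example slides (age decile, gender, home zip, respiratory chief complaint, date admitted). The four-person population is made up, and the real model may use a different set of attributes.

```python
from collections import Counter

# A tiny made-up population; each tuple is
# (age decile, gender, home zip, respiratory chief complaint, date admitted).
people = [
    ("50-60", "Male", "15213", "Yes", "Today"),
    ("50-60", "Male", "15213", "Yes", "Today"),
    ("50-60", "Female", "15213", "Yes", "Today"),
    ("20-30", "Male", "15213", "Yes", "Today"),
]

# People with identical attribute tuples are treated as interchangeable,
# so one count per distinct tuple is enough.
class_sizes = Counter(people)

for attributes, size in class_sizes.items():
    print(attributes, size)
```

Inference can then loop over the distinct classes and weight each by its size, rather than visiting every person individually.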