Biosurveillance 2.0: Collaboration and Web 2.0/3.0 Semantic Technologies for Better Early Disease Warning and Effective Response

Taha Kass-Hout, Nicolás di Tada


Seamlessly integrating various early disease indicators with experts' opinion for better event warning and response...


Page 2: Biosurveillance 2.0

Background

Page 3: Biosurveillance 2.0

Late Detection and Response

[Figure: cases by day, with the window labeled "Opportunity for control"]

Background

Page 4: Biosurveillance 2.0

Early Detection and Response

[Figure: cases by day, with the window labeled "Opportunity for control"]

Background

Page 5: Biosurveillance 2.0

Public Health Measures

• Representativeness

• Completeness

• Predictive Value

• Timeliness

Background

Page 6: Biosurveillance 2.0

Public Health Measures

• 1000 Malaria infections (100%)

• 50 Malaria notifications (5%)

• Get as close to the bottom of the pyramid as possible

• Urge frequent reporting: weekly, daily, immediately

• Specificity / Reliability

• Sensitivity / Timeliness

• Main attributes
  o Representativeness
  o Completeness
  o Predictive value positive

Background

Page 7: Biosurveillance 2.0

Public Health Measures

[Figure: timeline (x-axis: Time) with the labels "Health care hotline", "Automated analysis/thresholds", "Analyze and interpret", and "Signal as early as possible"]

• Main attributes
  o Timeliness

Background

Page 8: Biosurveillance 2.0

The Problem Space

• The design, analysis, and evaluation of current systems have been geared towards specific data sources and detection algorithms, not humans

• We have systems in place for the threats we have faced before

The Problem

Page 9: Biosurveillance 2.0

Traditional DISEASE SURVEILLANCE

• In the past two decades, the focus was on
  – automatically detecting anomalous patterns in data (often a single stream)

• Modern methods
  – rely on human input and judgment
  – incorporate temporal, spatial, and multivariate information

The Problem

Page 10: Biosurveillance 2.0

Traditional DISEASE SURVEILLANCE

9/20, 15213, cough/cold, …
9/21, 15207, antifever, …
9/22, 15213, CC = cough, ...
1,000,000 more records…

[Figure labels: "Huge mass of data", "Detection algorithm", "Too many alerts"]

“What are we supposed to do with this?”

The Problem

Page 11: Biosurveillance 2.0

Our Approach

• Human-based

• Collaborative and cross-disciplinary

• Web 2.0/3.0 platform

Our Approach

Page 12: Biosurveillance 2.0

Information Sources

• Event-based - ad-hoc unstructured reports issued by formal or informal sources

• Indicator-based - (number of cases, rates, proportion of strains…)

Timeliness, Representativeness, Completeness, Predictive Value, Quality, …

Our Approach

Page 13: Biosurveillance 2.0

MODERN DISEASE SURVEILLANCE

9/20, 15213, cough/cold, …
9/21, 15207, antifever, …
9/22, 15213, CC = cough, ...
1,000,000 more records…

[Figure labels: "Huge mass of data", "Feedback loop", "Fewer and more actionable alerts", "Effective and coordinated response"]

Our Approach

Page 14: Biosurveillance 2.0

Evolve: Main Components

[Figure: component diagram; the labels recoverable from the slide are listed below]

• Multiple Data Streams
• Feature extraction, reference and baseline information
• User-Generated and Machine Learning Metadata
• Tags
• Comments
• Spatio-temporal
• Flags/Alerts/Bookmarks
• Evolve Bot
• Metadata extraction
• Machine learning
• Event Classification, Characterization and Detection
• Previous Event Training Data
• Previous Event Control Data
• Social network
• Professional feedback
• Anomaly detection
• Collaborative Spaces
• Hypotheses generation/testing

Our Solution

Page 15: Biosurveillance 2.0

Evolve: Main Components

Our Solution

Page 16: Biosurveillance 2.0

Evolve: Process

[Figure: process diagram with the labels "Item" (repeated), "Hypothesis", "Field Actions and Verifications", and "Feedback / Confirmation"]

Our Solution

Page 17: Biosurveillance 2.0

Advantages of Machine Learning

P(malaria) = 22% P(influenza) = 13% P(other ILI) = 33%

Our Solution

Page 18: Biosurveillance 2.0

Machine Learning Techniques

1. Classifiers

2. Clustering

3. Bayesian Statistics

4. Neural Networks

5. Genetic Algorithms

Our Solution

Page 19: Biosurveillance 2.0

How to represent a document:

cold

fever

Our Solution
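One way to read the slide above is as a bag-of-words representation: each document becomes a point whose coordinates count how often each term occurs. The sketch below is only an illustration under that assumption; the two-term vocabulary mirrors the axes on the slide, and the function name `to_vector` and the sample report are hypothetical.

```python
import re
from collections import Counter

# Hypothetical vocabulary: the two axes shown on the slide.
VOCABULARY = ["cold", "fever"]


def to_vector(text, vocabulary=VOCABULARY):
    """Return one count per vocabulary term, in vocabulary order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Made-up report text, for illustration only.
report = "Patient reports fever and chills; fever began two days after a cold."
vector = to_vector(report)
print(vector)  # [1, 2] -> one "cold", two "fever"
```

A real system would use a much larger vocabulary and usually some weighting scheme, but the idea of mapping text to a point in feature space is the same.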

Page 20: Biosurveillance 2.0

(1) Classifiers: Problem Definition

• Map items to vectors (Feature extraction)

• Normalize those vectors

• Train the classifier

• Measure the results with new information

• Feedback the classifier

• Separate classes in feature space

Our Solution
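The steps listed on this slide can be sketched end to end in a few lines. The example below is a toy under stated assumptions, not the Evolve implementation: it uses a simple perceptron rather than an SVM, and the vocabulary, training sentences, and function names are all hypothetical.

```python
import math
import re

# Hypothetical vocabulary used for feature extraction.
VOCABULARY = ["fever", "cough", "outbreak", "football", "election"]


def extract(text):
    """Map an item to a vector of term counts (feature extraction)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [float(tokens.count(term)) for term in VOCABULARY]


def normalize(vector):
    """Scale a vector to unit length; leave the zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def predict(weights, bias, vector):
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1 if score > 0 else -1


def train(examples, epochs=20):
    """Perceptron rule: nudge the separator toward each misclassified example."""
    weights, bias = [0.0] * len(VOCABULARY), 0.0
    for _ in range(epochs):
        for vector, label in examples:
            if predict(weights, bias, vector) != label:
                weights = [w + label * x for w, x in zip(weights, vector)]
                bias += label
    return weights, bias


# Made-up training items: +1 = health event, -1 = unrelated.
TRAINING = [
    ("fever and cough reported in village", 1),
    ("outbreak of fever confirmed", 1),
    ("football final draws record crowd", -1),
    ("election results announced", -1),
]
examples = [(normalize(extract(text)), label) for text, label in TRAINING]
weights, bias = train(examples)


def classify(text):
    return predict(weights, bias, normalize(extract(text)))


# Measure the results with new information.
new_items = [("cough outbreak spreads", 1), ("football election special", -1)]
accuracy = sum(classify(t) == y for t, y in new_items) / len(new_items)

# Feed back a correction: this item is missed at first, then learned.
missed = "new outbreak suspected"
before = classify(missed)
examples.append((normalize(extract(missed)), 1))
weights, bias = train(examples)
after = classify(missed)
print(accuracy, before, after)
```

The feedback step simply retrains on the enlarged example set; an incremental update would serve the same purpose in a larger system.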

Page 21: Biosurveillance 2.0

Classifiers: Support Vector Machines (SVM)

Our Solution

Page 22: Biosurveillance 2.0

SVM – Margin Maximization

• Support vectors define the separator

Our Solution
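For reference, margin maximization in the standard linear SVM formulation (a textbook statement, not taken from the slides) can be written as follows, assuming training pairs $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$, a weight vector $w$, and an offset $b$:

$$\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \text{ for all } i$$

The separator is the hyperplane $w \cdot x + b = 0$, the margin width is $2 / \lVert w \rVert$, and the support vectors are the training points for which the constraint holds with equality.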

Page 23: Biosurveillance 2.0

SVM – Non-linear?

Φ: x → φ(x)

Map to higher-dimension space

Our Solution
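A small numeric illustration of why mapping to a higher-dimensional space helps. The data and the particular map φ(x) = (x, x²) are assumptions chosen for the example; the slide does not specify a map. In one dimension no single threshold separates the two classes, but after the mapping a threshold on the second coordinate does.

```python
def phi(x):
    """Hypothetical feature map from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)


# Toy data: the positive class sits on both sides of the negative class.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]


def separable_by_threshold(values, labels):
    """True if the labels change at most once when the values are sorted,
    i.e. a single cut point splits the two classes."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1


in_original_space = separable_by_threshold(points, labels)
in_mapped_space = separable_by_threshold([phi(x)[1] for x in points], labels)
print(in_original_space, in_mapped_space)  # False True
```

Kernel SVMs exploit the same idea without computing φ(x) explicitly, but the explicit map is easier to see in a toy case.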

Page 24: Biosurveillance 2.0

SVM – Filtering or classifying

[Figure: labels "Training Document" (repeated), "Positives", "Negatives", "Classifier", "Document 1", "Document 2", "Document 3"]

Our Solution

Page 25: Biosurveillance 2.0

(2) Clustering: Problem Definition

• Map items to vectors (Feature extraction)

• Normalization

• Agglomerative or Partitional

Our Solution

Page 26: Biosurveillance 2.0

Clustering: AGGLOMERATIVE

Our Solution
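A minimal sketch of agglomerative clustering, assuming items have already been mapped to numeric vectors. It starts with one cluster per item and repeatedly merges the two closest clusters. The single-linkage distance, the stopping rule (a target number of clusters), and the sample points are choices made for this illustration, not details from the slides.

```python
import math


def single_linkage(cluster_a, cluster_b):
    """Distance between two clusters = distance of their closest pair."""
    return min(math.dist(a, b) for a in cluster_a for b in cluster_b)


def agglomerate(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up 2-D feature vectors forming two obvious groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
clusters = agglomerate(points, 2)
print(clusters)
```

Recording the order of merges instead of stopping early would give the full hierarchy (dendrogram).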

Page 27: Biosurveillance 2.0

Clustering: PARTITIONAL

Our Solution
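Partitional clustering can be illustrated with k-means, one common partitional method (the slides do not name a specific algorithm, so this choice is an assumption). The sketch alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points; the sample points and starting centroids are made up.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied starting centroids."""
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            groups[nearest].append(p)
        # Keep the old centroid if a group ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
centroids, groups = kmeans(points, centroids=[(0.0, 0.0), (5.0, 5.0)])
print(centroids)
print(groups)
```

Unlike the agglomerative approach, the number of clusters is fixed up front and the result depends on the starting centroids.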

Page 28: Biosurveillance 2.0

(3) Bayesian Statistics

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

• P(A | B): probability of disease A (flu) once symptom B (fever) is observed

• P(B | A): probability of fever once flu is confirmed

• P(A): probability of flu (prior or marginal)

• P(B): probability of fever (prior or marginal)

Our Solution
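A worked instance of the rule above. All three input probabilities are invented for illustration and are not epidemiological estimates; the function names are likewise hypothetical. P(B) is obtained from the law of total probability.

```python
def total_probability(p_b_given_a, p_a, p_b_given_not_a):
    """P(B) = P(B|A) P(A) + P(B|not A) P(not A)."""
    return p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)


def posterior(p_b_given_a, p_a, p_b):
    """Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Illustrative numbers only.
p_flu = 0.05                # P(A): prior probability of flu
p_fever_given_flu = 0.90    # P(B|A)
p_fever_given_no_flu = 0.10 # P(B|not A)

p_fever = total_probability(p_fever_given_flu, p_flu, p_fever_given_no_flu)
p_flu_given_fever = posterior(p_fever_given_flu, p_flu, p_fever)
print(round(p_fever, 3), round(p_flu_given_fever, 3))
```

With these numbers, observing a fever raises the probability of flu from 5% to roughly 32%, which shows how a common symptom shifts but does not settle the diagnosis.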

Page 29: Biosurveillance 2.0

(4) Neural Networks

• Given a set of stimuli, train a system to produce a given output…

Our Solution

Page 30: Biosurveillance 2.0

Neural Network: Structure

• Input Layer: {I0, I1, …, In}

• Hidden Layer: […]

• Output Layer: {O0, O1, …, On}

• Weight (label on the connections between layers)

$$H_n = \sum_{i=0}^{I} \left(I_i \cdot w_{in}\right)$$

Our Solution
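The structure on this slide can be sketched as a forward pass: each hidden node takes a weighted sum of the inputs, and each output node takes a weighted sum of the hidden values. The sigmoid activation, the layer sizes, and the weight values below are assumptions added for the example; the slide shows only the weighted sum.

```python
import math


def sigmoid(x):
    """Squash a weighted sum into the range (0, 1). Assumed activation."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights):
    """weights[n][i] is the weight from input i to node n."""
    return [sigmoid(sum(value * w for value, w in zip(inputs, row)))
            for row in weights]


def forward(inputs, hidden_weights, output_weights):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_weights)
    return layer(hidden, output_weights)


# Made-up network: 3 inputs, 2 hidden nodes, 1 output.
hidden_weights = [[0.5, -0.2, 0.1],
                  [-0.3, 0.8, 0.4]]
output_weights = [[1.0, -1.0]]

outputs = forward([1.0, 0.0, 1.0], hidden_weights, output_weights)
print(outputs)
```

Training (adjusting the weights so that given stimuli produce a given output) is not shown here; it would typically use backpropagation.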

Page 31: Biosurveillance 2.0

Neural Network: Application

Event?

Our Solution

Page 32: Biosurveillance 2.0

(5) Genetic Algorithm: Basic

• Define the model that you want to optimize

• Create the fitness function

• Evolve the gene pool testing against the fitness function.

• Select the best individual

Our Solution

Page 33: Biosurveillance 2.0

Genetic Algorithm: Model

• Model the transmission process (e.g., of an infectious disease) using a set of parameters:
  – Onset time between an infection and illness
  – Latency period
  – Incubation period
  – Symptomatic period
  – Infectious period

(Onset, Latency, Incubation, Symptomatic, Infectious)

(2 days, 3 days, 1 day, 4 days, 3 days)

Our Solution

Page 34: Biosurveillance 2.0

Genetic Algorithm: Model Fitness

Fitness = 1/Area

Our Solution

Page 35: Biosurveillance 2.0

Genetic Algorithm: Process

1. Create an initial population of candidates

2. Use operators to generate new candidates (mating and mutation)

3. Discard worst individuals or select best individuals in generation

4. Repeat from 2 until you find a candidate that satisfies the solution searched

Our Solution
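The four steps above can be sketched as a short loop. This is a toy under explicit assumptions: candidates are five-parameter tuples as on the model slide, and the "area" is replaced by a stand-in (the summed absolute difference from a hypothetical target tuple) because the slides do not define how the area between modeled and observed curves is computed. The fitness uses 1/(1 + area) instead of 1/area to avoid division by zero.

```python
import random

# Hypothetical "true" parameters the search should recover
# (Onset, Latency, Incubation, Symptomatic, Infectious), in days.
TARGET = (2, 3, 1, 4, 3)


def area(candidate):
    """Stand-in for the area between modeled and observed curves."""
    return sum(abs(c - t) for c, t in zip(candidate, TARGET))


def fitness(candidate):
    """Higher is better; 1/(1 + area) avoids dividing by zero."""
    return 1.0 / (1.0 + area(candidate))


def mate(parent_a, parent_b, rng):
    """Single-point crossover of two parameter tuples."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(candidate, rng, rate=0.2):
    """Shift each parameter by one day with probability `rate` (minimum 1)."""
    return tuple(max(1, g + rng.choice((-1, 1))) if rng.random() < rate else g
                 for g in candidate)


def evolve(population, generations, rng):
    size = len(population)
    for _ in range(generations):
        if any(area(c) == 0 for c in population):
            break  # step 4: a candidate satisfies the search
        # Step 2: generate new candidates by mating and mutation.
        children = [mutate(mate(rng.choice(population),
                                rng.choice(population), rng), rng)
                    for _ in range(size)]
        # Step 3: keep the best individuals, discard the worst.
        population = sorted(population + children,
                            key=fitness, reverse=True)[:size]
    return sorted(population, key=fitness, reverse=True)


rng = random.Random(0)
# Step 1: create an initial population of candidates.
initial = [tuple(rng.randint(1, 6) for _ in range(5)) for _ in range(20)]
final = evolve(initial, generations=200, rng=rng)
best = final[0]
print(best, fitness(best))
```

Because parents compete with their children for survival, the best fitness in the population never decreases from one generation to the next.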

Page 36: Biosurveillance 2.0

Genetic Algorithm: Process

[Figure: candidate parameter tuples shown on the slide]

(4,5,6,3,5) (4,3,6,2,5)

(5,3,4,6,2) (2,4,6,3,5) (4,3,6,5,2)

(2,3,4,6,5) (3,4,5,2,6)

(3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6)

(4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)

(5,3,2,6,5)

(3,4,4,6,2)

Our Solution

Page 37: Biosurveillance 2.0

Result of incorporating all 5 techniques: Improved Surveillance

Our Solution

Page 38: Biosurveillance 2.0

Our Solution

InSTEDD Evolve

Related items (e.g., News articles) are grouped into a thread. Threads are later associated with events (hypothesized or confirmed).


InSTEDD Evolve: (http://instedd.org/evolve)

Tag cloud and semantic heatmap


Page 39: Biosurveillance 2.0

Our Solution

InSTEDD Evolve

InSTEDD Evolve: (http://instedd.org/evolve)

Filter feature, which automatically filters for related items and updates the map and associated tags

Page 40: Biosurveillance 2.0

Our Solution

InSTEDD Evolve

InSTEDD Evolve: (http://instedd.org/evolve)

Auto-generated (machine-learning) tags. These tags are semantically ranked (a statistical probability match). Users can further train the classifier by accepting or rejecting a suggestion. Users can similarly train the geo-locator by simply accepting or rejecting and updating a location.

Page 41: Biosurveillance 2.0

Our Solution

InSTEDD Evolve

InSTEDD Evolve: (http://instedd.org/evolve)

Tracking the recent Avian Influenza Outbreak in Egypt (reports started to appear late January 2009). Notice the pattern of reported incidents along the Nile river.


Page 42: Biosurveillance 2.0

Acknowledgements

Page 43: Biosurveillance 2.0

Through funding from:

Page 44: Biosurveillance 2.0

Thank You!

Taha Kass-Hout Nicolás di Tada

Page 45: Biosurveillance 2.0

BACKGROUND MATERIAL

Page 46: Biosurveillance 2.0

Index

• Disease surveillance References
  – Computing
  – Automating Laboratory Reporting
  – Using EMR data for disease surveillance
  – Related Projects
  – Misc Readings

• Open Source Software (OSS) References
  – Open Source License References
  – Open Source References
  – Open Source and Public Health References

• Architectural Matters
  – Service Oriented Architecture (or SOA)
  – Synchronization Architecture
  – Cloud Architecture

Page 47: Biosurveillance 2.0

DISEASE SURVEILLANCE

References and Related-Efforts

Page 48: Biosurveillance 2.0

REFERENCES

• Izadi, M. and Buckeridge, D., Decision Theoretic Analysis of Improving Epidemic Detection, AMIA 2007, Symposium Proceedings 2007

• EpiNorth-Based material (http://www.epinorth.org):
  – Mereckiene, J., Outbreak Investigation Operational Aspects. Jurmala, Latvia, 2006
  – Bagdonaite, J., and Mereckiene, J., Outbreak Investigation Methodological Aspects. Jurmala, Latvia, 2006
  – Epidemic Intelligence: Signals from surveillance systems, Anne Mazick, Statens Serum Institut, Denmark, EpiTrain III, Jurmala, August 2006

• Daniel Neill, Incorporating Learning into Disease Surveillance Systems

Page 49: Biosurveillance 2.0

REFERENCES

• Computing
  – The Future of Statistical Computing in Wilkinson (2008)
  – Complex Event Processing Over Uncertain Data in Wasserkrug (2008)
  – Outbreak detection through automated surveillance: A review of the determinants of detection in Buckeridge (2007)
  – Approaches to the evaluation of outbreak detection methods in Watkins (2006)
  – Algorithms for rapid outbreak detection: a research synthesis in Buckeridge (2004)
  – Data mining in bioinformatics using Weka in Frank (2004)
  – Aho-Corasick Algorithm in Kilpeläinen

• Automating Laboratory Reporting
  – Automatic Electronic Laboratory-Based Reporting in Panackal (2002)
  – Benefits and Barriers to Electronic Laboratory Results Reporting for Notifiable Diseases in Nguyen (2007)

Page 50: Biosurveillance 2.0

REFERENCES

• Using EMR Data for Disease Surveillance
  – Using Electronic Medical Records to Enhance Detection and Reporting of Vaccine Adverse Events in Hinrichsen (2007)
  – Electronic Medical Record Support for PH in Klompas (2007)
  – A knowledgebase to support notifiable disease surveillance in Doyle (2005)
  – Automated Detection of Tuberculosis Using Electronic Medical Record Data in Calderwood (2007)

• Misc Readings
  – Breakthrough in modeling emerging disease hotspots in Jones (2008)
  – Use of data mining techniques to investigate disease risk classification as a proxy for compromised biosecurity of cattle herds in Wales in Ortiz-Pelaez (2008)
  – Euclidean distance: http://en.wikipedia.org/wiki/Euclidean_distance
  – Tags/Folksonomy:
    • Tag Decay: A View Over Aging Folksonomy in Russell (2007)
    • Cloudalicious: Folksonomy Over Time in Russell (2006)

Page 51: Biosurveillance 2.0

RELATED PROJECTS

• InSTEDD Evolve: (http://instedd.org/evolve)
  – Collaborative Analytics and Environment for Linking Early Health-Related Event Detection to an Effective Response (http://taha.instedd.org/2008/09/collaborative-analytics-and-environment.html)

• ALPACA "ALPACA Light Parsing And Classifying Application (ALPACA) is a classifying tool designed for use in community-oriented software as well as in Academia. The application consists of two parts: a parsing tool for transforming raw documents into readable data, and a classifying tool for categorizing documents into user-provided classes. The application provides a user-friendly interface and a Plug-in functionality to provide a simple way to add more parsers/classifiers to the application." http://2008.hfoss.org/ALPACA

• Weka An open source "...collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes." http://www.cs.waikato.ac.nz/~ml/weka/

Page 52: Biosurveillance 2.0

RELATED PROJECTS

• The R Project for statistical computing: http://www.r-project.org

– Surveillance Project: An Open Source R-package disease surveillance framework for "...the development and the evaluation of outbreak detection algorithms in univariate and multivariate routine collected public health surveillance data." http://surveillance.r-forge.r-project.org

• The R package surveillance in Höhle (multiple articles)

• Google's Research Publications: MapReduce: Simplified Data Processing on Large Clusters (http://labs.google.com/papers/mapreduce.html)
  – Hadoop: a software platform that lets one easily write and run applications that process vast amounts of data (http://hadoop.apache.org/core)

Page 53: Biosurveillance 2.0

OPEN SOURCE SOFTWARE

References and Related-Efforts

Page 54: Biosurveillance 2.0

REFERENCES

• Open Source License References
  – http://www.opensource.org/licenses
  – http://openacs.org/about/licensing/open-source-licensing

• Open Source References
  – http://www.lifehack.org/articles/technology/open-source-life-how-the-open-movement-will-change-everything.html
  – http://en.wikipedia.org/wiki/Open_source
  – http://www.opensource.org/

• Open Source and Public Health References
  – http://www.ibiblio.org/pjones/wiki/index.php/Open_Source_Software_for_Public_Health
  – http://en.wikipedia.org/wiki/List_of_open_source_healthcare_software
  – http://www.epha.org/a/320
  – Open Source Development for Public Health: A Primer with Examples of Existing Enterprise Ready Open Source Applications in Turner (2006)
  – A Quick Survey of Open Source Software for Public Health Organizations in Mirabito and Kass-Hout (2007)

Page 55: Biosurveillance 2.0

ARCHITECTURAL MATTERS

References and Related-Efforts

Page 56: Biosurveillance 2.0

REFERENCES

• Service Oriented Architecture (or SOA)
  – Proposal for Fulfilling Strategic Objectives of the U.S. Roadmap for National Action on Decision Support through a Service-oriented Architecture Leveraging HL7 Services in Kawamoto (2007)
  – Service-oriented Architecture in Medical Software: Promises and Perils in Nadkarni (2007)
  – Wiki sources:
    • SOA: http://en.wikipedia.org/wiki/Service_Orientated_Architecture
    • Semantic service oriented architecture: http://en.wikipedia.org/wiki/Semantic_service_oriented_architecture

• Synchronization Architecture
  – InSTEDD’s Mesh4x: http://mesh4x.org

• Cloud Architecture
  – Google App Engine: Google App Engine Goes Up Against Amazon Web Services in Gartner Report (2008)