Solving the Disconnected Data Problem in Healthcare Using MongoDB

DESCRIPTION

Data diversity in healthcare and life sciences is exploding, and healthcare reform is fundamentally changing the market. The result is more and more data, but it is compartmentalized and disconnected. At Zephyr Health, we have developed a data platform that connects thousands of healthcare data assets using an ontology-driven approach, with the data stored in MongoDB. This session shows how we break down this very challenging problem and how some of MongoDB's more recent features have been used to do so.

Sven Junkergård - CTO

Solving the Disconnected Data Problem in Healthcare Using MongoDB

A MongoSF talk – December 3rd 2014

• MSc Computer Science and Engineering – Chalmers University of Technology in Gothenburg

• AMS, Capgemini

• Cake Financial – aggregating retail investor portfolios and generating investment insights from the best of the best

• Billfloat – novel financial credit product with highly differentiated underwriting method

• Zephyr Health – built out technology and engineering team to deliver on a big vision – integrate disconnected data in healthcare and solve real problems. Now CTO.

ME

I am a reformed consultant who used to do architecture consulting…


WHO I WORK FOR – ZEPHYR HEALTH

OFFICE LOCATIONS

San Francisco, London, India

ORGANIZATIONAL EXPERTISE

• Life Sciences

• Brand Management

• Big Data

• Applied Mathematics

• Algorithms

• IaaS | SaaS | PaaS

• Machine Learning

• Artificial Intelligence

• Statistics & Modeling

• Data Science

• Visualization

• App Development

CURRENT CLIENTS

Include members of: GLOBAL TOP 5 BIOPHARM, GLOBAL TOP 5 PHARM, GLOBAL TOP 5 MEDICAL DEVICES

OUR FOCUS

• Organize disconnected data in healthcare and life science

• Visualize the combination of heterogeneous data sources in analytical problems

• Solve important and challenging problems for our customers

SOLVING THE VARIETY PROBLEM

Data trends – Healthcare example
Volume – Genomic sequencing
Velocity – Streaming device data
Variety – Understanding healthcare landscape and treatment effectiveness
Visualization – Providing relevant and powerful visualizations that provide real insights

Data sources: Internal, Vendor, Public

• Image sources: illumina and iRhythm

WHY HEALTHCARE DATA IS A DIFFERENT WORLD ENTIRELY

Loan application decision

• Applicant demographics

• Bank account

• Credit report

• Identity check

• Income verification

Key in every source: SSN

Clinical trial investigator decision

Investigator, Site, Patients

• Research

• Published trials

• Current sponsored trials

• Prescriptions

• Claims

• Funding

• Network leadership

• Site profile

• Site certification

• Site statistics

Inconsistent or missing keys

THE TYPES OF PROBLEMS THAT CAN BE SOLVED WITH INTEGRATED DISPARATE DATA

Problem – What is it?
Site selection – Finding the right locations to house clinical trials
Trial outcomes – Visualizing data from different sources within clinical trials
Medical expertise communication – Identifying the healthcare professionals with the right expertise
Scoring and ranking – Finding the top ranking healthcare professionals or institutions for a particular purpose
Network leadership analysis – Understanding who is connected to whom and how information is disseminated
Care delivery effectiveness – Identifying areas of great or poor performance and the underlying reason
Patient outcomes – Relating patient outcomes to specific market activities
Health economics – Understanding the financial effectiveness of an intervention or of introducing a new standard of care

DATA CATEGORIES AND EXAMPLES

Creating a complete picture requires combining disconnected data from an enormous variety of sources

Internal – Sales, Speakers, Partners, CRM, Payments, Trials
Vendors – Rx, Claims, Primary research, Consulting, Referral patterns
Public – Providers, Grants, Public trials, Research

Keys – Controlled, Vendor specific, Anything and nothing
Formats – Spreadsheets (structured), Flat files, Anything

Managing data variety is the key to solving the problem

A DIFFERENT PROBLEM REQUIRES A DIFFERENT SOLUTION

Instead…

• A different data model based on descriptive meta data

• A non-traditional data store

• Something other than Informatica

• Automated intelligent algorithms

• A few special tricks

• An API

• Some really great applications...

Traditional pipeline (figure): ETL, DW, DM, OLAP Cube, BI, Insight

ENTITY CENTRIC DATA MODEL

Figure: traditional, relational model (entity table; data source 1, data source 2, … data source n) versus entity centric model (entities with attributes; meta data)

ONTOLOGY-BASED DEVELOPMENT

Requirements

• Flexible

• Extensible and adaptive

• Easy to maintain

Solution

• Ontology: used to formally represent knowledge within a domain

• Vocabulary: collection of entities, attributes, relationships that provides context within the domain

• Taxonomy (Classification): a hierarchical collection of controlled terms from vocabulary

VOCABULARY

Entities – Real world things or events, e.g. institution, patient, sales, potential, etc.

Organic Attributes – Data points coming from datasets, e.g. first_name, age, revenue, date, etc.

Derived Attributes – Processed key-value pairs from existing organic and/or derived attributes

Entity Relationships – Relationships between different entities
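The four vocabulary building blocks can be pictured as a small data structure. The following is a minimal sketch, not Zephyr's implementation: the class names `Entity` and `Relationship`, the `derive` helper, and the sample identifiers `hcp-1` and `inst-1` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Relationship:
    """Hypothetical edge between two entities, e.g. a person and an institution."""
    kind: str
    target_id: str


@dataclass
class Entity:
    """Hypothetical container for one real-world thing or event."""
    entity_type: str
    entity_id: str
    organic: Dict[str, Any] = field(default_factory=dict)   # values taken from datasets
    derived: Dict[str, Any] = field(default_factory=dict)   # values computed by rules
    relationships: List[Relationship] = field(default_factory=list)

    def derive(self, name: str, rule: Callable[[Dict[str, Any]], Any]) -> None:
        """Compute a derived attribute from organic and already derived values."""
        self.derived[name] = rule({**self.organic, **self.derived})


doctor = Entity("healthcare_professional", "hcp-1",
                organic={"first_name": "Tom", "last_name": "Smith"})
doctor.relationships.append(Relationship("works_at", "inst-1"))
doctor.derive("full_name", lambda a: f"{a['first_name']} {a['last_name']}")
```

Keeping organic and derived values in separate maps mirrors the distinction on the slide: organic values stay as ingested, while derived values can be recomputed whenever a rule changes.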

WHY MONGODB?

Our requirements

• Extremely flexible data storage

• Low cost of evolving schema

• Highly performant for complex joins, recursive queries etc.

• Scalable to large volumes of connected information

MongoDB:

• Document store is a great fit for storing arbitrary information

• Key-value pairs in JSON format (allowed for both adding data traceability and cheap data evolution)

• Secondary indexes and strict consistency

• Map-reduce functionality

Challenges:

• Queries are powerful but not easy to write

• We needed complex joins across arbitrary information (how do you create an index on something when you don't know ahead of time what it is?)

DATA ORGANIZATION

Figure labels:

Full Profile: Main Profile, Entity Relationships, Attribute References

Main Profile: Identity Section, Attributes (Organic + Derived)

dataset: File Info

dataset_records: Raw Data

Geo locations

DATA INTEGRATION

DISPARATE SOURCES OF INFORMATION

STRUCTURED PROFILE

{
  first_name: Charles
  last_name: Morris
  street: 200 First St.
  city: Rochester
  state: MN
  zip: 55905
  phone: 802-555-1234
  email: cmorris@mayoclinic.com
  headshot: <AF6713…>
  thought_leader_score: 8
  pub_count: 203
}

APPLICATION REPRESENTATION

All enabled through a series of data integration algorithms

ALGORITHM EXAMPLES

Disambiguation – Automatically choosing the most authoritative version of an attribute

Dataset identification – Maximizing re-use of meta data describing imported data sets

Clustering – Pre-calculating clusters in weakly attributed data

Record linkage – Do these two records describe the same person?

C Morris
Heart and Vascular Center
123 Main St
Rochester, MN 55903
802-555-9988

Charles “Chuck” Morris
Cardiologist
200 First St.
Rochester, MN 55905
802-555-1234
cmorris@mayoclinic.com
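To make the record linkage question concrete, here is a minimal sketch that scores the two Morris records from the slide. It is not the algorithm used in the platform: the field names, the weights (50, 25, 15, 10) and the match threshold of 70 are assumptions picked for illustration only.

```python
import re


def normalize(text):
    """Lower-case, drop punctuation, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()


def name_compatible(a, b):
    """True when last names are equal and first names share an initial."""
    ta, tb = normalize(a), normalize(b)
    return ta[-1] == tb[-1] and ta[0][0] == tb[0][0]


def linkage_score(rec_a, rec_b):
    """Sum hypothetical integer weights for each piece of agreeing evidence."""
    score = 0
    if name_compatible(rec_a["name"], rec_b["name"]):
        score += 50
    if rec_a["city"].lower() == rec_b["city"].lower() and rec_a["state"] == rec_b["state"]:
        score += 25
    if rec_a["zip"][:3] == rec_b["zip"][:3]:      # same postal area
        score += 15
    if rec_a.get("phone") and rec_a.get("phone") == rec_b.get("phone"):
        score += 10
    return score


MATCH_THRESHOLD = 70  # assumed cut-off, not taken from the talk

record_a = {"name": "C Morris", "city": "Rochester", "state": "MN",
            "zip": "55903", "phone": "802-555-9988"}
record_b = {"name": 'Charles "Chuck" Morris', "city": "Rochester", "state": "MN",
            "zip": "55905", "phone": "802-555-1234"}
record_c = {"name": "Carol Norris", "city": "Rochester", "state": "MN",
            "zip": "55901", "phone": "802-555-0000"}

same_person = linkage_score(record_a, record_b) >= MATCH_THRESHOLD
```

With these assumed weights the two Morris records agree on name, city and postal area but not on phone, so they clear the threshold, while the differently named `record_c` does not.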

ILLUSTRATIVE MONGODB PROFILE

{
  "_id": "53bcf9cae4b03f352d4b47c7",
  "identity": {
    "npi": "1",
    "specialty": ["Cardiologist"],
    "first_name": "Tom",
    "last_name": "Smith"
  },
  "attributes": {
    "npi": {1},
    "first_name": {"Tom"},
    "last_name": {"Smith"},
    "specialty": {"Cardiologist"}
  }
}

NPI FirstName LastName Specialty

1 Tom Smith Cardiologist

ADDING ADDITIONAL ATTRIBUTES

{
  "_id": "53bcf9cae4b03f352d4b47c7",
  "identity": {
    "npi": "1",
    "specialty": ["Cardiologist"],
    "first_name": "Tom",
    "last_name": "Smith"
  },
  "attributes": {
    "npi": {1},
    "first_name": {"Tom"},
    "last_name": {"Smith"},
    "specialty": {"Cardiologist"},
    "institution": {"UCSF Medical Center"},
    "clinical_trial": {"Heart Valve Clinical Trial"},
    "start_date": {"01/01/2011"},
    "end_date": {"03/25/2013"}
  }
}

NPI FirstName LastName Specialty

1 Tom Smith Cardiologist

NPI  Institution          ClinicalTrial Name          Start Date  End Date

1    UCSF Medical Center  Heart Valve Clinical Trial  01/01/2011  03/25/2013
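Folding a second dataset into an existing profile without changing any schema can be sketched with plain dictionaries. This is a minimal illustration under assumptions: the `merge_row` helper, the `{"value", "sources"}` shape used for traceability, and the dataset names `providers` and `trials` are hypothetical and are not the stored format shown on the slides.

```python
def merge_row(profile, row, dataset_id):
    """Add each column of `row` to the profile and record which dataset supplied it."""
    attributes = profile.setdefault("attributes", {})
    for name, value in row.items():
        values = attributes.setdefault(name, [])
        entry = next((v for v in values if v["value"] == value), None)
        if entry is None:
            # New value for this attribute: store it with its source.
            values.append({"value": value, "sources": [dataset_id]})
        elif dataset_id not in entry["sources"]:
            # Known value confirmed by another dataset: extend its provenance.
            entry["sources"].append(dataset_id)
    return profile


profile = {"identity": {"npi": "1"}}

merge_row(profile,
          {"npi": "1", "first_name": "Tom", "last_name": "Smith",
           "specialty": "Cardiologist"},
          "providers")

merge_row(profile,
          {"npi": "1", "institution": "UCSF Medical Center",
           "clinical_trial": "Heart Valve Clinical Trial",
           "start_date": "01/01/2011", "end_date": "03/25/2013"},
          "trials")
```

The second call adds four new attribute names to the same document, which is the point of the slide: new columns become new keys, and nothing that already exists has to be migrated.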

TRICKS TO TAME THE WILD DATA

• Ontology – how we keep track of all ingested information

• Vocabulary – bringing structure to large variety of information

• Derived attributes – encapsulate complexity

• GIS transformations – practical integration of geo data

• Indexing – fast access to complex information in MongoDB

DERIVED ATTRIBUTES

What's the problem?

• Data is rarely clean and business rules are complex

What are we doing about it?

• Use existing (organic) attributes and apply rules to generate new (derived) attributes

• Derived attributes generated through queries or map-reduce jobs

Why it matters

• Too complex and expensive to consider all business rules at run-time with every query

• Hides the complexity and introduces uniformity
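The idea of pre-computing derived attributes from organic ones can be sketched as a small rule runner. This is an assumption-laden illustration rather than the platform's mechanism (the talk mentions queries and map-reduce jobs): the rule name `trial_duration_days`, the `apply_rules` helper and the flat `attributes` layout are hypothetical.

```python
from datetime import datetime


def trial_duration_days(attrs):
    """Hypothetical rule: days between the organic start_date and end_date."""
    fmt = "%m/%d/%Y"
    start = datetime.strptime(attrs["start_date"], fmt)
    end = datetime.strptime(attrs["end_date"], fmt)
    return (end - start).days


def apply_rules(profile, rules):
    """Run every rule once at ingestion time and store the results on the profile."""
    derived = profile.setdefault("derived", {})
    for name, rule in rules.items():
        try:
            derived[name] = rule(profile["attributes"])
        except (KeyError, ValueError):
            # Dirty or missing inputs: leave the derived attribute unset.
            continue
    return profile


RULES = {"trial_duration_days": trial_duration_days}

complete = {"attributes": {"start_date": "01/01/2011", "end_date": "03/25/2013"}}
incomplete = {"attributes": {"start_date": "01/01/2011"}}

apply_rules(complete, RULES)
apply_rules(incomplete, RULES)
```

Because the rule runs once when data is ingested, a query only has to read `derived["trial_duration_days"]` and never needs to know how dates are formatted or what to do when one is missing.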

GEOSPATIAL MAPPING APPROACH FOR AWKWARD GEO DATA

Using traditional method

Figure labels: Reporting unit, Postal codes (70173, 70173, 70173), Mapping + calculations, Stuttgart District, State Baden-Württemberg

• Additional challenges with mismatches between reporting unit postal codes and mapping postal codes

• Have to compensate for missing postal codes

• Split patients or metrics across multiple regions when reporting unit spans multiple regions

Using geospatial method

Figure labels: Geocoded reporting unit, Mapping + calculation, Stuttgart District, State Baden-Württemberg

• Requires determining a single central point for each reporting unit

• Uses no mapping documents

• No compensatory calculations required

• Overall accuracy increases
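The geospatial method boils down to two steps: pick one central point per reporting unit, then test which region contains it. The sketch below uses a plain ray-casting point-in-polygon test as a stand-in; the talk does not say which technique or library was used. The region names and coordinates are made up and do not describe any real boundary.

```python
def central_point(points):
    """Single central point for a reporting unit: the mean of its geocoded points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a ray to the right of the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_region(point, regions):
    """Return the name of the first region containing the point, or None."""
    for name, polygon in regions.items():
        if point_in_polygon(point, polygon):
            return name
    return None


# Hypothetical regions on an abstract grid, not real district outlines.
REGIONS = {
    "region_a": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "region_b": [(4, 0), (8, 0), (8, 4), (4, 4)],
}

unit_center = central_point([(1, 1), (2, 3)])
unit_region = assign_region(unit_center, REGIONS)
```

Because each reporting unit is reduced to one point, it lands in exactly one region, which is why no postal code mapping documents or metric-splitting calculations are needed in this approach.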

INDEXING

Why MongoDB alone does not get it done

• Cross collection queries required for large number of scenarios

• Indexing challenges when dealing with unknown information

What we did

• Graph based index

• Entities and attributes are nodes

• Entity – attribute ownership and entity to entity relationships are edges

How we use it

• zQueries allow us to do complex queries from web front ends
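A graph based index of this shape can be sketched in a few lines: entities and attribute values are nodes, and ownership and relationships are undirected edges. This is a minimal in-memory illustration, not the platform's index or the zQuery language; the `GraphIndex` class, its method names and the sample identifiers are hypothetical.

```python
from collections import defaultdict


class GraphIndex:
    """Hypothetical adjacency-set index over entity and attribute nodes."""

    def __init__(self):
        self.edges = defaultdict(set)

    def _add_edge(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def add_attribute(self, entity_id, name, value):
        """Entity to attribute ownership edge."""
        self._add_edge(("entity", entity_id), ("attribute", name, value))

    def relate(self, entity_a, entity_b):
        """Entity to entity relationship edge."""
        self._add_edge(("entity", entity_a), ("entity", entity_b))

    def entities_with(self, name, value):
        """All entities that own the given attribute value."""
        node = ("attribute", name, value)
        return {n[1] for n in self.edges.get(node, ()) if n[0] == "entity"}

    def related_entities_with(self, entity_id, name, value):
        """Entities one relationship hop away that own the given attribute value."""
        node = ("entity", entity_id)
        neighbours = {n[1] for n in self.edges.get(node, ()) if n[0] == "entity"}
        return neighbours & self.entities_with(name, value)


index = GraphIndex()
index.add_attribute("hcp-1", "specialty", "Cardiologist")
index.add_attribute("hcp-2", "specialty", "Cardiologist")
index.add_attribute("inst-1", "name", "UCSF Medical Center")
index.relate("hcp-1", "inst-1")

cardiologists = index.entities_with("specialty", "Cardiologist")
at_institution = index.related_entities_with("inst-1", "specialty", "Cardiologist")
```

No attribute name has to be known when the index is created, since any name and value pair simply becomes a new node; that addresses the problem of indexing information that is unknown ahead of time.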

THE ZEPHYR PLATFORM

100,000,000+ data points ingested and indexed each year

Figure labels:

• Disconnected Data Apps for Life Sciences

• Proprietary REST API, zQuery

• Ontology Driven Data Tier: Data Organized in Connected Profile Documents; Graph Based Materialized Query Index

• Algorithm Driven Data Ingestion; Synchronization

• Data sources: Internal, Vendor, Public

CONSUMING INTEGRATED DISPARATE DATA

Analytical applications use the zAPI and the ontology to produce applications that adapt to changing data

Zephyr Platform: Ontology Driven Data Store

API

• REST API – Exposes both data and the ontology

• zQueries – JSON based query language for queries against dynamic and connected data

Analytical Apps

• Functional Focus – Solving specific business problem with focused apps

• Design – Single page apps with targeted data visualizations

TARGETED ANALYTICAL APPLICATIONS

Apps for real business problems leveraged by everyday business users

Illuminate, Voyager, Kaleidoscope, Lighthouse

A BRIEF DEMO

LEARNINGS

• There was no one technology or one database that provided a complete solution: embrace diversity

• Create a generic platform, pour effort into specialized algorithms to populate data intelligently

• Ontology driven development can be very powerful, but data organization is still a challenge

• Indexing on a priori unknown attributes is challenging

• Data modeling is always important; large profiles had to be broken down

SUMMARY

Wrapping it all up in five points

1. Healthcare is different and has lots of critical data that is disconnected

2. Generic, MongoDB-based data storage model using meta-data

3. Data integration powered by algorithms

4. Document profiles for facts, graph for querying

5. Diverse set of end user analytical applications powered by the generic data platform

Why this matters

• Standards are really important, but slow to develop

• Huge amount of change occurring in our healthcare system

• We need to make decisions today based on available data sets despite existing challenges

THANK YOU!

Brian Roy – Strategy and architecture

Mahesh Chaudhari – Database architecture

Cesar Arevalo – Data integration implementation

The guys that made all of it come together!

CONTACT INFORMATION

Sven Junkergård, CTO
+1.415.503.7412
sven@zephyrhealth.com

Zephyr Health
450 Mission St. Suite 201
San Francisco, California 94105
+1.415.529.7649
zephyrhealth.com

BACKUP SCREEN SHOTS

ILLUMINATE – LANDING PAGE

ILLUMINATE – ALL CASES VIEW

ILLUMINATE – GRID VIEW

ILLUMINATE – GRAPH VIEW

ILLUMINATE – PROFILE VIEW
