Solving the Disconnected Data Problem in Healthcare Using MongoDB

Sven Junkergård - CTO


A MongoSF talk – December 3rd 2014

• MSc Computer Science and Engineering – Chalmers University of Technology in Gothenburg
• AMS, Capgemini
• Cake Financial – aggregating retail investor portfolios and generating investment insights from the best of the best
• Billfloat – novel financial credit product with highly differentiated underwriting method
• Zephyr Health – built out technology and engineering team to deliver on a big vision – integrate disconnected data in healthcare and solve real problems. Now CTO.

I am a reformed consultant who used to do architecture consulting…

WHO I WORK FOR – ZEPHYR HEALTH

OFFICE LOCATIONS
San Francisco, London

ORGANIZATIONAL EXPERTISE
• Life Sciences
• Brand Management
• Big Data
• Applied Mathematics
• Algorithms
• IaaS | SaaS | PaaS
• Machine Learning
• Artificial Intelligence
• Statistics & Modeling
• Data Science
• Visualization
• App Development

CURRENT CLIENTS
Include members of:
• Global top 5 biopharm
• Global top 5 medical devices

OUR FOCUS

• Organize disconnected data in healthcare and life science

• Visualize the combination of heterogeneous data sources in analytical problems

• Solve important and challenging problems for our customers

SOLVING THE VARIETY PROBLEM

Data trend – Healthcare example
• Volume – Genomic sequencing
• Velocity – Streaming device data
• Variety – Understanding healthcare landscape and treatment effectiveness (Internal, Vendor, Public)
• Visualization – Providing relevant and powerful visualizations that provide real insights

Image sources: Illumina and iRhythm

WHY HEALTHCARE DATA IS A DIFFERENT WORLD ENTIRELY

Loan application decision Clinical trial investigator decision

• Research

• Published trials

• Current sponsored trials

• Prescriptions

• Claims

• Funding

• Network leadership

• Site profile

• Site certification

• Site statistics

Applicant demographics

account

Credit report

Identity

Income verification

SSN SSN

Investigator

Patients

Inconsistent o

THE TYPES OF PROBLEMS THAT CAN BE SOLVED

WITH INTEGRATED DISPARATE DATA

Problem – What is it?

• Site selection – Finding the right locations to house clinical trials
• Trial outcomes – Visualizing data from different sources within clinical trials
• Medical expertise communication – Identifying the healthcare professionals with the right expertise
• Scoring and ranking – Finding the top ranking healthcare professionals or institutions for a particular purpose
• Network leadership analysis – Understanding who is connected to whom and how information is disseminated
• Care delivery effectiveness – Identifying areas of great or poor performance and the underlying reason
• Patient outcomes – Relating patient outcomes to specific market activities
• Health economics – Understanding the financial effectiveness of an intervention or of introducing a new standard of care

DATA CATEGORIES AND EXAMPLES

Keys – Controlled | Vendor specific | Anything and nothing

Formats – Spreadsheets (structured) | Flat files | Anything

Managing variety is the key to solving the problem

Speakers

Partners

Payments

Trials

Internal

Claims

Primary research

Consulting

Referral patterns

Vendors

Providers

Grants

Public trials

Research

Public

Creating a complete picture requires combining disconnected data from an enormous variety of sources

Managing data variety is the key to solving the problem

A DIFFERENT PROBLEM REQUIRES A DIFFERENT SOLUTION

Instead…

• A different data model based on descriptive metadata

• A non-traditional data store

• Something other than Informatica

• Automated intelligent algorithms

• A few special tricks

• An API

• Some really great applications...

OLAP Cube BI Insigh

ETL DW DM

ENTITY CENTRIC DATA MODEL

Traditional, relational model vs. entity centric model

(Diagram labels: Entity; Attributes; source 1, source 2, … source n)

ONTOLOGY-BASED DEVELOPMENT

Requirements
• Flexible
• Extensible and adaptive
• Easy to maintain

Solution
• Ontology: used to formally represent knowledge within a domain
• Vocabulary: collection of entities, attributes, and relationships that provides context within the domain
• Taxonomy (classification): a hierarchical collection of controlled terms from the vocabulary

VOCABULARY

• Entities – Real world things or events. E.g. institution, patient, sales, potential, etc.
• Organic attributes – Data points coming from datasets. E.g. first_name, age, revenue, date, etc.
• Derived attributes – Processed key-value pairs from existing organic and/or derived attributes
• Entity relationships – Relationships between different entities
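The vocabulary above can be pictured as a small data structure. The following is only a sketch: the class names, the `kind` and `source` fields, and the `is_prolific` rule are illustrative assumptions, not the schema used on the platform.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """One value attached to an entity, plus where it came from."""
    name: str
    value: object
    kind: str = "organic"   # "organic" (from a dataset) or "derived" (computed)
    source: str = ""        # hypothetical traceability field


@dataclass
class Entity:
    """A real-world thing or event, e.g. an institution or a patient."""
    entity_type: str
    attributes: list = field(default_factory=list)
    relationships: list = field(default_factory=list)  # (label, other Entity)

    def add(self, name, value, kind="organic", source=""):
        self.attributes.append(Attribute(name, value, kind, source))

    def values(self, name):
        return [a.value for a in self.attributes if a.name == name]


doctor = Entity("healthcare_professional")
doctor.add("first_name", "Charles", source="vendor_file")
doctor.add("pub_count", 203, source="public_registry")
# A derived attribute is computed from existing ones (this rule is made up).
doctor.add("is_prolific", doctor.values("pub_count")[0] > 100, kind="derived")

clinic = Entity("institution")
clinic.add("city", "Rochester", source="vendor_file")
# An entity relationship links two entities.
doctor.relationships.append(("works_at", clinic))
```

Because attributes are plain name/value pairs rather than fixed columns, a new kind of attribute needs no schema change.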

WHY MONGODB?

Our requirements
• Extremely flexible data storage
• Low cost of evolving schema
• Highly performant for complex joins, recursive queries, etc.
• Scalable to large volumes of connected information

MongoDB
• Document store is a great fit for storing arbitrary information
• Key-value pairs in JSON format (allowed for both adding data traceability and cheap data evolution)
• Secondary indexes and strict consistency
• Map-reduce functionality

Challenges
• Queries are powerful but not easy to write
• We needed complex joins across arbitrary information (how do you create an index on something when you don’t know ahead of time what it is?)

DATA ORGANIZATION

Full Profile
• Main Profile
• Identity Section
• Attributes (Organic + Derived)
• Entity Relationships
• Attribute References
• Geo locations

dataset – File Info

dataset_records – Raw Data

DATA INTEGRATION

first_name: Charles

last_name: Morris

street: 200 First St.

city: Rochester

state: MN

zip: 55905

phone: 802-555-1234

email: cmorris@mayoclinic.com

headshot: <AF6713…>

thought_leader_score: 8

pub_count: 203

DISPARATE SOURCES OF INFORMATION

STRUCTURED PROFILE

APPLICATION REPRESENTATION

All enabled through a series of data integration algorithms

ALGORITHM EXAMPLES

• Disambiguation – Automatically choosing the most authoritative version of an attribute
• Dataset identification – Maximizing re-use of metadata describing imported data sets
• Clustering – Pre-calculating clusters in weakly attributed data
• Record linkage:

  C Morris, Heart and Vascular Center, 123 Main St, Rochester, MN 55903, 802-555-9988

  ??

  Charles “Chuck” Morris, Cardiologist, 200 First St., Rochester, MN 55905, 802-555-1234, cmorris@mayoclinic.com
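To make the record linkage question concrete, here is a minimal sketch that scores the two Morris records by averaging string similarity over their shared fields. It uses Python's standard `difflib`, not the platform's algorithms; the field names, the third record, and the `THRESHOLD` cut-off are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of matching characters, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_score(rec_a: dict, rec_b: dict) -> float:
    """Average similarity over the fields both records share."""
    shared = [k for k in rec_a if k in rec_b]
    if not shared:
        return 0.0
    total = sum(similarity(str(rec_a[k]), str(rec_b[k])) for k in shared)
    return total / len(shared)


a = {"name": "C Morris", "city": "Rochester", "state": "MN",
     "phone": "802-555-9988"}
b = {"name": 'Charles "Chuck" Morris', "city": "Rochester", "state": "MN",
     "phone": "802-555-1234", "email": "cmorris@mayoclinic.com"}
# A clearly unrelated record, invented for contrast.
c = {"name": "Tom Smith", "city": "San Francisco", "state": "CA",
     "phone": "415-555-0000"}

THRESHOLD = 0.7  # hypothetical cut-off for "probably the same person"

same_person = link_score(a, b) >= THRESHOLD
different_person = link_score(a, c) < THRESHOLD
```

A real linker would weight fields differently (an exact email match counts for more than a matching state) and would need blocking to avoid comparing every pair of records.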

ILLUSTRATIVE MONGODB PROFILE

{
  "_id": "53bcf9cae4b03f352d4b47c7",
  "identity": {
    "npi": "1",
    "specialty": ["Cardiologist"],
    "first_name": "Tom",
    "last_name": "Smith"
  },
  "attributes": {
    "npi": {1},
    "first_name": {"Tom"},
    "last_name": {"Smith"},
    "specialty": {"Cardiologist"}
  }
}

NPI FirstName LastName Specialty

1 Tom Smith Cardiologist

ADDING ADDITIONAL ATTRIBUTES

{
  "_id": "53bcf9cae4b03f352d4b47c7",
  "identity": {
    "npi": "1",
    "specialty": ["Cardiologist"],
    "first_name": "Tom",
    "last_name": "Smith"
  },
  "attributes": {
    "npi": {1},
    "first_name": {"Tom"},
    "last_name": {"Smith"},
    "specialty": {"Cardiologist"},
    "institution": {"UCSF Medical Center"},
    "clinical_trial": {"Heart Valve Clinical Trial"},
    "start_date": {"01/01/2011"},
    "end_date": {"03/25/2013"}
  }
}

NPI FirstName LastName Specialty

1 Tom Smith Cardiologist

NPI   Institution           Clinical Trial Name          Start Date   End Date

1     UCSF Medical Center   Heart Valve Clinical Trial   01/01/2011   03/25/2013
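The step from the first profile to the second, folding a new source row into an existing document, can be sketched as below. The `value`/`sources` layout inside each attribute and the dataset names `providers` and `trials` are assumptions made to illustrate traceability; the slides elide what the inner `{…}` objects actually contain.

```python
def merge_record(profile: dict, record: dict, dataset: str) -> dict:
    """Fold one flat source record into a profile document, in place."""
    attributes = profile.setdefault("attributes", {})
    for key, value in record.items():
        entries = attributes.setdefault(key, [])
        # Keep every distinct value and remember which datasets supplied it.
        for entry in entries:
            if entry["value"] == value:
                if dataset not in entry["sources"]:
                    entry["sources"].append(dataset)
                break
        else:
            entries.append({"value": value, "sources": [dataset]})
    return profile


profile = {"identity": {"npi": "1", "first_name": "Tom", "last_name": "Smith"}}

merge_record(profile,
             {"npi": "1", "first_name": "Tom", "last_name": "Smith",
              "specialty": "Cardiologist"},
             dataset="providers")
merge_record(profile,
             {"npi": "1", "institution": "UCSF Medical Center",
              "clinical_trial": "Heart Valve Clinical Trial",
              "start_date": "01/01/2011", "end_date": "03/25/2013"},
             dataset="trials")
```

The second dataset adds four attributes the first never mentioned, and nothing about the first had to change to accommodate them.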

TRICKS TO TAME THE WILD DATA

• Ontology – how we keep track of all ingested information

• Vocabulary – bringing structure to large variety of information

• Derived attributes – encapsulate complexity

• GIS transformations – practical integration of geo data

• Indexing – fast access to complex information in MongoDB

DERIVED ATTRIBUTES

What’s the problem?
• Data is rarely clean and business rules are complex

What are we doing about it?
• Use existing (organic) attributes and apply rules to generate new (derived) attributes
• Derived attributes generated through queries or map-reduce jobs

Why it matters
• Too complex and expensive to consider all business rules at run-time with every query
• Hides the complexity and introduces uniformity
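A derived attribute can be sketched as a rule applied once, in a batch pass, with the result stored on the profile. Here a plain Python loop stands in for the query or map-reduce job; the `trial_duration_days` rule and the `derived` key are hypothetical.

```python
from datetime import datetime


def trial_duration_days(attributes: dict) -> int:
    """Hypothetical rule: trial length from the organic start and end dates."""
    fmt = "%m/%d/%Y"
    start = datetime.strptime(attributes["start_date"], fmt)
    end = datetime.strptime(attributes["end_date"], fmt)
    return (end - start).days


def add_derived(profile: dict, name: str, rule) -> dict:
    """Compute a derived attribute once and store it next to the organic ones."""
    profile.setdefault("derived", {})[name] = rule(profile["attributes"])
    return profile


profiles = [
    {"attributes": {"clinical_trial": "Heart Valve Clinical Trial",
                    "start_date": "01/01/2011",
                    "end_date": "03/25/2013"}},
]

# Batch pass over all profiles, standing in for a query or map-reduce job.
for p in profiles:
    add_derived(p, "trial_duration_days", trial_duration_days)
```

Queries at run-time then read the stored value instead of re-applying the date rule on every request.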

Entity

Attributes

GEOSPATIAL MAPPING APPROACH FOR AWKWARD GEO DATA

Using traditional method
(Diagram labels: Reporting unit; Postal codes 70173, 70173, 70173; Mapping + calculations; Stuttgart District; Baden-Württemberg)
• Additional challenges with mismatches between reporting unit postal codes and mapping postal codes
• Have to compensate for missing postal codes
• Split patients or metrics across multiple regions when reporting unit spans multiple regions

Using geospatial method
(Diagram labels: Geocoded reporting unit; Mapping + calculation; Stuttgart District; Baden-Württemberg)
• Requires determining a single central point for each reporting unit
• Uses no mapping documents
• No compensatory calculations required
• Overall accuracy increases
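The geospatial method reduces to one question: which region contains the reporting unit's central point? A minimal sketch follows, using a standard ray-casting point-in-polygon test. The rectangles are toy boxes, not real administrative boundaries, and the coordinates for the reporting unit are approximate; all names here are illustrative.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is the point inside the polygon of (lon, lat) vertices?"""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Count the edges that a horizontal ray from the point crosses.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def assign_region(unit, regions):
    """Return the name of the first region containing the unit's central point."""
    for name, polygon in regions.items():
        if point_in_polygon(unit["lon"], unit["lat"], polygon):
            return name
    return None


# Toy rectangles standing in for real region boundaries.
regions = {
    "Stuttgart District (toy box)": [(8.9, 48.6), (9.5, 48.6), (9.5, 49.0), (8.9, 49.0)],
    "Other region (toy box)": [(10.0, 48.0), (11.0, 48.0), (11.0, 49.0), (10.0, 49.0)],
}

# One geocoded reporting unit with a single central point (approximate).
unit = {"id": "unit-70173", "lon": 9.18, "lat": 48.78}
region = assign_region(unit, regions)
```

Because each unit maps to exactly one point, no postal code mapping document is consulted and no metric has to be split across regions.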

INDEXING

Why MongoDB alone does not get it done
• Cross collection queries required for large number of scenarios
• Indexing challenges when dealing with unknown information

What we did
• Graph based index
• Entities and attributes are nodes
• Entity – attribute ownership and entity to entity relationships are edges

How we use it
• zQueries allow us to do complex queries from web front ends
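The idea of the graph based index can be sketched in memory as follows. This is a toy model, not the platform's index or the zQuery language: the `GraphIndex` class, its method names, and the sample entities are assumptions for illustration.

```python
from collections import defaultdict


class GraphIndex:
    """Toy in-memory graph: entities and attribute values are both nodes."""

    def __init__(self):
        self.edges = defaultdict(set)  # node -> set of (edge_label, node)

    def add_attribute(self, entity, name, value):
        """Ownership edge between an entity node and an attribute node."""
        attr_node = ("attr", name, value)
        self.edges[("entity", entity)].add(("has", attr_node))
        self.edges[attr_node].add(("owned_by", ("entity", entity)))

    def relate(self, source, label, target):
        """Entity-to-entity relationship edge."""
        self.edges[("entity", source)].add((label, ("entity", target)))

    def entities_with(self, name, value):
        found = self.edges.get(("attr", name, value), ())
        return sorted(node[1] for label, node in found if label == "owned_by")

    def related(self, entity, label):
        found = self.edges.get(("entity", entity), ())
        return sorted(node[1] for lbl, node in found if lbl == label)


index = GraphIndex()
index.add_attribute("dr_smith", "specialty", "Cardiologist")
index.add_attribute("dr_morris", "specialty", "Cardiologist")
index.relate("dr_smith", "works_at", "ucsf_medical_center")

# "Cardiologists who work at UCSF": attribute lookup plus relationship hop.
matches = [e for e in index.entities_with("specialty", "Cardiologist")
           if "ucsf_medical_center" in index.related(e, "works_at")]
```

Any attribute name becomes a node the moment it is ingested, so nothing has to be declared ahead of time for it to be searchable.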

Disconnected Data Apps for Life Sciences

Algorithm Driven Data Ingestion

Synchronization

Proprietary REST API

zQuery

Internal, Vendor, Public

Data Organized in Connected Profile Documents

Graph Based Materialized Query Index

Ontology Driven Data Tier

100,000,000+ data points ingested and indexed each year

THE ZEPHYR PLATFORM


Zephyr Platform
• Ontology Driven Data Store
• REST API – Exposes both data and the ontology
• zQueries – JSON-based query language for queries against dynamic and connected data

Analytical Apps
• Functional Focus – Solving specific business problems with focused apps
• Design – Single page apps with targeted data visualizations

CONSUMING INTEGRATED DISPARATE DATA

Analytical applications use the zAPI and the ontology to produce applications that adapt to changing data

TARGETED ANALYTICAL APPLICATIONS

Apps for real business problems leveraged by everyday business users

Illuminate

Voyager Kaleidoscope

Lighthouse

A BRIEF DEMO

LEARNINGS

• There was no one technology or one database that provided a complete solution: embrace diversity
• Create a generic platform, pour effort into specialized algorithms to populate data intelligently
• Ontology-driven development can be very powerful, but data organization is still a challenge
• Indexing on a priori unknown attributes is challenging
• Data modeling is always important; large profiles had to be broken down

SUMMARY

Wrapping it all up in five points

1. Healthcare is different and has lots of critical data that is disconnected

2. Generic, MongoDB-based data storage model using meta-data

3. Data integration powered by algorithms

4. Document profiles for facts, graph for querying

5. Diverse set of end user analytical applications powered by the generic data platform

Why this matters

• Standards are really important, but slow to develop

• Huge amount of change occurring in our healthcare system

• We need to make decisions today based on available data sets despite existing challenges

THANK YOU!

Brian Roy – Strategy and architecture

Mahesh Chaudhari – Database architecture

Cesar Arevalo – Data integration implementation

The guys that made all of it come together!

CONTACT INFORMATION

Sven Junkergård
sven@zephyrhealth.com
+1.415.503.7412

Zephyr Health
450 Mission St. Suite 201
San Francisco, California 94105
+1.415.529.7649
zephyrhealth.com

BACKUP SCREEN SHOTS

ILLUMINATE – LANDING PAGE

ILLUMINATE – ALL CASES VIEW

ILLUMINATE – GRID VIEW

ILLUMINATE – GRAPH VIEW

ILLUMINATE – PROFILE VIEW
