Design solutions for building a rapid population health analytic platform using PostgreSQL and PostGIS

Janos G. Hajagos
Department of Biomedical Informatics, Stony Brook University
PGCONF 2015, March 27, 2015




DSRIP (Delivery System Reform Incentive Payment)

• New York’s CMS Medicaid 1115 Waiver (April 2014)
• Based on performance metrics, New York State and health care providers could receive up to $8 billion
• Moves health care from fee-for-service to a value-based payment model
• Each Performing Provider System (PPS) must take a data-driven approach to meeting metrics
• Integrate care across multiple settings, from inpatient to outpatient
• The application from each PPS is graded, and the grades determine the potential payout


Suffolk Care Collaborative


Time, Space, Licensing: the Final Frontiers

• NY State and SUNY have licenses for most commercial software at minimal cost (Oracle, MSSQL, ArcGIS, SAS)
• Getting the details on the licenses and their restrictions (classroom versus production use) is a challenge
• Ordering licenses and getting purchase orders takes time
• Installing commercial software is more complicated (configuration, license files, license activation, license servers)
• Keeping track of licenses can be a real pain
• My analytical group had a very tight deadline to meet deliverables!


Can Open Source Software Solutions Compete?

• Health care IT and data analytics solutions have traditionally been proprietary and expensive
    • We still run mainframes!
    • SAS: ever looked at the price?
• Health care IT is driven by fear:
    • Legal liability
    • Privacy data breaches (HIPAA)
• Population health, outside of the direct patient care context, can be a proving ground for Open Source solutions


PostgreSQL

• PostgreSQL is a mature database platform
• It is actively developed and has momentum
• Easy to install on common Linux distributions and even Windows Server
• The PostGIS spatial extension is robust and mature
• ANSI standard SQL support
• I will show how it can be used to build a rapid population health analytic platform

Image source: https://www.flickr.com/photos/pelegrino/2450972179/


Deployment Details

• System set up in May of 2014
• PostgreSQL 9.3 on Ubuntu 12.04
• Deployed within the existing hospital IT infrastructure
• VMware virtualized environment
• 8 cores
• 20 gigabytes of RAM
• 4 terabytes of data storage on a SAN


Outline of Talk

1) The use of schemas for the management and loading of multiple data sources.
2) How to process and analyze hospital discharges using the range data functions and operators.
3) The integration of American Community Survey (ACS) data and geocoded addresses with PostGIS to understand regional differences in health care delivery in Suffolk County, NY.
4) Computing behavioral health comorbidities for hospital inpatients using SQLAlchemy, CCS codes, and Tableau.


Schemas for Fine Grained Access Control


Temporal data – transitions of care


Health Care Data is Often Messy

The assumption is that inpatient stays do not overlap:

In reality data from an Electronic Health Record will look like this:


Deep Breath – PostgreSQL has you covered


Image source: https://www.flickr.com/photos/asifhaque/3078893001/


Representing an Inpatient Stay as a Range

Closed range: [14, 18]
Half-open range: [14, 19) *

* A patient can be discharged at 11:59 pm, so the discharge day counts as part of the stay.

Dealing with Date Ranges for Inpatient Stays

• Convert to Julian Day

    cast(to_char(cast("Admission Date" as date), 'J') as int) as start_julian_day,
    cast(to_char(cast("Discharge Date" as date), 'J') as int) as end_julian_day

• Constructor: int4range

    int4range(start_julian_day, end_julian_day + 1, '[)')

• Query uses the following operators:
    • && Overlaps
    • <> Not equals
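The same bookkeeping can be checked by hand with a small Python sketch. This is an illustration only, not part of the talk's code: the function names are hypothetical, and the constant 1721425 is the offset between Python's date ordinal and the Julian Day number that to_char(..., 'J') produces.

```python
from datetime import date
from typing import Tuple

Range = Tuple[int, int]  # half-open [start, end) in Julian days

# Offset between date.toordinal() and the Julian Day number.
JULIAN_DAY_OFFSET = 1721425


def julian_day(d: date) -> int:
    """Julian Day number of a calendar date."""
    return d.toordinal() + JULIAN_DAY_OFFSET


def stay_range(admission: date, discharge: date) -> Range:
    """Half-open range that still covers the whole discharge day."""
    return (julian_day(admission), julian_day(discharge) + 1)


def overlaps(a: Range, b: Range) -> bool:
    """Counterpart of the && operator for half-open integer ranges."""
    return a[0] < b[1] and b[0] < a[1]
```

With the half-open form, a stay that ends on day 18 and one that starts on day 18 overlap, while a stay that starts on day 19 does not.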


A Rich Set of Range Type Operators

http://www.postgresql.org/docs/9.3/static/functions-range.html

Computing a Normalized Inpatient Stay

    -- Widen each stay's range to its union with an overlapping stay of the same patient
    UPDATE inpatient_admission_test iat0
    SET union_day_range = t.union_day_range
    FROM (
        SELECT iat1.transaction_id,
               iat1.union_day_range + iat2.union_day_range AS union_day_range
        FROM inpatient_admission_test iat1
        JOIN inpatient_admission_test iat2
          ON iat1.patient_id = iat2.patient_id
         AND iat1.transaction_id != iat2.transaction_id
         AND iat1.union_day_range && iat2.union_day_range   -- ranges overlap
         AND iat1.union_day_range <> iat2.union_day_range   -- and are not yet identical
    ) t
    WHERE iat0.transaction_id = t.transaction_id;


Iteratively Updating union_day_range
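One way to picture the repeated update is as a loop that runs until no range changes. The sketch below is plain Python under assumed names (the stay dictionaries and normalize_stays are hypothetical) and is not the author's implementation. It updates ranges in place, so it may converge in fewer passes than repeated SQL UPDATE statements would.

```python
from typing import Dict, List


def normalize_stays(stays: List[Dict]) -> int:
    """Widen each stay's half-open (start, end) range to the union of all
    overlapping stays of the same patient. Returns the number of passes."""
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        for a in stays:
            for b in stays:
                if a is b or a["patient_id"] != b["patient_id"]:
                    continue
                ra, rb = a["range"], b["range"]
                # Same test as the && operator on half-open ranges.
                if ra[0] < rb[1] and rb[0] < ra[1]:
                    merged = (min(ra[0], rb[0]), max(ra[1], rb[1]))
                    if merged != ra:
                        a["range"] = merged
                        changed = True
    return passes


stays = [
    {"patient_id": 1, "range": (10, 15)},
    {"patient_id": 1, "range": (14, 20)},
    {"patient_id": 1, "range": (19, 25)},
    {"patient_id": 2, "range": (12, 13)},
]
normalize_stays(stays)
```

After the loop, the three chained stays of patient 1 all carry the range (10, 25), and the single stay of patient 2 is unchanged.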


Pair a Patient’s Inpatient Stay

    SELECT lower(ier1.union_day_range) - upper(ier2.union_day_range) AS days_since_paired_discharge,
           ier1.patient_id,
           ier1.union_day_range AS target_union_date,
           ier2.union_day_range AS previous_target_union_date,
           ier1.id AS target_id,
           ier2.id AS previous_id
    FROM inpatient_event_ranges ier1
    JOIN inpatient_event_ranges ier2
      ON ier1.patient_id = ier2.patient_id
     AND ier2.union_day_range << ier1.union_day_range;
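The pairing logic of this self-join can be sketched in plain Python. The record layout and the function name pair_stays are assumptions made for illustration. Ranges are half-open (start, end) pairs in Julian days, so "strictly left of" means the earlier range ends at or before the later one starts.

```python
from typing import Dict, List


def pair_stays(stays: List[Dict]) -> List[Dict]:
    """Pair each stay with every earlier, non-overlapping stay of the same
    patient, mirroring the self-join on the << operator."""
    pairs = []
    for target in stays:
        for previous in stays:
            same_patient = target["patient_id"] == previous["patient_id"]
            # previous << target for half-open ranges
            strictly_left = previous["range"][1] <= target["range"][0]
            if same_patient and strictly_left:
                pairs.append({
                    "patient_id": target["patient_id"],
                    "target_id": target["id"],
                    "previous_id": previous["id"],
                    # lower(target) - upper(previous)
                    "days_since_paired_discharge":
                        target["range"][0] - previous["range"][1],
                })
    return pairs


example = [
    {"id": 1, "patient_id": "p1", "range": (10, 15)},
    {"id": 2, "patient_id": "p1", "range": (20, 23)},
    {"id": 3, "patient_id": "p2", "range": (11, 12)},
]
pairs = pair_stays(example)
```

Keeping only the pair with the smallest days_since_paired_discharge for each target would give the most recent prior stay, which is one plausible starting point for a readmission flag.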


Chained Inpatient Stays


In Conclusion

• In less than 70 lines of code we have normalized and linked a patient’s inpatient stays
• The code has been applied to a data set of 100,000 discharges
• We are currently using this data set to build predictive analytic models for 30-day same-hospital readmissions
• Synthetic inpatient data and the SQL code are available at https://github.com/jhajagos/SynthMedTopia/


United States of America

NAD83(NSRS2007) / New York Long Island ‐ Projection


Adding Layers from PostGIS in QGIS


Suffolk County, New York


Suffolk County with Postal Code Regions


Western and Central Suffolk County


11746 – Huntington Station and Dix Hills


The Tale of Two Census‐Designated Places


Geocoding

• A TIGER-based geocoder can easily be installed
• Requires the PostGIS extension to be installed first
• A script downloads street address data from the Census Bureau’s website
• A good start is at: http://gis.stackexchange.com/questions/81907/install-postgis-and-tiger-data-in-ubuntu-12-04
• In my experience it works better with residential addresses than with business addresses
• When it fails, it fails badly
• Needs a second-level check on the quality of the match
• No limits and no privacy issues
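A second-level check could look something like the following Python sketch. It is a hypothetical helper, not part of the talk: it assumes the geocoder's convention that a lower rating means a closer match (0 is exact), and the threshold of 20 is an arbitrary value chosen for illustration.

```python
from typing import Optional


def needs_review(rating: Optional[int], input_zip: str,
                 matched_zip: Optional[str], max_rating: int = 20) -> bool:
    """Flag a geocoding result for manual review.

    Assumes a lower rating is a better match; max_rating is an
    illustrative cutoff, not a recommended value.
    """
    if rating is None or matched_zip is None:
        return True  # the geocoder returned no candidate
    if rating > max_rating:
        return True  # weak match
    # A match that lands in a different ZIP code is suspicious.
    return input_zip[:5] != matched_zip[:5]
```

Results that fail the check can be routed to manual cleanup instead of silently entering the spatial joins.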


Using the Tiger Geocoder

    SELECT (tt.geo).geomout,
           (tt.geo).rating,
           ST_Y((tt.geo).geomout) AS latitude,
           ST_X((tt.geo).geomout) AS longitude,
           tiger.pprint_addy((tt.geo).addy) AS matched_address,
           (tt.geo).addy.zip AS matched_zip5
    FROM (SELECT tiger.geocode('?? Suncrest Dr., Dix Hills, NY 11746', 1) AS geo) tt;


Geocoding Results


Median Household Income – Census Tracts

ACS Variable: B19001

Percent of Households where Spanish is the Primary Language

ACS Variable: B16002

New York City


Loading Shape Files

• Shapefiles (shp) and dBase (dbf) files
• Download the appropriate shapefiles
    • TIGER/Line Shapefiles FTP site
• GUI and command line tools:

    > shp2pgsql -s 4269:4269 -g geom -I -W LATIN1 tl_2013_us_county.shp spatial.us_counties > ~/us_counties.sql
    > psql dsrip < ~/us_counties.sql


Loading American Community Survey data

• Start with American FactFinder:
    • http://factfinder.census.gov/
• I created a tool in Python for preprocessing and bulk loading ACS data into PostgreSQL:
    • https://github.com/jhajagos/CensusGeographyTools
• The geoid allows joining ACS variables to shapefiles
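To make the join key concrete, here is a small Python sketch that splits a tract or block group GEOID into its parts. The function name is hypothetical, the tract code in the usage example is made up, and the handling of a prefix such as "1400000US" assumes the long identifier form found in ACS downloads.

```python
def split_geoid(geoid: str) -> dict:
    """Split a census tract (11-digit) or block group (12-digit) GEOID."""
    # Drop a summary-level prefix such as "1400000US" if present.
    plain = geoid.split("US")[-1]
    if len(plain) not in (11, 12) or not plain.isdigit():
        raise ValueError(f"unexpected GEOID: {geoid!r}")
    parts = {
        "state": plain[:2],    # 36 = New York
        "county": plain[2:5],  # 103 = Suffolk County
        "tract": plain[5:11],
        "geoid": plain,        # value to match against the shapefile's geoid
    }
    if len(plain) == 12:
        parts["block_group"] = plain[11]
    return parts


tract = split_geoid("1400000US36103110101")
```

The plain digits in tract["geoid"] are what would be compared with the geoid column loaded from the TIGER/Line shapefiles.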


Spatial Joins - Freedom from Postal Codes

    SELECT latitude, longitude, ge.matched_address, statefp, geoid, stl.namelsad
    FROM public.geocoding_example ge
    JOIN spatial.tl_2013_36_tract stl ON ST_Intersects(ge.geomout, stl.geom);

    SELECT latitude, longitude, ge.matched_address, statefp, geoid, stl.namelsad
    FROM public.geocoding_example ge
    JOIN spatial.tl_2013_36_bg stl ON ST_Intersects(ge.geomout, stl.geom);
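For intuition about what the spatial join decides for each geocoded point, the following toy Python sketch assigns a point to the first polygon that contains it, using a ray-casting test. It is not how PostGIS is implemented, the region identifiers are invented, and points exactly on a boundary are not handled the way ST_Intersects handles them.

```python
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, ring: Sequence[Point]) -> bool:
    """Ray casting: count crossings of a ray running east from the point."""
    x, y = point
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[i - 1]
        if (y1 > y) != (y2 > y):  # edge spans the ray's latitude
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_region(point: Point,
                  regions: Dict[str, Sequence[Point]]) -> Optional[str]:
    """Return the id of the first region whose polygon contains the point."""
    for region_id, ring in regions.items():
        if point_in_polygon(point, ring):
            return region_id
    return None


# Two adjacent unit squares standing in for census tracts.
regions = {
    "tract_a": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "tract_b": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
```

Because the assignment depends only on coordinates and polygon boundaries, the same point can be joined to tracts, block groups, or any other layer without going through postal codes.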


ST_Intersection

    CREATE TABLE ny_county_trimmed_to_land AS
    SELECT tb.gid, tb.statefp, tb.countyfp, tb.geoid,
           tb.namelsad AS name,
           ST_Intersection(lp.new_york_state_land_area_geom, tb.geom) AS geom
    FROM spatial.us_counties tb,
         new_york_state_land_area lp
    WHERE tb.statefp = '36';


Spatial Shape Processing

• PostGIS supports Open Geospatial Consortium (OGC) standards
• Allows more sophisticated processing of spatial shape data:
    • Bounding boxes
    • Intersections
    • Unions
    • Finding midpoints

Image source: https://www.flickr.com/photos/kulakovich/2152075315


SPARCS - Statewide Planning and Research Cooperative System

• Inpatient discharges are uploaded from every acute care hospital in New York State
• Data is processed and stored by the New York State Department of Health
• Data is made available in several forms:
    • Fully identified
    • Obscured dates (month only) and hashed personal identifiers
    • Personal identifiers and month and day removed
• Used throughout the state for research and health care system planning


Health care Data Files are Delivered Flat


SPARCS Table Normalization using SQLAlchemy

• Load SPARCS data into PostgreSQL:
    • https://github.com/jhajagos/ny_sparcs_import
• Generates SQL on the fly to normalize repeated columns:
    • https://github.com/jhajagos/agile_data_tools/blob/master/normalize_table_from_columns_that_repeats.py
• SQLAlchemy, a Python library, allows introspection of table structure and data types
• Python has several mature PostgreSQL drivers:
    • pg8000, a pure Python library
    • psycopg2
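The idea of generating normalization SQL from repeated column names can be sketched as below. This is not the linked script: the function, the table names, and the diagnosis_code_N columns are hypothetical, and the column list is passed in by hand where the real tool would obtain it through SQLAlchemy introspection. Identifiers are interpolated without validation, which is acceptable only for a sketch.

```python
import re
from typing import List


def build_normalize_sql(source: str, target: str, key: str,
                        columns: List[str], stem: str) -> str:
    """Build SQL that turns repeated columns (stem_1, stem_2, ...) into rows."""
    pattern = re.compile(rf"^{re.escape(stem)}_(\d+)$")
    selects = []
    for column in columns:
        match = pattern.match(column)
        if match:
            position = int(match.group(1))
            selects.append(
                f'SELECT "{key}", {position} AS sequence_id, '
                f'"{column}" AS {stem} FROM {source} '
                f'WHERE "{column}" IS NOT NULL'
            )
    if not selects:
        raise ValueError(f"no columns matching {stem}_<n>")
    return f"CREATE TABLE {target} AS\n" + "\nUNION ALL\n".join(selects) + ";"


sql = build_normalize_sql(
    "sparcs.discharges", "sparcs.diagnoses", "discharge_id",
    ["discharge_id", "diagnosis_code_1", "diagnosis_code_2", "age"],
    "diagnosis_code",
)
```

Each repeated column becomes one SELECT, and the sequence_id keeps the original position so the order of the codes is not lost in the long table.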


Connecting to traditional BI tools

• Tableau


Final Goal is Insight into the Population


Parting Thoughts

• Health care data for population health is not that big
• Big data platforms do not currently offer fine-grained spatial and temporal data handling
• SQL still rules, along with the OGC spatial extensions
• Health care data analysts need to develop a community of sharing
    • Generation of synthetic data
• Keep improving PostgreSQL!


Acknowledgements

• Stony Brook University Department of Biomedical Informatics
• Stony Brook Medicine Information Technology