Design solutions for building a rapid population health analytic platform using PostgreSQL and PostGIS
Janos G. Hajagos
Department of Biomedical Informatics
Stony Brook University
PGCONF 2015
March 27, 2015
1
DSRIP (Delivery System Reform Incentive Payment)
• New York’s CMS Medicaid 1115 Waiver (April 2014)
• Based on performance metrics, New York State and health care providers could receive up to $8 billion
• Move health care from fee-for-service to a value-based payment model
• Each Performing Provider System (PPS) must take a data-driven approach to meeting metrics
• Integrate care across multiple settings, from inpatient to outpatient
• The application from each PPS is graded, and the grades will determine the potential payout
2
Suffolk Care Collaborative
3
Time, Space, Licensing: the Final Frontiers
• NY State and SUNY have licenses for most commercial software at little cost (Oracle, MSSQL, ArcGIS, SAS)
• Getting the details on the licenses and their restrictions (classroom versus production use) is a challenge
• Ordering licenses and getting purchase orders takes time
• Installing commercial software is more complicated (configuration, license files, license activation, license servers)
• Keeping track of licenses can be a real pain
• My analytical group had a very tight deadline to meet deliverables!
4
Can Open Source Software Solutions Compete?
• Health care IT and data analytics solutions have traditionally been proprietary and expensive
  • We still run mainframes!
  • SAS: ever looked at the price?
• Health care IT is driven by fear:
  • Legal liability
  • Privacy data breaches (HIPAA)
• Population health, outside of the direct patient care context, can be a proving ground for Open Source solutions
5
PostgreSQL
• PostgreSQL is a mature database platform
• It is actively developed and has momentum
• Easy to install on common Linux distributions and even Windows Server
• PostGIS spatial extension is robust and mature
• ANSI Standard SQL Support
• I will show how it can be used to build a rapid population health analytic platform
Image source: https://www.flickr.com/photos/pelegrino/2450972179/
6
Deployment Details
• System set up in May of 2014
• PostgreSQL 9.3 on Ubuntu 12.04
• Deployed within the existing hospital IT infrastructure
• VMware virtualized environment
  • 8 cores
  • 20 gigabytes of RAM
  • 4 terabytes of data storage on a SAN
7
Outline of Talk
1) The use of schemas for the management and loading of multiple data sources.
2) How to process and analyze hospital discharges using the range data functions and operators.
3) The integration of American Community Survey (ACS) data and geocoded addresses with PostGIS so as to understand regional differences in health care delivery in Suffolk County, NY.
4) Computing behavioral health comorbidities for hospital inpatients using SQLAlchemy, CCS codes and Tableau.
8
Schemas for Fine Grained Access Control
9
Temporal data – transitions of care
10
Health Care Data is Often Messy
The assumption is that inpatient stays do not overlap:
In reality data from an Electronic Health Record will look like this:
11
Deep Breath – PostgreSQL has you covered
12
Image source: https://www.flickr.com/photos/asifhaque/3078893001/
Representing an Inpatient Stay as a Range
13
[14, 18]
[14, 19) *
* A patient can be discharged at 11:59 pm
Dealing with Date Ranges for Inpatient Stays
• Convert to Julian Day
  cast(to_char(cast("Admission Date" as date), 'J') as int) as start_julian_day,
  cast(to_char(cast("Discharge Date" as date), 'J') as int) as end_julian_day
• Constructor: int4range
  int4range(start_julian_day, end_julian_day + 1, '[)')
• Query uses the following operators:
  • && Overlaps
  • <> Not equals
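The Julian-day conversion and the half-open range can be mirrored outside the database for a quick sanity check. The sketch below is illustrative only: the function names and the sample dates are assumptions, not part of the original code. The offset 1721425 converts Python's date ordinal into the Julian Day number that to_char(..., 'J') produces.

```python
from datetime import date
from typing import Tuple

# Offset between Python's proleptic Gregorian ordinal and the Julian Day
# number returned by PostgreSQL's to_char(some_date, 'J').
JULIAN_OFFSET = 1721425

Range = Tuple[int, int]  # half-open [lower, upper), like int4range(..., '[)')


def julian_day(d: date) -> int:
    """Julian Day number for a calendar date."""
    return d.toordinal() + JULIAN_OFFSET


def stay_range(admitted: date, discharged: date) -> Range:
    """Half-open day range; the +1 keeps the discharge day inside the range."""
    return (julian_day(admitted), julian_day(discharged) + 1)


def overlaps(a: Range, b: Range) -> bool:
    """Counterpart of the && operator for non-empty half-open ranges."""
    return a[0] < b[1] and b[0] < a[1]


# Hypothetical stays: the second starts on the day the first ends.
first = stay_range(date(2014, 5, 14), date(2014, 5, 18))
second = stay_range(date(2014, 5, 18), date(2014, 5, 21))
third = stay_range(date(2014, 5, 19), date(2014, 5, 21))

print(overlaps(first, second))  # both cover May 18
print(overlaps(first, third))   # adjacent days, no shared day
```

Because the upper bound is exclusive, two stays that merely touch on consecutive days are not reported as overlapping, while a same-day discharge and readmission is.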
14
A Rich Set of Range Types Operators
15
http://www.postgresql.org/docs/9.3/static/functions-range.html
Computing a Normalized Inpatient Stay
UPDATE inpatient_admission_test iat0
SET union_day_range = t.union_day_range
FROM (
  SELECT iat1.transaction_id,
         iat1.union_day_range + iat2.union_day_range AS union_day_range
  FROM inpatient_admission_test iat1
  JOIN inpatient_admission_test iat2
    ON iat1.patient_id = iat2.patient_id
   AND iat1.transaction_id != iat2.transaction_id
   AND iat1.union_day_range && iat2.union_day_range
   AND iat1.union_day_range <> iat2.union_day_range
) t
WHERE iat0.transaction_id = t.transaction_id;
16
Iteratively Updating union_day_range
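The idea of repeating the update until nothing changes can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from the talk: rows are dictionaries using the column names from the UPDATE statement, ranges are half-open (lower, upper) tuples, and each pass widens a row with every overlapping range of the same patient.

```python
def overlaps(a, b):
    """Counterpart of && for non-empty half-open (lower, upper) ranges."""
    return a[0] < b[1] and b[0] < a[1]


def union(a, b):
    """Counterpart of range + range for two overlapping ranges."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def normalize_stays(rows):
    """Widen each stay to cover every stay it overlaps, repeating until stable."""
    rows = [dict(r) for r in rows]
    changed = True
    while changed:
        changed = False
        # Each pass reads the values as they were before the pass started.
        snapshot = [dict(r) for r in rows]
        for row in rows:
            for other in snapshot:
                if (row["patient_id"] == other["patient_id"]
                        and row["transaction_id"] != other["transaction_id"]
                        and overlaps(row["union_day_range"], other["union_day_range"])):
                    merged = union(row["union_day_range"], other["union_day_range"])
                    if merged != row["union_day_range"]:
                        row["union_day_range"] = merged
                        changed = True
    return rows


# Hypothetical data: three chained stays, one separate stay, one other patient.
stays = [
    {"patient_id": 1, "transaction_id": "a", "union_day_range": (10, 15)},
    {"patient_id": 1, "transaction_id": "b", "union_day_range": (14, 20)},
    {"patient_id": 1, "transaction_id": "c", "union_day_range": (19, 25)},
    {"patient_id": 1, "transaction_id": "d", "union_day_range": (40, 42)},
    {"patient_id": 2, "transaction_id": "e", "union_day_range": (14, 20)},
]
normalized = {r["transaction_id"]: r["union_day_range"] for r in normalize_stays(stays)}
print(normalized)
```

Ranges only ever grow and are bounded by the patient's earliest admission and latest discharge, so the loop terminates. Stays "a" and "c" never overlap directly; they end up with the same range because both overlap "b".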
17
Pair a Patient’s Inpatient Stay
SELECT lower(ier1.union_day_range) - upper(ier2.union_day_range) AS days_since_paired_discharge,
       ier1.patient_id,
       ier1.union_day_range AS target_union_date,
       ier2.union_day_range AS previous_target_union_date,
       ier1.id AS target_id, ier2.id AS previous_id
FROM inpatient_event_ranges ier1
JOIN inpatient_event_ranges ier2
  ON ier1.patient_id = ier2.patient_id
 AND ier2.union_day_range << ier1.union_day_range;
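The pairing logic can be checked on a handful of rows in Python. The sketch below is an assumption-laden illustration: the tuple layout, the sample stays, and the 30-day cut-off filter are hypothetical, and ranges are half-open (lower, upper) pairs as produced by the int4range constructor.

```python
def strictly_left_of(a, b):
    """Counterpart of << for half-open ranges: a ends before b starts."""
    return a[1] <= b[0]


def pair_stays(stays):
    """Pair each stay with every earlier stay of the same patient.

    stays: list of (stay_id, patient_id, (lower, upper)) tuples.
    """
    pairs = []
    for target_id, patient_id, target in stays:
        for previous_id, previous_patient_id, previous in stays:
            if patient_id == previous_patient_id and strictly_left_of(previous, target):
                pairs.append({
                    "patient_id": patient_id,
                    "target_id": target_id,
                    "previous_id": previous_id,
                    # Same arithmetic as lower(ier1...) - upper(ier2...).
                    "days_since_paired_discharge": target[0] - previous[1],
                })
    return pairs


# Hypothetical normalized stays.
stays = [
    (1, 7, (100, 105)),
    (2, 7, (110, 113)),
    (3, 7, (150, 152)),
    (4, 9, (101, 104)),
]
pairs = pair_stays(stays)
summary = [(p["target_id"], p["previous_id"], p["days_since_paired_discharge"]) for p in pairs]
within_30 = [(t, p) for t, p, days in summary if days <= 30]
print(summary)
print(within_30)
```

Like the join, this pairs a stay with every earlier stay of the patient, not only the most recent one, so a later step still has to pick the nearest previous discharge. Because the upper bound is exclusive, a value of 0 means the patient was admitted on the day after discharge.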
18
Chained Inpatient Stays
19
In Conclusion
• In fewer than 70 lines of code we have normalized and linked a patient’s inpatient stays
• The code has been applied to a data set of 100,000 discharges
• We are currently using this data set to build predictive analytic models for 30-day same-hospital readmissions
• Synthetic inpatient data and SQL code are developed at:
  • https://github.com/jhajagos/SynthMedTopia/
20
United States of America
NAD83(NSRS2007) / New York Long Island ‐ Projection
21
Adding Layers from PostGIS in QGIS
22
Suffolk County, New York
23
Suffolk County with Postal Code Regions
24
Western and Central Suffolk County
25
11746 – Huntington Station and Dix Hills
26
The Tale of Two Census‐Designated Places
27
28
29
Geocoding
• The TIGER-based geocoder can easily be installed
  • Requires the PostGIS extension to be installed first
  • A script downloads street address data from the Census Bureau’s website
  • A good start is at: http://gis.stackexchange.com/questions/81907/install-postgis-and-tiger-data-in-ubuntu-12-04
• In my experience it works better with residential addresses than business addresses
• When it fails it fails badly
  • Needs a second-level check on the quality of the match
• No usage limits and no privacy issues (addresses are geocoded locally)
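One possible shape for the second-level check is sketched below. It is only an illustration: the rating cut-off, the function names, and the sample addresses are hypothetical. It relies on the geocoder's rating, where lower values mean a better match and 0 is an exact match, and on comparing the matched ZIP code with the one in the input address.

```python
import re

# Hypothetical cut-off; a suitable value has to be tuned against real results.
MAX_RATING = 20


def zip5(address):
    """Return the trailing 5-digit ZIP code of an address string, if any."""
    match = re.search(r"\b(\d{5})(?:-\d{4})?\s*$", address)
    return match.group(1) if match else None


def accept_match(input_address, rating, matched_zip, max_rating=MAX_RATING):
    """Accept a geocoder result only if it is well rated and the ZIP agrees."""
    if rating is None or rating > max_rating:
        return False
    expected_zip = zip5(input_address)
    # Without a ZIP in the input there is nothing to compare against.
    return expected_zip is None or expected_zip == matched_zip


# Placeholder address for illustration only.
address = "1 Main St, Dix Hills, NY 11746"
print(accept_match(address, 5, "11746"))   # good rating, same ZIP
print(accept_match(address, 60, "11746"))  # poor rating
print(accept_match(address, 5, "11743"))   # ZIP disagrees
```

Rejected rows can then be routed to manual review instead of silently entering the spatial analysis.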
30
Using the Tiger Geocoder
SELECT (tt.geo).geomout, (tt.geo).rating,
       ST_Y((tt.geo).geomout) AS latitude,
       ST_X((tt.geo).geomout) AS longitude,
       tiger.pprint_addy((tt.geo).addy) AS matched_address,
       (tt.geo).addy.zip AS matched_zip5
FROM (SELECT tiger.geocode('?? Suncrest Dr., Dix Hills, NY 11746', 1) AS geo) tt;
31
Geocoding Results
32
Median Household Income – Census Tracts
33
ACS Variable: B19001
Percent of Households where Spanish is the Primary Language
34
ACS Variable: B16002
New York City
35
Loading Shape Files
• Shapefiles (shp) and dBase dbf files
• Download the appropriate shapefiles
  • TIGER/Line Shapefiles FTP site
• GUI / command line tools
  > shp2pgsql -s 4269:4269 -g geom -I -W LATIN1 tl_2013_us_county.shp spatial.us_counties > ~/us_counties.sql
  > psql dsrip < ~/us_counties.sql
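When several shapefiles need loading, the shp2pgsql invocation can be generated rather than typed. The sketch below only builds and prints the command lines; it does not run them. The helper name and the list of files are assumptions, while the flags are the ones used on the slide (-s SRID, -g geometry column, -I spatial index, -W source encoding).

```python
import shlex


def shp2pgsql_command(shapefile, table, srid="4269:4269",
                      geom_column="geom", encoding="LATIN1"):
    """Build the argument list for one shp2pgsql call (nothing is executed)."""
    return ["shp2pgsql", "-s", srid, "-g", geom_column, "-I",
            "-W", encoding, shapefile, table]


# Hypothetical batch: county, tract, and block group files for 2013.
batch = [
    ("tl_2013_us_county.shp", "spatial.us_counties"),
    ("tl_2013_36_tract.shp", "spatial.tl_2013_36_tract"),
    ("tl_2013_36_bg.shp", "spatial.tl_2013_36_bg"),
]
commands = [shlex.join(shp2pgsql_command(shp, table)) for shp, table in batch]
for line in commands:
    print(line)
```

Each printed line can be redirected to a .sql file and fed to psql as on the slide.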
36
Loading American Community Survey data
• Start with American FactFinder:
  • http://factfinder.census.gov/
• I created a tool in Python for preprocessing and bulk loading ACS data into PostgreSQL
  • https://github.com/jhajagos/CensusGeographyTools
• The geoid allows joining of ACS variables to shapefiles
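The join on geoid can be illustrated with a few in-memory rows. This sketch rests on assumptions: it supposes that the ACS export carries the long identifier form (a summary-level prefix ending in "US" followed by the FIPS code) while the shapefile carries the bare FIPS code, and every identifier, name, and value below is made up for illustration.

```python
def tiger_geoid(acs_geo_id):
    """Strip the summary-level prefix (everything up to 'US') from an ACS id."""
    return acs_geo_id.split("US", 1)[-1]


# Hypothetical ACS rows and hypothetical tract rows from a shapefile table.
acs_rows = [
    {"geo_id": "1400000US36103000001", "median_household_income": 50000},
    {"geo_id": "1400000US36103000002", "median_household_income": 90000},
]
tract_rows = [
    {"geoid": "36103000001", "namelsad": "Census Tract A"},
    {"geoid": "36103000002", "namelsad": "Census Tract B"},
    {"geoid": "36103000003", "namelsad": "Census Tract C"},
]

# Index ACS values by the shapefile-style geoid, then attach them to tracts.
acs_by_geoid = {tiger_geoid(r["geo_id"]): r for r in acs_rows}
joined = [
    {**tract, "median_household_income": acs_by_geoid[tract["geoid"]]["median_household_income"]}
    for tract in tract_rows
    if tract["geoid"] in acs_by_geoid
]
print(joined)
```

In the database the same step is an ordinary equality join between the ACS table and the shapefile table once both carry the identifier in the same form.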
37
Spatial Joins - Freedom from Postal Codes
SELECT latitude, longitude, ge.matched_address, statefp, geoid, stl.namelsad
FROM public.geocoding_example ge
JOIN spatial.tl_2013_36_tract stl ON ST_Intersects(ge.geomout, stl.geom);

SELECT latitude, longitude, ge.matched_address, statefp, geoid, stl.namelsad
FROM public.geocoding_example ge
JOIN spatial.tl_2013_36_bg stl ON ST_Intersects(ge.geomout, stl.geom);
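To make concrete what the spatial join decides for each geocoded point, here is a minimal point-in-polygon sketch using the classic ray-casting test. It is not how PostGIS implements ST_Intersects; the region names and square outlines are hypothetical, and points lying exactly on a boundary are not handled reliably.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: count crossings of a ray going right from (x, y)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def locate(x, y, regions):
    """Return the name of the first region containing the point, else None."""
    for name, ring in regions.items():
        if point_in_polygon(x, y, ring):
            return name
    return None


# Hypothetical regions: two adjacent unit squares standing in for tracts.
regions = {
    "tract_a": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "tract_b": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
print(locate(0.5, 0.5, regions))
print(locate(1.5, 0.2, regions))
print(locate(3.0, 3.0, regions))
```

The database version scales far better because a spatial index narrows the candidate polygons before any exact test is run.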
38
39
40
41
ST_intersection
42
CREATE TABLE ny_county_trimmed_to_land AS
SELECT tb.gid, tb.statefp, tb.countyfp, tb.geoid,
       tb.namelsad AS name,
       ST_Intersection(lp.new_york_state_land_area_geom, tb.geom) AS geom
FROM spatial.us_counties tb,
     new_york_state_land_area lp
WHERE tb.statefp = '36';
Spatial Shape Processing
• PostGIS supports Open Geospatial Consortium (OGC) standards
• Allows more sophisticated processing of spatial shape data
  • Bounding boxes
  • Intersections
  • Unions
  • Finding midpoints
43
https://www.flickr.com/photos/kulakovich/2152075315
SPARCS - Statewide Planning and Research Cooperative System
• Inpatient discharges are uploaded from every acute care hospital in New York State
• Data is processed and stored by the New York State Department of Health
• Data is made available in several forms:
  • Fully identified
  • Obscured dates (month only) and hashed personal identifiers
  • Personal identifiers and month and day removed
• Used throughout the state for research and health care system planning
44
Health Care Data Files are Delivered Flat
45
SPARCS Table Normalization using SQLAlchemy
• Load SPARCS data into PostgreSQL:
  • https://github.com/jhajagos/ny_sparcs_import
• Generates SQL on the fly to normalize repeated columns
  • https://github.com/jhajagos/agile_data_tools/blob/master/normalize_table_from_columns_that_repeats.py
• SQLAlchemy, a Python library, allows introspection of table structure and data types
• Python has several mature PostgreSQL drivers
  • pg8000: pure Python based library
  • psycopg2
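Generating SQL for repeated columns can be sketched without a database connection. The example below is not the linked script: the table name, key column, and column names are hypothetical, and the column list is hard-coded where a real tool would obtain it by introspecting the table. It groups columns that differ only by a numeric suffix and emits one UNION ALL statement that turns them into rows.

```python
import re
from collections import defaultdict


def repeated_column_groups(columns):
    """Group columns named <base>_<n> by base, ordered by n."""
    groups = defaultdict(list)
    for name in columns:
        match = re.fullmatch(r"(.+?)_(\d+)", name)
        if match:
            groups[match.group(1)].append((int(match.group(2)), name))
    return {base: [name for _, name in sorted(cols)]
            for base, cols in groups.items() if len(cols) > 1}


def normalize_sql(table, key, base, columns):
    """Emit SQL that stacks the repeated columns into one long table."""
    selects = [
        f'SELECT "{key}", {i} AS sequence_id, "{col}" AS "{base}" '
        f'FROM "{table}" WHERE "{col}" IS NOT NULL'
        for i, col in enumerate(columns, start=1)
    ]
    return f'CREATE TABLE "{table}_{base}" AS\n' + "\nUNION ALL\n".join(selects) + ";"


# Hypothetical flat discharge table.
columns = ["discharge_id", "diagnosis_code_1", "diagnosis_code_2",
           "diagnosis_code_10", "zip_code"]
groups = repeated_column_groups(columns)
sql = normalize_sql("discharges", "discharge_id",
                    "diagnosis_code", groups["diagnosis_code"])
print(sql)
```

Sorting by the numeric suffix keeps diagnosis_code_10 after diagnosis_code_2, so sequence_id follows the order in which the codes appear in the flat file.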
46
Connecting to traditional BI tools
• Tableau
47
Final Goal is Insight into the Population
48
Parting Thoughts
• Health care data for population health is not that big
  • Big data platforms do not currently offer fine-grained spatial and temporal data handling
• SQL still rules / OGC spatial extensions
• Health care data analysts need to develop a community of sharing
  • Generation of synthetic data
• Keep improving PostgreSQL!
49
Acknowledgements
• Stony Brook University Department of Biomedical Informatics
• Stony Brook Medicine Information Technology
50