INGESTING COMPLEX HEALTHCARE DATA WITH APACHE KAFKA
Micah Whitacre @mkwhit #kafkasummit


Page 1: Ingesting Healthcare Data, Micah Whitacre

INGESTING COMPLEX HEALTHCARE DATA WITH APACHE KAFKA

Micah Whitacre @mkwhit

#kafkasummit

Page 2: Ingesting Healthcare Data, Micah Whitacre

Leader in Healthcare IT

~30% of all US Healthcare Data in a Cerner Solution

Page 3: Ingesting Healthcare Data, Micah Whitacre

Sepsis Alerting (minutes)

Doctor’s Office

Minute Clinic

ER

Hospital

Specialist

Page 4: Ingesting Healthcare Data, Micah Whitacre

Ambulatory (<2 seconds)

Page 5: Ingesting Healthcare Data, Micah Whitacre

Ambulatory (<2 seconds)

Page 6: Ingesting Healthcare Data, Micah Whitacre

Ambulatory (<2 seconds)

Page 7: Ingesting Healthcare Data, Micah Whitacre

Table → Table.NOTIFY

(Google Percolator-inspired notification pattern on NoSQL)

Page 8: Ingesting Healthcare Data, Micah Whitacre

Table → Table.NOTIFY

Page 9: Ingesting Healthcare Data, Micah Whitacre

Table → Table.NOTIFY → Collector → HTTP

Page 10: Ingesting Healthcare Data, Micah Whitacre

Was successful… for a while

Progressed from minutes to seconds

Hit a wall that prevented going any faster (missed SLAs)

Page 11: Ingesting Healthcare Data, Micah Whitacre

[Diagram: three NoSQL clusters, each with its own Collector and Crawler]

Page 12: Ingesting Healthcare Data, Micah Whitacre

[Diagram: Solution A, Solution B, and Solution C, each with its own Collector and Crawler]

Page 13: Ingesting Healthcare Data, Micah Whitacre

Use the right tool for the job!

NoSQL != Distributed Queue

Anti-patterns apply to everyone eventually

Page 14: Ingesting Healthcare Data, Micah Whitacre

Our scalability should not impact crawlers

Cluster sprawl should be avoided

Reduce the number of copies

Page 15: Ingesting Healthcare Data, Micah Whitacre

Table → Table.NOTIFY (NoSQL)

Page 16: Ingesting Healthcare Data, Micah Whitacre

Table → Kafka Topic

Page 17: Ingesting Healthcare Data, Micah Whitacre

Kafka-Based Notifications

● Kafka topic per listener
● Small Google Protobuf payloads
  ○ Gzip compression for a higher compression ratio (see the producer sketch below)
● Could minimize to fewer listeners
  ○ Single topic and partition vs. hundreds of NoSQL rows
● Able to give up fairness concerns in favor of speed
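As a rough illustration of the bullets above, here is a minimal Java producer sketch: byte[] values carrying a serialized Protobuf message, with gzip compression enabled on the producer. The topic name, broker address, and placeholder payload bytes are assumptions rather than details from the talk.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ListenerNotifier {

    // Hypothetical topic name: one notification topic per listener.
    private static final String TOPIC = "listener-a-notifications";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        // Gzip batch compression: trades CPU for a higher compression ratio
        // on the small Protobuf payloads.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");

        try (Producer<String, byte[]> producer = new KafkaProducer<>(props)) {
            // In the real pipeline this would be a generated Protobuf message
            // serialized via toByteArray(); a placeholder byte[] is used here.
            byte[] protobufPayload = new byte[] {0x08, 0x01};
            producer.send(new ProducerRecord<>(TOPIC, "record-id", protobufPayload));
        }
    }
}
```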

Page 18: Ingesting Healthcare Data, Micah Whitacre

[Diagram: three NoSQL clusters, three Crawlers, and a single shared Collector]

Page 19: Ingesting Healthcare Data, Micah Whitacre

Kafka Staging Area

● Single location for one copy of the data
● Consumption based on type and source of data
  ○ ~500 types and 100-1,000 sources
  ○ Chose source-based topics to cut down on topic count
  ○ Default of 8 partitions (see the topic setup sketch after this list)
● Snappy compression for low latency
● Huge variation in data sizes and frequency
  ○ Infrequent MB-GB file uploads (daily, weekly, monthly, yearly)
  ○ Streaming uploads of 100 B-10 MB
● Time-based retention to prevent data loss
  ○ Ambitiously set to 30 days but lowered to 7 days
  ○ Archive data to HDFS for reprocessing or lagging/offline consumers
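A minimal sketch, assuming Kafka's AdminClient, of what one source-based staging topic could look like with the settings above: 8 partitions, snappy compression, and 7-day retention. The topic name, broker address, and replication factor of 3 are assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class StagingTopicSetup {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, String> configs = new HashMap<>();
            // Snappy favors low latency over compression ratio.
            configs.put("compression.type", "snappy");
            // Time-based retention, lowered from 30 days to 7 days.
            configs.put("retention.ms", String.valueOf(TimeUnit.DAYS.toMillis(7)));

            // Hypothetical source-based topic name; 8 partitions was the default,
            // the replication factor of 3 is an assumption.
            NewTopic topic = new NewTopic("staging.source-42", 8, (short) 3)
                    .configs(configs);

            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```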

Page 20: Ingesting Healthcare Data, Micah Whitacre

Kafka Payloads and Delivery

● Avro schema to wrap ingested data
  ○ Source, Type, Id, Version, Value (byte[]), Metadata (byte[]), Properties
  ○ Common payload regardless of the actual byte[]
● Set a threshold for payloads stored in Kafka
  ○ Store 95-98% of data directly in Kafka
  ○ Data larger than 50 MB is stored in HDFS, with the path stored in the Avro wrapper (see the wrapper sketch after this list)
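A sketch of how such an Avro wrapper and size threshold might look, using Avro's SchemaBuilder and GenericRecord. The field list follows the slide, but the namespace, the hdfsPath field name, and the HDFS path layout are assumptions, and the actual HDFS write is omitted.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class IngestEnvelope {

    // Payloads above this size go to HDFS; only the path travels through Kafka.
    private static final int MAX_INLINE_BYTES = 50 * 1024 * 1024;

    // Field names follow the slide; namespace and hdfsPath are assumptions.
    static final Schema SCHEMA = SchemaBuilder.record("IngestEnvelope")
            .namespace("com.example.ingest")
            .fields()
            .requiredString("source")
            .requiredString("type")
            .requiredString("id")
            .requiredLong("version")
            .optionalBytes("value")      // inline payload, if small enough
            .optionalBytes("metadata")
            .optionalString("hdfsPath")  // set instead of value for large payloads
            .name("properties").type().map().values().stringType().noDefault()
            .endRecord();

    static GenericRecord wrap(String source, String type, String id,
                              long version, byte[] payload) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("source", source);
        record.put("type", type);
        record.put("id", id);
        record.put("version", version);
        record.put("properties", new HashMap<String, String>());

        if (payload.length <= MAX_INLINE_BYTES) {
            // The common case (95-98% of data): carry the raw bytes in Kafka.
            record.put("value", ByteBuffer.wrap(payload));
        } else {
            // Oversized payload: persist to HDFS (omitted here) and ship the path.
            record.put("hdfsPath", "/ingest/staging/" + source + "/" + id);
        }
        return record;
    }
}
```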

Page 21: Ingesting Healthcare Data, Micah Whitacre

Most Surprising Lesson Learned

● Rate of ingestion changes with Kafka
  ○ Lack of backpressure can increase the rate of ingestion
  ○ Capacity and retention planning could end up inaccurate

Page 22: Ingesting Healthcare Data, Micah Whitacre

[Chart: Initial Crawl - NoSQL; msg/sec over weeks, peaking while crawling all historical data and dropping once crawling only recent changes]

Page 23: Ingesting Healthcare Data, Micah Whitacre

Total Storage Needed in Kafka =
Rate of Data Ingested Per Day by Source × Number of Sources × Number of Days to Keep in Kafka

Page 24: Ingesting Healthcare Data, Micah Whitacre

[Chart: Initial Crawl - Kafka; msg/sec over days (crawls shortened from weeks to days), peaking while crawling all historical data and dropping once crawling only recent changes]

Page 25: Ingesting Healthcare Data, Micah Whitacre

Total Storage Needed in Kafka =
Rate of Data Ingested Per Day by Source × Number of Sources × Number of Days to Keep in Kafka

10-30x higher than planned during the initial crawl
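As a back-of-envelope reading of this formula, the sketch below plugs in numbers from the "Current Stats" slide later in the deck (roughly 1.2 TB/day across 440 sources) and the 7-day retention chosen for the staging area, then applies the 10-30x crawl multiplier to the rate term. The even per-source split and the interpretation of 10-30x as a rate multiplier are assumptions, and replication overhead is ignored.

```java
public class KafkaCapacityEstimate {

    public static void main(String[] args) {
        // ~1.2 TB/day spread across ~440 sources (figures from the stats slide).
        double gbPerDayPerSource = 1_200.0 / 440;   // ~2.7 GB/day per source
        int sources = 440;
        int retentionDays = 7;

        // Total storage = rate per day per source x sources x retention days.
        double steadyStateGb = gbPerDayPerSource * sources * retentionDays;

        // During an initial crawl the ingest rate ran 10-30x above steady state,
        // which is what made the original capacity plan inaccurate.
        double crawlLowGb = steadyStateGb * 10;
        double crawlHighGb = steadyStateGb * 30;

        System.out.printf("Steady state: ~%.0f GB%n", steadyStateGb);   // ~8,400 GB
        System.out.printf("Initial crawl: ~%.0f to %.0f GB%n", crawlLowGb, crawlHighGb);
    }
}
```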

Page 26: Ingesting Healthcare Data, Micah Whitacre

Kafka Storage Woes

● Monitor ALL THE THINGS
  ○ Broker free space
  ○ Disk usage per topic
  ○ Consumer lag in message count and max latency (a lag-check sketch follows this list)
  ○ Rate of data per source to detect anomalies vs. steady state
● Re-evaluate the default retention with more evidence
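One of the bullets above, consumer lag in message count, can be measured by comparing a group's committed offsets against the latest broker offsets; below is a sketch using Kafka's AdminClient. The consumer group name and broker address are assumptions, and this covers only the message-count side of lag, not max latency.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        // Hypothetical consumer group name for the archive job.
        String groupId = "staging-archiver";

        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the group has committed so far.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();

            // Latest (end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResultInfo> ends =
                    admin.listOffsets(latestSpec).all().get();

            // Lag in message count per partition; alerting thresholds would be
            // layered on top of numbers like these.
            for (TopicPartition tp : committed.keySet()) {
                long lag = ends.get(tp).offset() - committed.get(tp).offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            }
        }
    }
}
```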

Page 27: Ingesting Healthcare Data, Micah Whitacre

Kafka Storage Woes: Solutions

● When storage gets tight, know your options
  ○ Automate building new servers
  ○ Adjust the retention policy for a topic (or topics)
● Balancing partitions is hard to do by hand
  ○ Balance in small batches
  ○ Automate, Automate, Automate (see the reassignment sketch below)
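For the "balance in small batches, automate" point: at the time this would most likely have been driven through Kafka's partition-reassignment CLI tooling, but newer releases expose the same operation through the AdminClient, which makes the automation easier to sketch. The topic, partition number, and target broker ids below are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class SmallBatchReassignment {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Move a single partition at a time (a "small batch") onto a new
            // replica set instead of rebalancing everything at once.
            TopicPartition partition = new TopicPartition("staging.source-42", 3);
            NewPartitionReassignment target =
                    new NewPartitionReassignment(Arrays.asList(4, 5, 6));

            admin.alterPartitionReassignments(
                    Map.of(partition, Optional.of(target))).all().get();
        }
    }
}
```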

Page 28: Ingesting Healthcare Data, Micah Whitacre

[Diagram: three NoSQL clusters, three Crawlers, and a single shared Collector]

Page 29: Ingesting Healthcare Data, Micah Whitacre

[Diagram: two NoSQL clusters plus a Kafka cluster, with a single Collector and three Crawlers]

Page 30: Ingesting Healthcare Data, Micah Whitacre

[Diagram: DataCenter A and DataCenter B, each with three NoSQL clusters, a Collector, and three Crawlers]

Page 31: Ingesting Healthcare Data, Micah Whitacre

Current Stats

● Deployed in 3 (soon to be 4) data centers
● 440 sources currently (⅓ of all clients)
● Ingesting 2 billion messages per day
  ○ Spiked as high as 6 billion
● Ingesting 1.2 TB/day of raw data
● Archive job runs hourly and takes ~10 minutes to pull ~50 GB of data
● Latency
  ○ NoSQL: 2-3 seconds (subset of data)
  ○ Replication (Kafka to Kafka): 700 milliseconds (all the data)

Page 32: Ingesting Healthcare Data, Micah Whitacre

http://engineering.cerner.com/