Designing Real-Time Data Ingestion Pipeline
Badar Ahmed
About Us
DataScience Inc.
▪ Data Science as a service
▪ Customers from Sonos to Belkin
▪ Ranked #1 among "Best Places to Work in Los Angeles for 2015"
▪ Visit datascience.com!
Badar Ahmed
▪ Software Engineer
▪ Background in high performance computing & cloud computing
▪ Work across the stack on Big Data problems
Importance of Data Ingestion
▪ Data ingestion is a precursor to any analysis
▪ Characteristics:
✦ Reliability
✦ Correctness
✦ Speed
✦ Scalability
Types of Data Ingestion
▪ Broad topic with many different architectural patterns
▪ Real Time
▪ Batch
▪ Structured Data
▪ Unstructured Data
Ingestion Evolution @ DataScience
▪ Legacy API existed
▪ But ..
✦ Expensive
✦ Ops Heavy
✦ Hard to scale
✦ No batch interface
What was needed
▪ Scalable ingestion system
▪ Batch Ingest
▪ Lower Ops and $$$ Cost
Idea #1
▪ Asynchronous API
▪ Queue requests and process them later
Pros:
▪ Fast
▪ Scalable
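A minimal sketch of the queue-and-process-later idea, using only the Python standard library. Everything here is hypothetical and for illustration: `submit`, `worker`, the in-memory `pending` queue, the `status` table, and the list standing in for the datastore are assumptions, not the system described in the talk. A production version would likely use a durable queue and a real datastore client.

```python
import queue
import threading
import uuid

# Hypothetical in-memory stand-ins for a durable queue and a datastore.
pending = queue.Queue()
status = {}   # request id -> "queued" | "done" | "failed"
store = []    # pretend datastore

def submit(record):
    """Accept a record, enqueue it, and return a tracking id right away."""
    request_id = str(uuid.uuid4())
    status[request_id] = "queued"
    pending.put((request_id, record))
    return request_id

def worker():
    """Drain the queue and write each record to the datastore."""
    while True:
        item = pending.get()
        if item is None:          # sentinel value: stop the worker
            pending.task_done()
            break
        request_id, record = item
        try:
            store.append(record)  # the slow write happens off the request path
            status[request_id] = "done"
        except Exception:
            status[request_id] = "failed"
        finally:
            pending.task_done()

def run_demo():
    """Submit three records, then wait for the worker to finish them."""
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    ids = [submit({"event": i}) for i in range(3)]
    pending.put(None)
    pending.join()
    thread.join()
    return ids
```

The caller gets an id back immediately and has to look up `status[request_id]` later to learn the outcome, which is the extra tracking burden this design places on users.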
Issues with Idea #1
▪ Failure introduces complexity
▪ Decoupled systems can be more difficult to debug
▪ UX is poorer if users need to keep track of async requests
▪ Large deviation from the simpler API model of ConnectHQ
Idea #2
▪ Synchronous Batch so ..
✦ UX remains the same
▪ Use Concurrency to do parallel writes to datastore
✦ Caveat: Concurrent code is difficult to write & debug
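One way the synchronous batch with parallel writes could look, sketched with Python's `concurrent.futures`. The names `write_record` and `ingest_batch`, the dict used as a datastore, and the `max_workers` default are all assumptions made for illustration; the talk does not say which language or datastore was used.

```python
from concurrent.futures import ThreadPoolExecutor

def write_record(datastore, record):
    """Hypothetical single-record write; a real one would call a datastore client."""
    if "id" not in record:
        raise ValueError("record is missing 'id'")
    datastore[record["id"]] = record
    return record["id"]

def ingest_batch(datastore, records, max_workers=8):
    """Write all records in parallel, block until every write finishes,
    and return one result per record in the original order."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(write_record, datastore, r) for r in records]
        for record, future in zip(records, futures):
            try:
                results.append({"id": future.result(), "ok": True})
            except Exception as exc:
                # A failed write is reported in the response, not lost.
                results.append({"record": record, "ok": False, "error": str(exc)})
    return results
```

Because `ingest_batch` returns only after all writes complete, the caller sees an ordinary request and response, and per-record failures come back in the same response instead of requiring a later status lookup.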
First Step: Prototype
Integration Testing
Unit Testing with Mocks
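As an illustration of the general technique, here is a small sketch using Python's `unittest.mock`. The `ingest` function and the client's `put` method are hypothetical, chosen only to show how a mock can replace the datastore so that failure paths are testable without a live service.

```python
from unittest.mock import Mock

def ingest(client, records):
    """Hypothetical function under test: write each record, count successes."""
    written = 0
    for record in records:
        try:
            client.put(record)
            written += 1
        except ConnectionError:
            pass  # skip the failed record and keep going
    return written

def test_ingest_counts_only_successful_writes():
    client = Mock()
    # First and third writes succeed; the second raises.
    client.put.side_effect = [None, ConnectionError("down"), None]
    assert ingest(client, [{"id": 1}, {"id": 2}, {"id": 3}]) == 2
    assert client.put.call_count == 3
```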
More Testing
Test & Refactor Cycle
Questions?
Thank you.
Development
Operations & Monitoring
Batch Data Loading