Designing a Real-Time Data Ingestion Pipeline

Badar Ahmed

About Us

DataScience Inc.
▪ Data Science as a service
▪ Customers from Sonos to Belkin
▪ Ranked #1 among "Best Places to Work in Los Angeles for 2015"
▪ Visit datascience.com!


Badar Ahmed
▪ Software Engineer
▪ Background in high performance computing & cloud computing
▪ Work across the stack on Big Data problems

http://datascience.com

Importance of Data Ingestion

▪ Data ingestion is a precursor to any analysis
▪ Characteristics:
  ▪ Reliable
  ▪ Correct
  ▪ Fast
  ▪ Scalable


Types of Data Ingestion

▪ Broad topic with many different architectural patterns

▪ Real Time
▪ Batch

▪ Structured Data
▪ Unstructured Data


Ingestion Evolution @ DataScience


▪ Legacy API existed
▪ But ...
  ✦ Expensive
  ✦ Ops Heavy
  ✦ Hard to scale
  ✦ No batch interface


What was needed

▪ Scalable ingestion system
▪ Batch Ingest
▪ Lower Ops and $$$ Cost


Idea #1

▪ Asynchronous API
▪ Queue requests and process them later

Pros:
▪ Fast
▪ Scalable
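The queue-and-process-later idea can be sketched roughly as follows. This is only an illustration, not the design used at DataScience: the in-memory `pending` queue, the `datastore` and `status` dictionaries, and the `submit`, `worker`, and `run_demo` names are all hypothetical stand-ins for a durable queue, a real datastore, and an API handler.

```python
import queue
import threading
import uuid

# Hypothetical in-memory stand-ins for a durable queue and a datastore.
pending = queue.Queue()
datastore = {}
status = {}

def submit(record):
    """API handler: enqueue the record and return a tracking id immediately."""
    request_id = str(uuid.uuid4())
    status[request_id] = "queued"
    pending.put((request_id, record))
    return request_id

def worker():
    """Background consumer: take requests off the queue and write them later."""
    while True:
        item = pending.get()
        if item is None:  # sentinel value: stop the worker
            pending.task_done()
            break
        request_id, record = item
        try:
            datastore[request_id] = record  # placeholder for a real write
            status[request_id] = "done"
        except Exception:
            status[request_id] = "failed"
        finally:
            pending.task_done()

def run_demo():
    """Submit three records, wait for the worker to drain them, return their ids."""
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    ids = [submit({"event": i}) for i in range(3)]
    pending.join()     # block until every queued request is processed
    pending.put(None)  # ask the worker to exit
    thread.join()
    return ids
```

Note that the caller only gets a tracking id back and has to look up `status` later to learn whether the write succeeded.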


Issues with Idea #1

▪ Failure introduces complexity
▪ Decoupled systems can be more difficult to debug
▪ UX is poorer if users need to keep track of async requests
▪ A large deviation from the simpler API model of ConnectHQ


Idea #2

▪ Synchronous Batch, so ...
  ✦ UX remains the same

▪ Use concurrency to do parallel writes to the datastore
  ✦ Caveat: Concurrent code is difficult to write & debug
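A minimal sketch of a synchronous batch call that fans out writes concurrently might look like the following. It assumes a thread pool from Python's standard library and uses a plain dictionary as a stand-in datastore; `write_record` and `ingest_batch` are hypothetical names, and the slides do not say which concurrency mechanism or datastore was actually used.

```python
from concurrent.futures import ThreadPoolExecutor

def write_record(store, record):
    """Hypothetical single-record write; a real datastore client call would go here."""
    store[record["id"]] = record
    return record["id"]

def ingest_batch(store, records, max_workers=8):
    """Write all records in parallel and return only after every write has finished."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(write_record, store, r) for r in records]
        results = []
        for record, future in zip(records, futures):
            try:
                results.append({"id": future.result(), "ok": True})
            except Exception as exc:
                # Report the failure per record instead of failing the whole batch.
                results.append({"id": record.get("id"), "ok": False, "error": str(exc)})
    return results
```

Because the call blocks until all futures resolve, the caller sees an ordinary request and response, with one result per record in the order submitted.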


First Step: Prototype


Integration Testing


Unit Testing with Mocks
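The content of this slide did not survive extraction. As a generic illustration only, a unit test that replaces the datastore client with a mock could look like this; the `ingest` function and the client's `put` method are hypothetical and are not taken from the talk.

```python
from unittest.mock import Mock

def ingest(client, records):
    """Hypothetical function under test: write each record through the client."""
    written = 0
    for record in records:
        client.put(record)
        written += 1
    return written

def test_ingest_writes_every_record():
    client = Mock()  # stands in for a real datastore client
    assert ingest(client, [{"id": 1}, {"id": 2}]) == 2
    assert client.put.call_count == 2
    client.put.assert_called_with({"id": 2})  # checks the most recent call

def test_ingest_propagates_write_errors():
    client = Mock()
    client.put.side_effect = ConnectionError("datastore down")
    try:
        ingest(client, [{"id": 1}])
    except ConnectionError:
        return True
    raise AssertionError("expected ConnectionError")
```

The mock lets the tests check both the happy path and a simulated datastore failure without a running datastore.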


More Testing


Test & Refactor Cycle


Questions?


Thank you.

Development


Operations & Monitoring


Batch Data Loading


Technology
