Building Interactive Distributed Processing Applications at a Global Scale
Shadi A. Noghabi
Prelim Committee: Prof. Roy Campbell (advisor), Prof. Indy Gupta (advisor), Prof. Klara Nahrstedt, Dr. Victor Bahl
Many Interactive Applications
• 100s of PBs of data, needing < a few seconds
• Trillions of events, needing < a few minutes
• Millions to billions of devices, needing < a few milliseconds
Low Latency at Large Scale
• Latency-sensitive: achieving low latency is essential for interactivity
  100 ms of latency (Amazon) → 1% loss in revenue, ~$4M per millisecond!*
• Large-scale: processing at large and global scales
My focus: building interactive systems at a production scale
* https://blog.gigaspaces.com/amazon-found-every-100ms-of-latency-cost-them-1-in-sales/
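To see where the per-millisecond figure comes from (a back-of-the-envelope check; the ~$40B annual revenue used below is my assumption, chosen to make the slide's numbers consistent, not a figure from the talk):

\[
\frac{1\% \times \$40\,\text{B/year}}{100\ \text{ms}} \;=\; \frac{\$400\,\text{M/year}}{100\ \text{ms}} \;\approx\; \$4\,\text{M per millisecond of added latency.}
\]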
At Scale, Low Latency is Challenging
• Heterogeneity: data, machines, operations
• Dynamism: workload, environment
• Distribution
• State & replication
• Imbalance
Thesis Statement
Latency-driven designs can be used to build production infrastructures for interactive applications at a global scale, while addressing myriad challenges of heterogeneity, dynamism, state, and imbalance.
Latency-driven Design
Achieving low latency has the highest priority.
Latency trades off against other guarantees (PACELC):
• Consistency vs. availability vs. latency
• Latency vs. throughput
• Latency vs. isolation
1. Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design", Computer 2012
2. Gilbert and Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services", SIGACT News 2002
Related Work: Latency-Driven Designs in Other Frameworks
• Trend in NoSQL systems: Cassandra, Amazon's DynamoDB, Pileus
Latency-driven does not fit all:
• Consistency-driven:
  – Consistent storage: BigTable, Spanner, Yahoo's PNUTS, Espresso
  – Exactly-once stream processing: Flink, MillWheel, Spark Streaming, Trident
• Throughput-driven:
  – Batch processing: Hadoop, Spark, Tez, Pig, and Hive
  – Micro-batch streaming: Spark Streaming, media processing
  – High-throughput storage: HDFS and GFS
Latency-driven Techniques
• Background processing
• Tiered data access
• Load balancing
• Adaptive scaling
• Partitioning & parallelism
• Opportunistic processing
• Locality
Latency-driven at All Layers (edge devices, datacenters DC 1 … DC n)
• Storage: Ambry [SIGMOD'16] — main production system for >4 years & 500M users at LinkedIn
• Processing: Samza [VLDB'17] — used at ~15 companies, with 100s of applications at LinkedIn; Steel [HotCloud'18] — integrated with production Azure and Windows IoT
• Networking: FreeFlow [HotNets'16] — general container networking applicable to Docker, Kubernetes, etc.
Latency-driven Techniques — first up: Samza [VLDB'17] (processing)
Samza: Stateful Scalable Stream Processing at LinkedIn
VLDB’17
Shadi A. Noghabi*, Kartik Paramasivam^, Yi Pan^, Navina Ramesh^, Jon Bringhurst^, Indranil Gupta*, Roy Campbell*
* University of Illinois at Urbana-Champaign    ^ LinkedIn Corp.
Stream Processing*
• Interactive feeds or ads
• Security & monitoring
• Internet of Things
* Stream processing is the processing of data in motion, i.e., computing on data immediately as it is produced.
Stream Applications Need State
State: durable data read/written along with the processing
Many cases need large state:
• Aggregation: aggregating counts over a window
• Join: two (or more) streams over a window
• Other: e.g., a machine learning model
Example: a DoSIdentifier task consumes an Access Stream (IP, access time), keeps a per-IP access count as state, and emits a DoS Stream (attacker IP):
  IP        Count
  1.1.1.1   2
  2.3.2.1   30
Scale of Processing
Large scale of input, in Kafka alone:
• 2.1 trillion msgs/day, 16 million msgs/sec at peak
• 0.5 PB in, 2 PB out per day
Many applications:
• 100s of production applications
• Over 10,000 containers
Large scale of state:
• Several 100s of TBs for a single application
Stateful Stream Processing
• Open-source Apache project
• Powers hundreds of apps in LinkedIn’s production
• In use at LinkedIn, Uber, Metamarkets, Netflix, Intuit, TripAdvisor, VMware, Optimizely, Redfin, etc.
Need to handle state with low latency, at large scale, and with correct (consistent) results
http://samza.apache.org/
Stateful Stream Processing
Processing Model:
• Input partitioning
• Parallel and independent tasks
• Locality-aware data passing
Stateful:
• Local state
• Incremental checkpointing
• 3-tier caching
• Parallel recovery
• Host stickiness
• Compaction
Processing from a Partitioned Source
• A stream of events (e.g., page accesses, ad views) is split into Partition 1, Partition 2, …, Partition k
• Each partition is consumed by its own independent task: Task 1, Task 2, …, Task k
Technique: partitioning & parallelism
Multi-Stage Dataflow
Application logic: count the number of 'page accesses' for each IP domain in a 5-minute window
Application DAG in tasks: Page Access (input stream) → Window count → Filter → SendTo → DoS Stream (output stream)
Data parallelism: each task performs the entire pipeline on its own chunk of the input data (see the sketch below)
Technique: locality
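To illustrate this data-parallel pipeline, here is a minimal self-contained sketch (my own illustration, not Samza's API; the threshold and names are hypothetical) where one task runs the whole window-count → filter → sendTo pipeline over its partition:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One task instance per input partition; each runs the full pipeline
// (window count -> filter -> sendTo) on its own chunk of the data.
public class WindowCountTask {
  private static final long THRESHOLD = 10_000;          // hypothetical filter threshold
  private final Map<String, Long> windowCounts = new HashMap<>();

  // Called for every 'page access' event in this task's partition.
  public void process(String ipDomain) {
    windowCounts.merge(ipDomain, 1L, Long::sum);         // Window-count stage
  }

  // Called when the 5-minute window closes.
  public void onWindowEnd(List<String> dosOutStream) {
    for (Map.Entry<String, Long> e : windowCounts.entrySet()) {
      if (e.getValue() > THRESHOLD) {                    // Filter stage
        dosOutStream.add(e.getKey());                    // SendTo stage (DoS out stream)
      }
    }
    windowCounts.clear();                                // start the next window
  }
}
```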
Stateful Processing
Goal: store the count of accesses per IP domain (state: IP access count) across Task 1 … Task k of a Samza application.
Naive option: a remote store shared by all tasks. But using a remote store is slow — a network round trip per input message.
Stateful Processing
Local state: use the local storage of tasks
• Each task keeps the state corresponding to its own partition
  Ex: task 1, processing inputs [1-100], holds state [1-100]; the other tasks hold [100-200], …, [900-1000]
Technique: locality
A minimal task sketch follows below.
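This sketch uses Samza's classic low-level API (StreamTask/InitableTask and the local KeyValueStore are real Samza interfaces; the store name, output stream, and threshold are my assumptions):

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class DosIdentifierTask implements StreamTask, InitableTask {
  private static final int THRESHOLD = 1000;               // assumed attack threshold
  private static final SystemStream DOS_STREAM =
      new SystemStream("kafka", "dos-attackers");          // assumed output stream

  private KeyValueStore<String, Integer> counts;           // local, per-partition state

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // Declared in the job config; holds only this task's partition of the state.
    counts = (KeyValueStore<String, Integer>) context.getStore("ip-counts");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String ip = (String) envelope.getKey();
    Integer current = counts.get(ip);
    int updated = (current == null ? 0 : current) + 1;
    counts.put(ip, updated);                               // local write, no remote RPC
    if (updated > THRESHOLD) {
      collector.send(new OutgoingMessageEnvelope(DOS_STREAM, ip));
    }
  }
}
```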
Stateful Processing
Local storage options:
• In-memory: fast, but low capacity (GBs)
• On-disk: high capacity (TBs) and persistent, but slow
• Disk + memory: disk as the persistent store, memory as a cache — fast, high capacity, and persistent
Technique: tiered data access (a sketch follows below)
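A minimal sketch of the disk + memory tier (my own illustration; Samza uses RocksDB with a cache rather than this toy class): a bounded LRU cache in front of a persistent store.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Memory tier (bounded LRU cache) in front of a persistent disk tier.
public class TieredStore<K, V> {
  public interface DiskStore<K, V> { V get(K key); void put(K key, V value); }

  private final DiskStore<K, V> disk;   // persistent, TBs, slower
  private final Map<K, V> cache;        // fast, GBs, bounded

  public TieredStore(DiskStore<K, V> disk, int cacheCapacity) {
    this.disk = disk;
    this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > cacheCapacity;  // evict the least-recently-used entry
      }
    };
  }

  public V get(K key) {
    V value = cache.get(key);           // memory tier first
    if (value == null && (value = disk.get(key)) != null) {
      cache.put(key, value);            // promote a disk hit into the cache
    }
    return value;
  }

  public void put(K key, V value) {
    cache.put(key, value);
    disk.put(key, value);               // write-through, for simplicity
  }
}
```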
Stateful Processing
• But what about failures? How do we avoid losing the local state?
Fault-Tolerance
• State changes (e.g., X = 10) are saved to a durable changelog: a log-compacted Kafka topic, with one partition per task
• Changes are periodically batched and flushed in the background
• On a failure, recovery replays the changelog to rebuild the task's state (a replay sketch follows below)
Technique: background processing
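A minimal sketch of the recovery path (my own illustration, with assumed names): replay the task's changelog partition from the beginning; because the topic is log-compacted, it holds roughly one latest entry per key.

```java
import java.util.Map;

public final class ChangelogRecovery {
  // One changelog record: value == null encodes a delete (tombstone).
  public record ChangeEntry<K, V>(K key, V value) {}

  public static <K, V> void restore(Iterable<ChangeEntry<K, V>> changelogPartition,
                                    Map<K, V> store) {
    for (ChangeEntry<K, V> entry : changelogPartition) {
      if (entry.value() == null) {
        store.remove(entry.key());               // tombstone: key was deleted
      } else {
        store.put(entry.key(), entry.value());   // last write wins
      }
    }
  }
}
```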
But What is the Cost? (latency vs. consistency vs. availability)
• Consistency: preserved, given replayable input + a consistent snapshot
• Availability: compromised — longer failure recovery
Consistency Guarantees
• State changes (e.g., X = 10) are flushed to the changelog topic, and input positions (e.g., offset 1005) are committed to an offsets topic, both in Kafka
• Flushing state before committing the offset yields a consistent snapshot: replaying input from the committed offset on top of the restored state never loses updates (sketched below)
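The ordering that makes the snapshot consistent can be stated in a few lines (a sketch of the invariant in my own phrasing, not Samza's code):

```java
public final class Checkpointer {
  public interface Changelog { void flush(); }            // durably flush pending state changes
  public interface OffsetStore { void commit(long off); } // durably record the input position

  // Flush state BEFORE committing the offset: after a crash, input is replayed
  // from the last committed offset on top of state that already reflects
  // everything before it (at-least-once; duplicates are possible, data loss is not).
  public static void checkpoint(Changelog changelog, OffsetStore offsets, long inputOffset) {
    changelog.flush();
    offsets.commit(inputOffset);
  }
}
```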
Availability Optimizations
• Availability is compromised; to improve it we use:
  – Parallel recovery
  – Background compaction
  – Host stickiness
Techniques: background processing, locality, partitioning & parallelism
Fast Restarts with Host Stickiness
• A durable task-container-host mapping is kept for the job (e.g., Task 1, Task 4 → Host-A; Task 2 → Host-B; Task 3 → Host-C)
• Host stickiness in YARN: try to place each task on the same host after a restart, minimizing the state-rebuilding overhead (a sketch follows below)
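A minimal sketch of sticky placement (my own illustration; the real YARN integration is more involved): prefer each task's previous host so its on-disk store can be reused, falling back to round-robin otherwise.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class StickyAssigner {
  // Assumes at least one live host; capacity constraints are omitted for brevity.
  public static Map<String, String> assign(List<String> tasks,
                                           Map<String, String> lastHost,  // durable mapping
                                           List<String> liveHosts) {
    Map<String, String> assignment = new HashMap<>();
    int next = 0;
    for (String task : tasks) {
      String preferred = lastHost.get(task);
      if (preferred != null && liveHosts.contains(preferred)) {
        assignment.put(task, preferred);   // sticky: reuse the local state on disk
      } else {
        assignment.put(task, liveHosts.get(next++ % liveHosts.size())); // fallback
      }
    }
    return assignment;
  }
}
```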
Related Work
• Stateless: do not support state
  – Apache Storm [SIGMOD'14], Heron [SIGMOD'15], S4 [ICDM'10]
• Remote store: rely on external storage
  – Trident, MillWheel [VLDB'13], Dataflow [VLDB'15]
  – High latency overhead per input message, but fast recovery (better availability)
• Local state, but full-state checkpointing:
  – Flink [TCDE'15], Spark Streaming [HotCloud'12], IBM Streams [VLDB'16], StreamScope [NSDI'16], System S [DEBS'11]
  – Does not scale for large application state, but easier "repeatable results" on failure
Evaluation
Evaluation Setup
• Production cluster: 500-node YARN cluster, real-world applications
• Small cluster: 8 nodes (64 GB RAM, 24-core CPUs, a 1.6 TB SSD), micro-benchmarks
  – Read-only workload ~ join with a table
  – Read-write workload ~ aggregation over time
Local State — Latency
Setup — stores: in-mem, on-disk, disk-mem, remote; fault-tolerance: none, changelog (Clog), remote
• The changelog adds minimal overhead
• On-disk with caching is comparable with in-memory
• Remote state is > 2 orders of magnitude slower than local state
Failure Recovery
• Recovery time grows linearly with the size of state
• Almost constant overhead with host stickiness
• Overhead is independent of the % of failures (parallel recovery)
Summary of State in Samza
Pros:
• 100x better latency
• Better resource utilization — no need for remote DB resources
• Parallel and independent processing
Cons:
• Dependent on input partitioning; does NOT work when state is large and not co-partitionable with the input stream
• Auto-scaling becomes harder
• Recovery is slower (vs. remote)
Conclusion
Processing model: input partitioning, parallel and independent tasks, locality-aware data passing
Stateful: local state, incremental checkpointing, 3-tier caching, parallel recovery, host stickiness, compaction
Techniques: background processing, locality, partitioning & parallelism
Samza handles state with low latency, at large scale, and with correct (consistent) results.
Latency-driven Techniques — next: Steel [HotCloud'18] (processing at the edge)
Steel: Unified and Optimized Edge-Cloud Environment
HotCloud’18
Shadi A. Noghabi*, John Kolb^, Peter Bodik**, Eduardo Cuervo**, Indranil Gupta*, Roy Campbell*
* University of Illinois at Urbana-Champaign
^ University of California, Berkeley
** Microsoft Research
The Cloud is Not a One-Size-Fits-All
• Latency: applications need < 80 ms, < 20 ms, even < 10 ms responses, but the cloud is simply too far — > 70 ms round-trip time
• Resources: not enough bandwidth (sources produce 10s-100s of GB/s), battery, energy, or heating headroom to ship everything to the cloud
• Privacy & security
Edge Computing
[Figure: the cloud connected to multiple edges — Edge 1, Edge 2, Edge 3]
Satyanarayanan, M., Bahl, P., et al., "The Case for VM-based Cloudlets in Mobile Computing", Pervasive Computing 2009
However… industry is in its infancy in building edge-cloud applications:
• No proper unification of edge & cloud
  – Hard to configure, deploy, and monitor
  – Manual and redundant optimizations
Steel: a unified edge-cloud framework with modular and automated optimizations
Integrated with production Azure services
[Architecture: applications are written against an abstraction (a logical spec); Steel compiles, deploys, and monitors & analyzes them on the edge-cloud fabric, applying pluggable optimization modules — placement, communication, load balancing, …]
Optimizer Modules
• Placement: where (edge/cloud) to place? Adapts to long-term changes (see the sketch below)
• Communication: configure edge-cloud links; adapts to short-term spikes
How:
• Location and vicinity aware
• Background profiling & what-if analysis
• Background batching & compression
• Partitioned and parallel optimizations
• Change strategy opportunistically (based on available resources)
Techniques: background processing, locality, opportunistic processing, partitioning & parallelism
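A hedged sketch of a placement decision (my own illustration, not Steel's actual algorithm; all thresholds are assumptions): place a component at the edge when the cloud cannot meet its latency bound or the uplink cannot carry its input.

```java
public final class PlacementModule {
  public enum Site { EDGE, CLOUD }

  public static Site place(double latencyBoundMs, double cloudRttMs,
                           double inputRateMBps, double uplinkBudgetMBps,
                           double neededCores, double edgeFreeCores) {
    boolean cloudTooFar = cloudRttMs > latencyBoundMs;         // latency constraint
    boolean uplinkTooSmall = inputRateMBps > uplinkBudgetMBps; // bandwidth constraint
    boolean edgeFits = neededCores <= edgeFreeCores;           // edge capacity constraint
    if ((cloudTooFar || uplinkTooSmall) && edgeFits) {
      return Site.EDGE;
    }
    return Site.CLOUD;  // default to the cloud's elastic resources
  }
}
```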
Latency-driven Techniques — next: Ambry [SIGMOD'16] (storage)
Ambry: LinkedIn’s Scalable Geo-Distributed Object Store
SIGMOD ‘16
Shadi A. Noghabi*, Sriram Subramanian+, Priyesh Narayanan+, Sivabalan Narayanan+, Gopalakrishna Holla+, Mammad Zadeh+, Tianwei Li+, Indranil Gupta*, Roy H. Campbell*
* University of Illinois at Urbana-Champaign (SRG: http://srg.cs.illinois.edu, DPRG: http://dprg.cs.uiuc.edu)
+ LinkedIn Corporation, Data Infrastructure team
Massive Media Objects are Everywhere
• 100s of millions of users
• 100s of TBs to PBs of data
• From all around the world
How to handle all these objects? We need a geographically distributed system that stores and retrieves objects in a low-latency and scalable manner.
Ambry: in production for >4 years and >500M users!
Open source: https://github.com/linkedin/ambry
Geo-Distribution
Asynchronous writes (e.g., Alice uploads item_i to Datacenter 1, which later appears in Datacenters 2 and 3):
• Write synchronously only to the local datacenter, counting successful local replica writes
• Asynchronously replicate to the other datacenters in the background
• Reduces user-perceived latency
• Minimizes cross-DC traffic
Techniques: locality, background processing (a sketch of the write path follows below)
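A minimal sketch of this write path (my own illustration, not Ambry's code; the local quorum size is an assumption): acknowledge once enough local replicas succeed, and replicate across datacenters in the background.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class GeoWriter {
  public interface Replica { boolean put(String id, byte[] blob); }

  private final List<Replica> localReplicas;     // replicas in this DC
  private final List<Replica> remoteReplicas;    // replicas in other DCs
  private final int localQuorum;                 // e.g., 2 of 3 (assumed)
  private final ExecutorService background = Executors.newSingleThreadExecutor();

  public GeoWriter(List<Replica> local, List<Replica> remote, int quorum) {
    this.localReplicas = local;
    this.remoteReplicas = remote;
    this.localQuorum = quorum;
  }

  public boolean put(String id, byte[] blob) {
    int successes = 0;
    for (Replica r : localReplicas) {
      if (r.put(id, blob)) successes++;          // synchronous local writes only
    }
    if (successes < localQuorum) return false;   // not durable enough locally
    background.submit(() ->                      // async cross-DC replication
        remoteReplicas.forEach(r -> r.put(id, blob)));
    return true;                                 // user-perceived latency: local only
  }
}
```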
Other Techniques
Ambry is a large production system with many other techniques:
• Logical data partitioning
• Caching, indexing, Bloom filters
• Background load balancing
• …
Techniques: tiered data access, partitioning & parallelism, load balancing
Latency-driven Techniques — next: FreeFlow [HotNets'16] (networking)
FreeFlow: High performance container networking
HotNets ‘16
Tianlong Yu^, Shadi A. Noghabi*, Shachar Raindel†, Hongqiang Liu†, Jitu Padhye†, and Vyas Sekar^
* University of Illinois at Urbana-Champaign    † Microsoft Research    ^ Carnegie Mellon University
Containers are Popular
Containers provide good portability and isolation.
https://clusterhq.com/assets/pdfs/state-of-container-usage-june-2016.pdf

FreeFlow
[Figure: FreeFlow runs on each node (Node 1, Node 2); containers on the same node communicate over shared memory, and containers on different nodes over RDMA, instead of going through the virtual network]
Techniques: locality, opportunistic processing (a transport-selection sketch follows below)
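A hedged sketch (not FreeFlow's implementation) of opportunistic transport selection: same-host peers use shared memory, cross-host peers use RDMA when a capable NIC is present, otherwise fall back to TCP.

```java
public final class TransportSelector {
  public enum Transport { SHARED_MEMORY, RDMA, TCP }

  public static Transport pick(String srcHost, String dstHost, boolean rdmaNicAvailable) {
    if (srcHost.equals(dstHost)) {
      return Transport.SHARED_MEMORY;   // intra-host: zero-copy, lowest latency
    }
    return rdmaNicAvailable ? Transport.RDMA : Transport.TCP; // inter-host
  }
}
```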
Latency-driven Techniques (recap): storage — Ambry [SIGMOD'16]; processing — Samza [VLDB'17], Steel [HotCloud'18]; networking — FreeFlow [HotNets'16]
Future Directions
• One global system working seamlessly across the globe
  – Holistic and cross-layer approach
• Latency sensitivity-aware approaches (not treating all objects/apps equally)
  – E.g., dynamic use of compression, erasure coding, and compaction in storage for old objects
  – E.g., multi-tenant stream-processing environments
• Auto-pilot systems coping with changes automatically and dynamically
Thanks to Many Collaborators
• Advisors: Indranil Gupta and Roy Campbell
• LinkedIn: Sriram Subramanian, Priyesh Narayanan, Kartik Paramasivam, Yi Pan, Navina Ramesh, et al.
• Microsoft Research: Victor Bahl, Peter Bodik, Eduardo Cuervo, Hongqiang Liu, Jitu Padhye, et al.
Publications
1. "Steel: Unified Management and Optimization of Edge-Cloud Applications", SA Noghabi, J Kolb, P Bodik, E Cuervo, I Gupta, RH Campbell (in progress)
2. "Steel: Simplified Development and Deployment of Edge-Cloud Applications", SA Noghabi, J Kolb, P Bodik, E Cuervo, HotCloud 2018
3. "Samza: Stateful Scalable Stream Processing at LinkedIn", SA Noghabi, K Paramasivam, Y Pan, N Ramesh, J Bringhurst, I Gupta, RH Campbell, VLDB 2017
4. "Ambry: LinkedIn's Scalable Geo-Distributed Object Store", SA Noghabi, S Subramanian, P Narayanan, S Narayanan, G Holla, M Zadeh, T Li, I Gupta, RH Campbell, SIGMOD 2016
5. "FreeFlow: High Performance Container Networking", T Yu, SA Noghabi, S Raindel, H Liu, J Padhye, V Sekar, HotNets 2016
6. "Steel: Unified Management and Optimization of Edge-Cloud IoT Applications", NSDI poster 2018
7. "To Edge or Not to Edge?", F Kalim, SA Noghabi, S Verma, SoCC poster 2017
8. "Building a Scalable Distributed Online Media Processing Environment", SA Noghabi, PhD workshop at VLDB 2016
9. "Toward Fabric: A Middleware Implementing High-level Description Languages on a Fabric-like Network", SH Hashemi, SA Noghabi, J Bellessa, RH Campbell, ANCS 2016
10. "Towards Enabling Cooperation Between Scheduler and Storage Layer to Improve Job Performance", M Pundir, J Bellessa, SA Noghabi, CL Abad, RH Campbell, PDSW 2015
Summary of Thesis
Latency-driven designs can be used to build production infrastructures for interactive applications at a global scale.
Techniques: background processing, tiered data access, load balancing, adaptive scaling, partitioning & parallelism, opportunistic processing, locality
Systems: storage — Ambry [SIGMOD'16]; processing — Samza [VLDB'17], Steel [HotCloud'18]; networking — FreeFlow [HotNets'16]
Backup Slides
Theoretical Background
Latency-driven design can be modeled theoretically with queueing theory*
• System: a hierarchical queueing system
• Latency: response time (response_processed − req_enter_time)
• Techniques map to queueing-theory models/techniques:
  – Parallelism & partitioning → multiple queues, parallel servers
  – Adaptive scaling → adding more servers
  – …
* Gross, Donald. Fundamentals of Queueing Theory. John Wiley & Sons, 2008.
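As a concrete instance (standard M/M/1 results, added here for illustration, not from the slides): with Poisson arrivals at rate \(\lambda\) and exponential service at rate \(\mu\), the mean response time of a single queue is

\[
E[T] \;=\; \frac{1}{\mu - \lambda}, \qquad \lambda < \mu .
\]

Partitioning the arrivals across \(k\) parallel queues gives each queue an arrival rate of \(\lambda/k\), so \(E[T] = 1/(\mu - \lambda/k)\); adaptive scaling raises the effective \(\mu\). Both directly reduce the modeled latency.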
Other Techniques: Batch + Stream in One System
• Reprocess parts or all of a stream or database (bugs, upgrades, logic changes)
• → Treat batch as a finite stream, plus conflict resolution, scaling & throttling
Technique: adaptive scaling
Local State — Throughput
• Remote state is 30-150x worse than local state
• On-disk with caching is comparable with in-memory
• The changelog adds minimal overhead

Snapshot vs. Changelog