Building Interactive Distributed Processing Applications at a Global Scale
Shadi A. Noghabi
Prelim Committee: Prof. Roy Campbell (advisor), Prof. Indy Gupta (advisor), Prof. Klara Nahrstedt, Dr. Victor Bahl
Many Interactive Applications
• 100s of PBs of data, needing < a few seconds
• Trillions of events, needing < a few minutes
• Millions to billions of devices, needing < a few milliseconds
Low Latency at Large Scale
• Latency-sensitive: achieving low latency is essential for interactivity
  100 ms of latency (Amazon) → 1% loss in revenue, ~$4M per millisecond!*
• Large-scale: processing at large and global scales
My focus: building interactive systems at a production scale
* https://blog.gigaspaces.com/amazon-found-every-100ms-of-latency-cost-them-1-in-sales/
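To see where the per-millisecond figure comes from (a back-of-the-envelope check; the ~$40B annual revenue used below is my assumption, chosen to make the slide's numbers consistent, not a figure from the talk):

\[
\frac{1\% \times \$40\,\text{B/year}}{100\ \text{ms}} \;=\; \frac{\$400\,\text{M/year}}{100\ \text{ms}} \;\approx\; \$4\,\text{M per millisecond of added latency.}
\]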
At Scale, Low Latency is Challenging
• Heterogeneity: data, machines, operations
• Dynamism: workload, environment
• Distribution
• State & replication
• Imbalance
Thesis Statement
Latency-driven designs can be used to build production infrastructures for interactive applications at a global scale, while addressing myriad challenges of heterogeneity, dynamism, state, and imbalance.
Latency-driven Design
Achieving low latency has the highest priority.
Latency trades off against other guarantees (PACELC):
• Consistency vs. availability vs. latency
• Latency vs. throughput
• Latency vs. isolation
1. Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design", Computer 2012
2. Gilbert and Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services", SIGACT News 2002
Related Work: Latency-Driven Designs in Other Frameworks
• Trend in NoSQL systems: Cassandra, Amazon's DynamoDB, Pileus
Latency-driven does not fit all:
• Consistency-driven:
  – Consistent storage: BigTable, Spanner, Yahoo's PNUTS, Espresso
  – Exactly-once stream processing: Flink, MillWheel, Spark Streaming, Trident
• Throughput-driven:
  – Batch processing: Hadoop, Spark, Tez, Pig, and Hive
  – Micro-batch streaming: Spark Streaming, media processing
  – High-throughput storage: HDFS and GFS
Latency-driven Techniques
• Background processing
• Tiered data access
• Load balancing
• Adaptive scaling
• Partitioning & parallelism
• Opportunistic processing
• Locality
Latency-driven at All Layers (edge devices, datacenters DC 1 … DC n)
• Storage: Ambry [SIGMOD'16] — main production system for >4 years & 500M users at LinkedIn
• Processing: Samza [VLDB'17] — used at ~15 companies, with 100s of applications at LinkedIn; Steel [HotCloud'18] — integrated with production Azure and Windows IoT
• Networking: FreeFlow [HotNets'16] — general container networking applicable to Docker, Kubernetes, etc.
Latency-driven Techniques — first up: Samza [VLDB'17] (processing)
Samza: Stateful Scalable Stream Processing at LinkedIn
VLDB’17
Shadi A. Noghabi*, Kartik Paramasivam^, Yi Pan^, Navina Ramesh^, Jon Bringhurst^, Indranil Gupta*, Roy Campbell*
* University of Illinois at Urbana-Champaign    ^ LinkedIn Corp.
Stream Processing*
• Interactive feeds or ads
• Security & monitoring
• Internet of Things
* Stream processing is the processing of data in motion, i.e., computing on data immediately as it is produced.
Stream Applications Need State
State: durable data read/written along with the processing
Many cases need large state:
• Aggregation: aggregating counts over a window
• Join: two (or more) streams over a window
• Other: e.g., a machine learning model
Example: a DoSIdentifier task consumes an Access Stream (IP, access time), keeps a per-IP access count as state, and emits a DoS Stream (attacker IP):
  IP        Count
  1.1.1.1   2
  2.3.2.1   30
Scale of Processing
Large scale of input, in Kafka alone:
• 2.1 trillion msgs/day, 16 million msgs/sec at peak
• 0.5 PB in, 2 PB out per day
Many applications:
• 100s of production applications
• Over 10,000 containers
Large scale of state:
• Several 100s of TBs for a single application
Stateful Stream Processing
• Open-source Apache project
• Powers hundreds of apps in LinkedIn’s production
• In use at LinkedIn, Uber, Metamarkets, Netflix, Intuit, TripAdvisor, VMware, Optimizely, Redfin, etc.
Need to handle state with low latency, at large scale, and with correct (consistent) results
http://samza.apache.org/
Stateful Stream Processing
Processing Model:
• Input partitioning
• Parallel and independent tasks
• Locality-aware data passing
Stateful:
• Local state
• Incremental checkpointing
• 3-tier caching
• Parallel recovery
• Host stickiness
• Compaction
Processing from a Partitioned Source
• A stream of events (e.g., page accesses, ad views) is split into Partition 1, Partition 2, …, Partition k
• Each partition is consumed by its own independent task: Task 1, Task 2, …, Task k
Technique: partitioning & parallelism
Multi-Stage Dataflow
Application logic: count the number of 'page accesses' for each IP domain in a 5-minute window
Application DAG in tasks: Page Access (input stream) → Window count → Filter → SendTo → DoS Stream (output stream)
Data parallelism: each task performs the entire pipeline on its own chunk of the input data (see the sketch below)
Technique: locality
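To illustrate this data-parallel pipeline, here is a minimal self-contained sketch (my own illustration, not Samza's API; the threshold and names are hypothetical) where one task runs the whole window-count → filter → sendTo pipeline over its partition:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One task instance per input partition; each runs the full pipeline
// (window count -> filter -> sendTo) on its own chunk of the data.
public class WindowCountTask {
  private static final long THRESHOLD = 10_000;          // hypothetical filter threshold
  private final Map<String, Long> windowCounts = new HashMap<>();

  // Called for every 'page access' event in this task's partition.
  public void process(String ipDomain) {
    windowCounts.merge(ipDomain, 1L, Long::sum);         // Window-count stage
  }

  // Called when the 5-minute window closes.
  public void onWindowEnd(List<String> dosOutStream) {
    for (Map.Entry<String, Long> e : windowCounts.entrySet()) {
      if (e.getValue() > THRESHOLD) {                    // Filter stage
        dosOutStream.add(e.getKey());                    // SendTo stage (DoS out stream)
      }
    }
    windowCounts.clear();                                // start the next window
  }
}
```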
Stateful Processing
Goal: store the count of accesses per IP domain (state: IP access count) across Task 1 … Task k of a Samza application.
Naive option: a remote store shared by all tasks. But using a remote store is slow — a network round trip per input message.
Stateful Processing
Local state: use the local storage of tasks
• Each task keeps the state corresponding to its own partition
  Ex: task 1, processing inputs [1-100], holds state [1-100]; the other tasks hold [100-200], …, [900-1000]
Technique: locality
A minimal task sketch follows below.
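This sketch uses Samza's classic low-level API (StreamTask/InitableTask and the local KeyValueStore are real Samza interfaces; the store name, output stream, and threshold are my assumptions):

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class DosIdentifierTask implements StreamTask, InitableTask {
  private static final int THRESHOLD = 1000;               // assumed attack threshold
  private static final SystemStream DOS_STREAM =
      new SystemStream("kafka", "dos-attackers");          // assumed output stream

  private KeyValueStore<String, Integer> counts;           // local, per-partition state

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // Declared in the job config; holds only this task's partition of the state.
    counts = (KeyValueStore<String, Integer>) context.getStore("ip-counts");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String ip = (String) envelope.getKey();
    Integer current = counts.get(ip);
    int updated = (current == null ? 0 : current) + 1;
    counts.put(ip, updated);                               // local write, no remote RPC
    if (updated > THRESHOLD) {
      collector.send(new OutgoingMessageEnvelope(DOS_STREAM, ip));
    }
  }
}
```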
Stateful Processing
Local storage options:
• In-memory: fast, but low capacity (GBs)
• On-disk: high capacity (TBs) and persistent, but slow
• Disk + memory: disk as the persistent store, memory as a cache — fast, high capacity, and persistent
Technique: tiered data access (a sketch follows below)
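A minimal sketch of the disk + memory tier (my own illustration; Samza uses RocksDB with a cache rather than this toy class): a bounded LRU cache in front of a persistent store.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Memory tier (bounded LRU cache) in front of a persistent disk tier.
public class TieredStore<K, V> {
  public interface DiskStore<K, V> { V get(K key); void put(K key, V value); }

  private final DiskStore<K, V> disk;   // persistent, TBs, slower
  private final Map<K, V> cache;        // fast, GBs, bounded

  public TieredStore(DiskStore<K, V> disk, int cacheCapacity) {
    this.disk = disk;
    this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > cacheCapacity;  // evict the least-recently-used entry
      }
    };
  }

  public V get(K key) {
    V value = cache.get(key);           // memory tier first
    if (value == null && (value = disk.get(key)) != null) {
      cache.put(key, value);            // promote a disk hit into the cache
    }
    return value;
  }

  public void put(K key, V value) {
    cache.put(key, value);
    disk.put(key, value);               // write-through, for simplicity
  }
}
```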
Stateful Processing
• But what about failures? How do we avoid losing the local state?
Fault-Tolerance
• State changes (e.g., X = 10) are saved to a durable changelog: a log-compacted Kafka topic, with one partition per task
• Changes are periodically batched and flushed in the background
• On a failure, recovery replays the changelog to rebuild the task's state (a replay sketch follows below)
Technique: background processing
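A minimal sketch of the recovery path (my own illustration, with assumed names): replay the task's changelog partition from the beginning; because the topic is log-compacted, it holds roughly one latest entry per key.

```java
import java.util.Map;

public final class ChangelogRecovery {
  // One changelog record: value == null encodes a delete (tombstone).
  public record ChangeEntry<K, V>(K key, V value) {}

  public static <K, V> void restore(Iterable<ChangeEntry<K, V>> changelogPartition,
                                    Map<K, V> store) {
    for (ChangeEntry<K, V> entry : changelogPartition) {
      if (entry.value() == null) {
        store.remove(entry.key());               // tombstone: key was deleted
      } else {
        store.put(entry.key(), entry.value());   // last write wins
      }
    }
  }
}
```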
But What is the Cost? (latency vs. consistency vs. availability)
• Consistency: preserved, given replayable input + a consistent snapshot
• Availability: compromised — longer failure recovery
Consistency Guarantees
• State changes (e.g., X = 10) are flushed to the changelog topic, and input positions (e.g., offset 1005) are committed to an offsets topic, both in Kafka
• Flushing state before committing the offset yields a consistent snapshot: replaying input from the committed offset on top of the restored state never loses updates (sketched below)
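The ordering that makes the snapshot consistent can be stated in a few lines (a sketch of the invariant in my own phrasing, not Samza's code):

```java
public final class Checkpointer {
  public interface Changelog { void flush(); }            // durably flush pending state changes
  public interface OffsetStore { void commit(long off); } // durably record the input position

  // Flush state BEFORE committing the offset: after a crash, input is replayed
  // from the last committed offset on top of state that already reflects
  // everything before it (at-least-once; duplicates are possible, data loss is not).
  public static void checkpoint(Changelog changelog, OffsetStore offsets, long inputOffset) {
    changelog.flush();
    offsets.commit(inputOffset);
  }
}
```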
Availability Optimizations
• Availability is compromised; to improve it we use:
  – Parallel recovery
  – Background compaction
  – Host stickiness
Techniques: background processing, locality, partitioning & parallelism
Fast Restarts with Host Stickiness
• A durable task-container-host mapping is kept for the job (e.g., Task 1, Task 4 → Host-A; Task 2 → Host-B; Task 3 → Host-C)
• Host stickiness in YARN: try to place each task on the same host after a restart, minimizing the state-rebuilding overhead (a sketch follows below)
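A minimal sketch of sticky placement (my own illustration; the real YARN integration is more involved): prefer each task's previous host so its on-disk store can be reused, falling back to round-robin otherwise.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class StickyAssigner {
  // Assumes at least one live host; capacity constraints are omitted for brevity.
  public static Map<String, String> assign(List<String> tasks,
                                           Map<String, String> lastHost,  // durable mapping
                                           List<String> liveHosts) {
    Map<String, String> assignment = new HashMap<>();
    int next = 0;
    for (String task : tasks) {
      String preferred = lastHost.get(task);
      if (preferred != null && liveHosts.contains(preferred)) {
        assignment.put(task, preferred);   // sticky: reuse the local state on disk
      } else {
        assignment.put(task, liveHosts.get(next++ % liveHosts.size())); // fallback
      }
    }
    return assignment;
  }
}
```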
Related Work
• Stateless: do not support state
  – Apache Storm [SIGMOD'14], Heron [SIGMOD'15], S4 [ICDM'10]
• Remote store: rely on external storage
  – Trident, MillWheel [VLDB'13], Dataflow [VLDB'15]
  – High latency overhead per input message, but fast recovery (better availability)
• Local state, but full-state checkpointing:
  – Flink [TCDE'15], Spark Streaming [HotCloud'12], IBM Streams [VLDB'16], StreamScope [NSDI'16], System S [DEBS'11]
  – Does not scale for large application state, but easier "repeatable results" on failure
Evaluation
Evaluation Setup
• Production cluster: 500-node YARN cluster, real-world applications
• Small cluster: 8 nodes (64 GB RAM, 24-core CPUs, a 1.6 TB SSD), micro-benchmarks
  – Read-only workload ~ join with a table
  – Read-write workload ~ aggregation over time
Local State — Latency
Setup — stores: in-mem, on-disk, disk-mem, remote; fault-tolerance: none, changelog (Clog), remote
• The changelog adds minimal overhead
• On-disk with caching is comparable with in-memory
• Remote state is > 2 orders of magnitude slower than local state
Failure Recovery
• Recovery time grows linearly with the size of state
• Almost constant overhead with host stickiness
• Overhead is independent of the % of failures (parallel recovery)
Summary of State in Samza
Pros:
• 100x better latency
• Better resource utilization — no need for remote DB resources
• Parallel and independent processing
Cons:
• Dependent on input partitioning; does NOT work when state is large and not co-partitionable with the input stream
• Auto-scaling becomes harder
• Recovery is slower (vs. remote)
Conclusion
Processing model: input partitioning, parallel and independent tasks, locality-aware data passing
Stateful: local state, incremental checkpointing, 3-tier caching, parallel recovery, host stickiness, compaction
Techniques: background processing, locality, partitioning & parallelism
Samza handles state with low latency, at large scale, and with correct (consistent) results.
Latency-driven Techniques — next: Steel [HotCloud'18] (processing at the edge)
Steel: Unified and Optimized Edge-Cloud Environment
HotCloud’18
Shadi A. Noghabi*, John Kolb^, Peter Bodik**, Eduardo Cuervo**, Indranil Gupta*, Roy Campbell*
* University of Illinois at Urbana-Champaign
^ University of California, Berkeley
** Microsoft Research
The Cloud is Not a One-Size-Fits-All
• Latency: applications need < 80 ms, < 20 ms, even < 10 ms responses, but the cloud is simply too far — > 70 ms round-trip time
• Resources: not enough bandwidth (sources produce 10s-100s of GB/s), battery, energy, or heating headroom to ship everything to the cloud
• Privacy & security
Edge Computing
[Figure: the cloud connected to multiple edges — Edge 1, Edge 2, Edge 3]
Satyanarayanan, M., Bahl, P., et al., "The Case for VM-based Cloudlets in Mobile Computing", Pervasive Computing 2009
However… industry is in its infancy in building edge-cloud applications:
• No proper unification of edge & cloud
  – Hard to configure, deploy, and monitor
  – Manual and redundant optimizations
Steel: a unified edge-cloud framework with modular and automated optimizations
Integrated with production Azure services
[Architecture: applications are written against an abstraction (a logical spec); Steel compiles, deploys, and monitors & analyzes them on the edge-cloud fabric, applying pluggable optimization modules — placement, communication, load balancing, …]
Optimizer Modules
• Placement: where (edge/cloud) to place? Adapts to long-term changes (see the sketch below)
• Communication: configure edge-cloud links; adapts to short-term spikes
How:
• Location and vicinity aware
• Background profiling & what-if analysis
• Background batching & compression
• Partitioned and parallel optimizations
• Change strategy opportunistically (based on available resources)
Techniques: background processing, locality, opportunistic processing, partitioning & parallelism
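A hedged sketch of a placement decision (my own illustration, not Steel's actual algorithm; all thresholds are assumptions): place a component at the edge when the cloud cannot meet its latency bound or the uplink cannot carry its input.

```java
public final class PlacementModule {
  public enum Site { EDGE, CLOUD }

  public static Site place(double latencyBoundMs, double cloudRttMs,
                           double inputRateMBps, double uplinkBudgetMBps,
                           double neededCores, double edgeFreeCores) {
    boolean cloudTooFar = cloudRttMs > latencyBoundMs;         // latency constraint
    boolean uplinkTooSmall = inputRateMBps > uplinkBudgetMBps; // bandwidth constraint
    boolean edgeFits = neededCores <= edgeFreeCores;           // edge capacity constraint
    if ((cloudTooFar || uplinkTooSmall) && edgeFits) {
      return Site.EDGE;
    }
    return Site.CLOUD;  // default to the cloud's elastic resources
  }
}
```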
Latency-driven Techniques — next: Ambry [SIGMOD'16] (storage)
Ambry: LinkedIn’s Scalable Geo-Distributed Object Store
SIGMOD ‘16
Shadi A. Noghabi*, Sriram Subramanian+, Priyesh Narayanan+, Sivabalan Narayanan+, Gopalakrishna Holla+, Mammad Zadeh+, Tianwei Li+, Indranil Gupta*, Roy H. Campbell*
* University of Illinois at Urbana-Champaign (SRG: http://srg.cs.illinois.edu, DPRG: http://dprg.cs.uiuc.edu)
+ LinkedIn Corporation, Data Infrastructure team
Massive Media Objects are Everywhere
• 100s of millions of users
• 100s of TBs to PBs of data
• From all around the world
How to handle all these objects? We need a geographically distributed system that stores and retrieves objects in a low-latency and scalable manner.
Ambry: in production for >4 years and >500M users!
Open source: https://github.com/linkedin/ambry
Geo-Distribution
Asynchronous writes (e.g., Alice uploads item_i to Datacenter 1, which later appears in Datacenters 2 and 3):
• Write synchronously only to the local datacenter, counting successful local replica writes
• Asynchronously replicate to the other datacenters in the background
• Reduces user-perceived latency
• Minimizes cross-DC traffic
Techniques: locality, background processing (a sketch of the write path follows below)
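A minimal sketch of this write path (my own illustration, not Ambry's code; the local quorum size is an assumption): acknowledge once enough local replicas succeed, and replicate across datacenters in the background.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class GeoWriter {
  public interface Replica { boolean put(String id, byte[] blob); }

  private final List<Replica> localReplicas;     // replicas in this DC
  private final List<Replica> remoteReplicas;    // replicas in other DCs
  private final int localQuorum;                 // e.g., 2 of 3 (assumed)
  private final ExecutorService background = Executors.newSingleThreadExecutor();

  public GeoWriter(List<Replica> local, List<Replica> remote, int quorum) {
    this.localReplicas = local;
    this.remoteReplicas = remote;
    this.localQuorum = quorum;
  }

  public boolean put(String id, byte[] blob) {
    int successes = 0;
    for (Replica r : localReplicas) {
      if (r.put(id, blob)) successes++;          // synchronous local writes only
    }
    if (successes < localQuorum) return false;   // not durable enough locally
    background.submit(() ->                      // async cross-DC replication
        remoteReplicas.forEach(r -> r.put(id, blob)));
    return true;                                 // user-perceived latency: local only
  }
}
```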
Other Techniques
Ambry is a large production system with many other techniques:
• Logical data partitioning
• Caching, indexing, Bloom filters
• Background load balancing
• …
Techniques: tiered data access, partitioning & parallelism, load balancing
Latency-driven Techniques — next: FreeFlow [HotNets'16] (networking)
FreeFlow: High performance container networking
HotNets ‘16
Tianlong Yu^, Shadi A. Noghabi*, Shachar Raindel†, Hongqiang Liu†, Jitu Padhye†, and Vyas Sekar^
* University of Illinois at Urbana-Champaign    † Microsoft Research    ^ Carnegie Mellon University
Containers are Popular
Containers provide good portability and isolation.
https://clusterhq.com/assets/pdfs/state-of-container-usage-june-2016.pdf

FreeFlow
[Figure: FreeFlow runs on each node (Node 1, Node 2); containers on the same node communicate over shared memory, and containers on different nodes over RDMA, instead of going through the virtual network]
Techniques: locality, opportunistic processing (a transport-selection sketch follows below)
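A hedged sketch (not FreeFlow's implementation) of opportunistic transport selection: same-host peers use shared memory, cross-host peers use RDMA when a capable NIC is present, otherwise fall back to TCP.

```java
public final class TransportSelector {
  public enum Transport { SHARED_MEMORY, RDMA, TCP }

  public static Transport pick(String srcHost, String dstHost, boolean rdmaNicAvailable) {
    if (srcHost.equals(dstHost)) {
      return Transport.SHARED_MEMORY;   // intra-host: zero-copy, lowest latency
    }
    return rdmaNicAvailable ? Transport.RDMA : Transport.TCP; // inter-host
  }
}
```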
Latency-driven Techniques (recap): storage — Ambry [SIGMOD'16]; processing — Samza [VLDB'17], Steel [HotCloud'18]; networking — FreeFlow [HotNets'16]
Future Directions
• One global system working seamlessly across the globe
  – Holistic and cross-layer approach
• Latency sensitivity-aware approaches (not treating all objects/apps equally)
  – E.g., dynamic use of compression, erasure coding, and compaction in storage for old objects
  – E.g., multi-tenant stream-processing environments
• Auto-pilot systems coping with changes automatically and dynamically
Thanks to Many Collaborators
• Advisors: Indranil Gupta and Roy Campbell
• LinkedIn: Sriram Subramanian, Priyesh Narayanan, Kartik Paramasivam, Yi Pan, Navina Ramesh, et al.
• Microsoft Research: Victor Bahl, Peter Bodik, Eduardo Cuervo, Hongqiang Liu, Jitu Padhye, et al.
Publications
1. "Steel: Unified Management and Optimization of Edge-Cloud Applications", SA Noghabi, J Kolb, P Bodik, E Cuervo, I Gupta, RH Campbell (in progress)
2. "Steel: Simplified Development and Deployment of Edge-Cloud Applications", SA Noghabi, J Kolb, P Bodik, E Cuervo, HotCloud 2018
3. "Samza: Stateful Scalable Stream Processing at LinkedIn", SA Noghabi, K Paramasivam, Y Pan, N Ramesh, J Bringhurst, I Gupta, RH Campbell, VLDB 2017
4. "Ambry: LinkedIn's Scalable Geo-Distributed Object Store", SA Noghabi, S Subramanian, P Narayanan, S Narayanan, G Holla, M Zadeh, T Li, I Gupta, RH Campbell, SIGMOD 2016
5. "FreeFlow: High Performance Container Networking", T Yu, SA Noghabi, S Raindel, H Liu, J Padhye, V Sekar, HotNets 2016
6. "Steel: Unified Management and Optimization of Edge-Cloud IoT Applications", NSDI poster 2018
7. "To Edge or Not to Edge?", F Kalim, SA Noghabi, S Verma, SoCC poster 2017
8. "Building a Scalable Distributed Online Media Processing Environment", SA Noghabi, PhD workshop at VLDB 2016
9. "Toward Fabric: A Middleware Implementing High-level Description Languages on a Fabric-like Network", SH Hashemi, SA Noghabi, J Bellessa, RH Campbell, ANCS 2016
10. "Towards Enabling Cooperation Between Scheduler and Storage Layer to Improve Job Performance", M Pundir, J Bellessa, SA Noghabi, CL Abad, RH Campbell, PDSW 2015
Summary of Thesis
Latency-driven designs can be used to build production infrastructures for interactive applications at a global scale.
Techniques: background processing, tiered data access, load balancing, adaptive scaling, partitioning & parallelism, opportunistic processing, locality
Systems: storage — Ambry [SIGMOD'16]; processing — Samza [VLDB'17], Steel [HotCloud'18]; networking — FreeFlow [HotNets'16]
Backup Slides
Theoretical Background
Latency-driven design can be modeled theoretically with queueing theory*
• System: a hierarchical queueing system
• Latency: response time (response_processed − req_enter_time)
• Techniques map to queueing-theory models/techniques:
  – Parallelism & partitioning → multiple queues, parallel servers
  – Adaptive scaling → adding more servers
  – …
* Gross, Donald. Fundamentals of Queueing Theory. John Wiley & Sons, 2008.
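As a concrete instance (standard M/M/1 results, added here for illustration, not from the slides): with Poisson arrivals at rate \(\lambda\) and exponential service at rate \(\mu\), the mean response time of a single queue is

\[
E[T] \;=\; \frac{1}{\mu - \lambda}, \qquad \lambda < \mu .
\]

Partitioning the arrivals across \(k\) parallel queues gives each queue an arrival rate of \(\lambda/k\), so \(E[T] = 1/(\mu - \lambda/k)\); adaptive scaling raises the effective \(\mu\). Both directly reduce the modeled latency.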
Other Techniques: Batch + Stream in One System
• Reprocess parts or all of a stream or database (bugs, upgrades, logic changes)
• → Treat batch as a finite stream, plus conflict resolution, scaling & throttling
Technique: adaptive scaling
Local State — Throughput
• Remote state is 30-150x worse than local state
• On-disk with caching is comparable with in-memory
• The changelog adds minimal overhead

Snapshot vs. Changelog