Upload
flink-forward
View
360
Download
0
Embed Size (px)
Citation preview
Changing Workloads And SLAs
2
Resource Adaption
3
+
time
WorkloadResources
time
WorkloadResources
What Is This Talk About?§ Flink’s approach to dynamic scaling
§ Current state with demo
§ Outlook on next development steps
4
Dynamic Scaling
5
Basic Idea
6
• Spread work across more workers to decrease workload
Scaling Stateless Jobs
7
Scale Up Scale DownSource
Mapper
Sink
• Scale up: Deploy new tasks• Scale down: Cancel running tasks
Scaling Stateful Jobs
8
?
• Problem: Which state to assign to new task?
State in Apache Flink
9
Keyed vs. Non-keyed State
10
• State bound to a key• E.g. Keyed UDF and window state
• State bound to a subtask• E.g. Source state
Keyed Non-keyed
Repartitioning Keyed State§ Similar to consistent
hashing§ Split key space into
key groups§ Assign key groups to
tasks
11
Key space
Key group #1 Key group #2
Key group #3Key group #4
Repartitioning Keyed State contd.
§ Rescaling changes key group assignment
§ Maximum parallelism defined by #key groups
12
Repartitioning Non-keyed state§ User defined merge and
split functions• Most general approach
§ Breaking non-keyed state up into finer granularity• State has to contain
multiple entries• Automatic repartitioning
wrt granularity
13
#1 #2
#3
Repartitioning Non-keyed State contd.
§ Non-keyed state entries gathered at the job manager
§ Repartitioning schemes• Repartition & send• Union & broadcast
14
Example: Kafka Source
15
partitionId: 1, offset: 42partitionId: 3, offset: 10partitionId: 6, offset: 27
• Store offset for each partition• Individual entries are repartitionable
partitionId: 1, offset: 42partitionId: 3, offset: 10partitionId: 6, offset: 27
Rescaling: Why is That so Hard?§ Handling of state§ Repartitioning of keyed & non-keyed
state§ Unique among open source stream
processors, afaik
16
Demo Time
17
Demo Topology
18
Kafka Source Counter
KeyBy
Current State and next Steps
19
Current State
§ Manual rescaling1. Take savepoint2. Stop the job3. Restart job with adjusted parallelism and
savepoint
20
Next Steps§ Integrate savepoint with stop signal§ Rescaling individual operators w/o restart§ Dynamic container de-/allocation• “Running Flink Everywhere” by Stephan
Ewen, 16:45 at Kesselhaus
21
Auto Scaling Policies
22
• Latency• Throughput• Resource utilization
• Kubernetes on GCE, EC2 and Mesos (marathon-autoscale) already support auto-scaling
Conclusion§ Scaling of keyed and non-keyed state§ Flink supports manual rescaling with
restart (WIP branch: https://github.com/tillrohrmann/flink/tree/partitionable-op-state)
§ Future versions might support scaling on the fly and automatic rescaling policies
23