65
Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri , John Liagouris , Moritz Hoffmann , Desislava Dimitrova , Matthew Forshaw †† , Timothy Roscoe Support: Systems Group, Department of Computer Science, ETH Zürich, fi[email protected] †† Newcastle University, fi[email protected]

fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

  • Upload
    others

  • View
    14

  • Download
    0

Embed Size (px)

Citation preview

Page 1: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

Three steps is all you need fast, accurate, automatic scaling decisions

for distributed streaming dataflows

Vasiliki Kalavri†, John Liagouris†, Moritz Hoffmann†,Desislava Dimitrova†, Matthew Forshaw††, Timothy Roscoe†

Support:

†Systems Group, Department of Computer Science, ETH Zürich, [email protected]

††Newcastle University, [email protected]

Page 2: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

Any streaming job will inevitably become over- or under-provisioned in the future

2

Page 3: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

Any streaming job will inevitably become over- or under-provisioned in the future

even

ts/s

time

load shedding

: input rate : throughput

2

Page 4: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

even

ts/s

time

idle resources

Any streaming job will inevitably become over- or under-provisioned in the future

even

ts/s

time

load shedding

: input rate : throughput

2

Page 5: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

even

ts/s

time

idle resources

even

ts/s

time

SLO violation

Any streaming job will inevitably become over- or under-provisioned in the future

even

ts/s

time

load shedding

: input rate : throughput

2

Page 6: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

3

input streams

parallel dataflow

output stream

CONFIGURING PARALLELISM FOR A STREAMING JOB

Page 7: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

3

input streams

parallel dataflow

output stream

1. monitor event rates

2. configure parallelism

3. deploy and test performance

until the target throughput is met

CONFIGURING PARALLELISM FOR A STREAMING JOB

Page 8: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

THE SCALING PROBLEM

4

Given a logical dataflow with sources S1, S2, … Sn and rates λ1, λ2, … λn identify the minimum parallelism πi per operator i, such that the physical

dataflow can sustain all source rates.

S1

S2

λ1

λ2

S1

S2

π=2

π=3

logical dataflow physical dataflow

Page 9: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

5

scaling controllermetrics

policy

scaling action

AUTOMATIC SCALING OVERVIEW

Page 10: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

5

scaling controllermetrics

policy

scaling action

detect symptoms

AUTOMATIC SCALING OVERVIEW

Page 11: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

5

scaling controllermetrics

policy

scaling action

detect symptoms

decide whether to scale

AUTOMATIC SCALING OVERVIEW

Page 12: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

5

scaling controllermetrics

policy

scaling action

detect symptoms

decide whether to scale

decide how much to scale

AUTOMATIC SCALING OVERVIEW

Page 13: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

6

policy scaling actionBorealis

StreamCloud

Seep

IBM Streams

Spark Streaming

Google Dataflow

Dhalion

Existing approachesmetrics

Page 14: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

6

policy scaling action

CPU utilizationbacklog, tuples/s

backpressure signal

Borealis

StreamCloud

Seep

IBM Streams

Spark Streaming

Google Dataflow

Dhalion

Existing approachesmetrics

Page 15: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

6

policy scaling action

CPU utilizationbacklog, tuples/s

backpressure signal

threshold and rule-basedif CPU > 80% => scale

Borealis

StreamCloud

Seep

IBM Streams

Spark Streaming

Google Dataflow

Dhalion

Existing approachesmetrics

Page 16: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

6

policy scaling action

CPU utilizationbacklog, tuples/s

backpressure signal

threshold and rule-basedif CPU > 80% => scale

small changes,one operator at a time

Borealis

StreamCloud

Seep

IBM Streams

Spark Streaming

Google Dataflow

Dhalion

Existing approachesmetrics

Page 17: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

6

policy scaling action

CPU utilizationbacklog, tuples/s

backpressure signal

threshold and rule-basedif CPU > 80% => scale

small changes,one operator at a time

problematic due to interference, multitenancy

Borealis

StreamCloud

Seep

IBM Streams

Spark Streaming

Google Dataflow

Dhalion

Existing approachesmetrics

Page 18: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

6

policy scaling action

CPU utilizationbacklog, tuples/s

backpressure signal

threshold and rule-basedif CPU > 80% => scale

small changes,one operator at a time

problematic due to interference, multitenancy

sensitive to noise, manual, hard to tune

Borealis

StreamCloud

Seep

IBM Streams

Spark Streaming

Google Dataflow

Dhalion

Existing approachesmetrics

Page 19: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

6

policy scaling action

CPU utilizationbacklog, tuples/s

backpressure signal

threshold and rule-basedif CPU > 80% => scale

small changes,one operator at a time

problematic due to interference, multitenancy

sensitive to noise, manual, hard to tune

non-predictive, speculative steps

Borealis

StreamCloud

Seep

IBM Streams

Spark Streaming

Google Dataflow

Dhalion

Existing approachesmetrics

Page 20: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

6

policy scaling action

CPU utilizationbacklog, tuples/s

backpressure signal

threshold and rule-basedif CPU > 80% => scale

small changes,one operator at a time

problematic due to interference, multitenancy

sensitive to noise, manual, hard to tune

non-predictive, speculative steps

Borealis

StreamCloud

Seep

IBM Streams

Spark Streaming

Google Dataflow

Dhalion

Existing approachesmetrics

oscillations

Page 21: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

6

policy scaling action

CPU utilizationbacklog, tuples/s

backpressure signal

threshold and rule-basedif CPU > 80% => scale

small changes,one operator at a time

problematic due to interference, multitenancy

sensitive to noise, manual, hard to tune

non-predictive, speculative steps

Borealis

StreamCloud

Seep

IBM Streams

Spark Streaming

Google Dataflow

Dhalion

Existing approachesmetrics

oscillationstemporary over- and under-provisioning

Page 22: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

6

policy scaling action

CPU utilizationbacklog, tuples/s

backpressure signal

threshold and rule-basedif CPU > 80% => scale

small changes,one operator at a time

problematic due to interference, multitenancy

sensitive to noise, manual, hard to tune

non-predictive, speculative steps

Borealis

StreamCloud

Seep

IBM Streams

Spark Streaming

Google Dataflow

Dhalion

Existing approachesmetrics

oscillations slow convergence

temporary over- and under-provisioning

Page 23: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

7

effect of Dhalion’s scaling actions in an initially under-provisioned wordcount dataflow

Page 24: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

7

effect of Dhalion’s scaling actions in an initially under-provisioned wordcount dataflow

12

3 654

Page 25: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

8

externally observed

threshold-based

non-predictive, single-operator

policy

scaling action

metrics

Page 26: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

true rates through instrumentation

dataflow dependency model

predictive, dataflow-wide

actions

externally observed

threshold-based

non-predictive, single-operator

policy

scaling action

metrics

8

OUR APPROACH: DS2

Page 27: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

no oscillations

fast convergence

true rates as bounds to avoid

over/under-shoot

true rates through instrumentation

dataflow dependency model

predictive, dataflow-wide

actions

externally observed

threshold-based

non-predictive, single-operator

policy

scaling action

metrics

8

OUR APPROACH: DS2

Page 28: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

o1src o2

10 rec/s 100 rec/s

backpressuretarget: 40 rec/s

9

Page 29: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

o1src o2

10 rec/s 100 rec/s

backpressuretarget: 40 rec/s

9

observed view

Which operator is the bottleneck?

What if we scale ο1 x 4?

How much to scale ο2?

Page 30: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

10

o1src o2

10 rec/s 100 rec/s

backpressuretarget: 40 rec/s

observed view

Page 31: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

10

o1src o2

10 rec/s 100 rec/s

backpressuretarget: 40 rec/s

observed view

Page 32: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

src

o1

o2

Time (s)

10 10 10

1 2 3 4

100 100: busy: waiting

0.5s

instrumentation

DS2 view

10

o1src o2

10 rec/s 100 rec/s

backpressuretarget: 40 rec/s

observed view

Page 33: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

src

o1

o2

Time (s)

10 10 10

1 2 3 4

100 100: busy: waiting

0.5s

instrumentation

DS2 view

10

o1src o2

10 rec/s 100 rec/s

backpressuretarget: 40 rec/s

observed view

ο1 is the bottleneck

Page 34: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

src

o1

o2

Time (s)

10 10 10

1 2 3 4

100 100: busy: waiting

0.5s

instrumentation

DS2 view

10

o1src o2

10 rec/s 100 rec/s

backpressuretarget: 40 rec/s

observed view

ο1 is the bottleneck

true rate = 200 rec/s

2 ο2 instances can keep up with the rate

of 4 ο1 instances

Page 35: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

THE DS2 MODEL

11

Page 36: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

THE DS2 MODELUseful time: The time spent by an operator instance in deserialization, processing, and serialization activities.

11

Page 37: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

THE DS2 MODELUseful time: The time spent by an operator instance in deserialization, processing, and serialization activities.

11

True processing (resp. output) rate: The number of records an operator instance can process (resp. output) per unit of useful time.

Page 38: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

THE DS2 MODELUseful time: The time spent by an operator instance in deserialization, processing, and serialization activities.

11

True processing (resp. output) rate: The number of records an operator instance can process (resp. output) per unit of useful time.

Optimal parallelism for oi

aggregated true output rate of upstream opsaverage true processing rate of oi

:

Page 39: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

CONVERGENCE STEPS

12

parallelism

initial rate

target• no overshoot when scaling up: ideal rate is an upper bound

• no undershoot when scaling down: ideal rate is a lower bound

if the actual scaling is linear, convergence takes one step

p0

x

Page 40: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

CONVERGENCE STEPS

12

parallelism

initial rate

targetpre

dictio

n

• no overshoot when scaling up: ideal rate is an upper bound

• no undershoot when scaling down: ideal rate is a lower bound

if the actual scaling is linear, convergence takes one step

p0

x

Page 41: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

CONVERGENCE STEPS

12

parallelism

initial rate

targetpre

dictio

n

• no overshoot when scaling up: ideal rate is an upper bound

• no undershoot when scaling down: ideal rate is a lower bound

if the actual scaling is linear, convergence takes one step

p0 p1

x

x

Page 42: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

CONVERGENCE STEPS

12

parallelism

initial rate

targetpre

dictio

n

• no overshoot when scaling up: ideal rate is an upper bound

• no undershoot when scaling down: ideal rate is a lower bound

actu

alif the actual scaling is linear,

convergence takes one step

p0 p1

x

x

Page 43: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

13

In practice, rates are commonly sub-linear due to other overheads (e.g. worker coordination).

parallelism

initial rate

target

x

predic

tion

x

p0 p1

CONVERGENCE STEPS

Page 44: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

13

In practice, rates are commonly sub-linear due to other overheads (e.g. worker coordination).

parallelism

initial rate

target

x

predic

tion

xactual

p0 p1

CONVERGENCE STEPS

Page 45: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

13

In practice, rates are commonly sub-linear due to other overheads (e.g. worker coordination).

parallelism

initial rate

target

x

predic

tion

xactual

x error

when the actual scaling is sub-linear, convergence takes more than one steps p0 p1

CONVERGENCE STEPS

Page 46: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

14

In practice, rates are commonly sub-linear due to other overheads (e.g. worker coordination).

parallelism

p0

initial rate

target

p1

predic

tion

actual

xx

p2

In our experiments, DS2 took up to three steps to converge

for complex queries.

x

p3

CONVERGENCE STEPS

Page 47: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

15

DS2 operates online in a reactive setting

Instrumented stream processor

Scaling Manager Scaling Policy

Metrics Repository

invoke

re-scale job

report metrics

monitor

pull metrics

decision

Timely dataflow

Apache Flink

Page 48: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

EVALUATION

Page 49: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

DS2 VS. DHALION ON HERON

17

wordcount

Target rate: 16.700 rec/s

Page 50: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

DS2 VS. DHALION ON HERON

17

wordcount

DS2 converges in a single step for both operators

Target rate: 16.700 rec/s

Page 51: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

DS2 VS. DHALION ON HERON

17

wordcount

DS2 converges in a single step for both operators

DS2 converges in 60s, i.e. as soon as it receives the Heron

metrics

Target rate: 16.700 rec/s

Page 52: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

DS2 VS. DHALION ON HERON

17

wordcount

DS2 converges in a single step for both operators

Dhalion scales one operator at a time, resulting to a total

of six steps

DS2 converges in 60s, i.e. as soon as it receives the Heron

metrics

Target rate: 16.700 rec/s

Page 53: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

DS2 VS. DHALION ON HERON

17

wordcount

DS2 converges in a single step for both operators

Dhalion scales one operator at a time, resulting to a total

of six steps

DS2 converges in 60s, i.e. as soon as it receives the Heron

metrics

Dhalion converges in 2000s

Target rate: 16.700 rec/s

Page 54: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

DS2 VS. DHALION ON HERON

17

wordcount

DS2 converges in a single step for both operators

Dhalion scales one operator at a time, resulting to a total

of six steps

DS2 converges in 60s, i.e. as soon as it receives the Heron

metrics

Dhalion converges in 2000s

+10 counts

+12 mappers

Target rate: 16.700 rec/s

Page 55: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

DS2 ON APACHE FLINK

18

wordcount

Target rate: 2.000.000 rec/s

Page 56: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

DS2 ON APACHE FLINK

18

wordcount

Apache Flink savepoint and

reconfiguration takes ~30s

Target rate: 2.000.000 rec/s

Page 57: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

DS2 ON APACHE FLINK

18

wordcount

Apache Flink savepoint and

reconfiguration takes ~30s

DS2 converges in two steps for both operators Target rate: 2.000.000 rec/s

Page 58: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

DS2 ON APACHE FLINK

18

wordcount

Apache Flink savepoint and

reconfiguration takes ~30s

DS2 converges in two steps for both operators

DS2 reacts 3s after the target

rate has changed

Target rate: 2.000.000 rec/s

Page 59: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

CONVERGENCE - NEXMARK

initial parallelism

Q1: flatmap

Q2: filter

Q3: incremental

join

Q5: tumbling window

join

Q8: sliding

window

Q11: session window

8 => 12 => 16 11 => 13 => 14 16 => 20 14 => 15 => 16 10 12 => 22 => 28

12 => 16 14 18 => 20 16 10 22 => 28

16 => 16 12 => 14 20 16 8 => 10 26 => 28

20 => 16 13 => 14 20 14 => 16 8 => 10 28

24 => 16 14 20 14 => 16 8 => 10 28

28 => 16 14 20 13 => 16 8 => 10 28

=> : scaling action

19

Page 60: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

CONVERGENCE - NEXMARK

initial parallelism

Q1: flatmap

Q2: filter

Q3: incremental

join

Q5: tumbling window

join

Q8: sliding

window

Q11: session window

8 => 12 => 16 11 => 13 => 14 16 => 20 14 => 15 => 16 10 12 => 22 => 28

12 => 16 14 18 => 20 16 10 22 => 28

16 => 16 12 => 14 20 16 8 => 10 26 => 28

20 => 16 13 => 14 20 14 => 16 8 => 10 28

24 => 16 14 20 14 => 16 8 => 10 28

28 => 16 14 20 13 => 16 8 => 10 28

=> : scaling action

scale-up

19

Page 61: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

CONVERGENCE - NEXMARK

initial parallelism

Q1: flatmap

Q2: filter

Q3: incremental

join

Q5: tumbling window

join

Q8: sliding

window

Q11: session window

8 => 12 => 16 11 => 13 => 14 16 => 20 14 => 15 => 16 10 12 => 22 => 28

12 => 16 14 18 => 20 16 10 22 => 28

16 => 16 12 => 14 20 16 8 => 10 26 => 28

20 => 16 13 => 14 20 14 => 16 8 => 10 28

24 => 16 14 20 14 => 16 8 => 10 28

28 => 16 14 20 13 => 16 8 => 10 28

=> : scaling action

scale-up

scale-down

19

Page 62: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

CONVERGENCE - NEXMARK

initial parallelism

Q1: flatmap

Q2: filter

Q3: incremental

join

Q5: tumbling window

join

Q8: sliding

window

Q11: session window

8 => 12 => 16 11 => 13 => 14 16 => 20 14 => 15 => 16 10 12 => 22 => 28

12 => 16 14 18 => 20 16 10 22 => 28

16 => 16 12 => 14 20 16 8 => 10 26 => 28

20 => 16 13 => 14 20 14 => 16 8 => 10 28

24 => 16 14 20 14 => 16 8 => 10 28

28 => 16 14 20 13 => 16 8 => 10 28

a single step for many queries and initial configurations

=> : scaling action

scale-up

scale-down

19

Page 63: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

CONVERGENCE - NEXMARK

initial parallelism

Q1: flatmap

Q2: filter

Q3: incremental

join

Q5: tumbling window

join

Q8: sliding

window

Q11: session window

8 => 12 => 16 11 => 13 => 14 16 => 20 14 => 15 => 16 10 12 => 22 => 28

12 => 16 14 18 => 20 16 10 22 => 28

16 => 16 12 => 14 20 16 8 => 10 26 => 28

20 => 16 13 => 14 20 14 => 16 8 => 10 28

24 => 16 14 20 14 => 16 8 => 10 28

28 => 16 14 20 13 => 16 8 => 10 28

at most 3 steps

a single step for many queries and initial configurations

=> : scaling action

scale-up

scale-down

19

Page 64: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

RECAP

20

Observed metrics threshold-based policies can lead to

oscillations, misconfiguration and slow convergence.

DS2 uses instrumentation to measure true processing and

output rates and estimate parallelism for all operators at once.

Stream processor

Scaling Manager Scaling Policy

Metrics Repositor

invoke

re-scale job

report metrics

monitor

pull metrics

decision

Timely dataflow

DS2 makes fast and accurate scaling decisions and converges in up to three steps even for non-

linear, complex dataflows.

https://github.com/strymon-system/ds2

Page 65: fast, accurate, automatic scaling decisions for ... · Three steps is all you need fast, accurate, automatic scaling decisions for distributed streaming dataflows Vasiliki Kalavri†,

Three steps is all you need fast, accurate, automatic scaling decisions

for distributed streaming dataflows

Vasiliki Kalavri†, John Liagouris†, Moritz Hoffmann†,Desislava Dimitrova†, Matthew Forshaw††, Timothy Roscoe†

Support:

†Systems Group, Department of Computer Science, ETH Zürich, [email protected]

††Newcastle University, [email protected]