SPARK SUMMIT EUROPE 2016 Distributed Time Series Analysis Framework For Spark Larisa Sawyer Two Sigma


Page 1: Spark Summit EU talk by Larisa Sawyer

SPARK SUMMIT EUROPE 2016

Distributed Time Series Analysis Framework For Spark

Larisa Sawyer, Two Sigma

Page 2: Spark Summit EU talk by Larisa Sawyer

Larisa Sawyer

November 1, 2016

Page 3: Spark Summit EU talk by Larisa Sawyer

Time series examples

• Stock market prices
• Temperatures
• Height
• …

[Figure: S&P 500, x-axis 1/3/1950 to 1/3/2016, y-axis $0.0 to $2,500.0]

[Figure: temperatures for New York and Brussels, y-axis 18°C to 34°C]

[Figure: height (100cm to 180cm) against age (5 to 15 years), series: Avg US female, Larisa]

Page 4: Spark Summit EU talk by Larisa Sawyer

What do we do with time series data?

• Forecast future values given past observations

[Figure: corn price for 10/1 through 10/10, labeled values $8.90, $8.95, $8.90, $9.06, $9.10, followed by question marks]

Page 5: Spark Summit EU talk by Larisa Sawyer

Univariate time series

Page 6: Spark Summit EU talk by Larisa Sawyer

Multivariate time series

• We can forecast better by joining multiple time series
• Our framework enables fast distributed temporal join of large-scale unaligned time series
• Temporal join is a fundamental operation for time series analysis

[Figure: corn price ($8.90, $8.95, $8.90, $9.06, $9.10) and temperature (75°F, 72°F, 71°F, 72°F, 68°F, 67°F, 65°F) for 10/1 through 10/10]

Page 7: Spark Summit EU talk by Larisa Sawyer

Multivariate time series

• We can forecast better by joining multiple time series
• Our framework enables fast distributed temporal join of large-scale unaligned time series
• Temporal join is a fundamental operation for time series analysis

[Figure: corn price (€7.94, €7.98, €7.94, €8.08, €8.12) and temperature (23°C, 22°C, 21°C, 22°C, 20°C, 19°C, 18°C) for 10/1 through 10/10]

Page 8: Spark Summit EU talk by Larisa Sawyer

What is a left join?

[Figure: time series 1 and time series 2]

Page 9: Spark Summit EU talk by Larisa Sawyer

What is temporal join?

• A particular join function defined by a matching criterion over time
• Examples of criteria
  • look-backward
  • look-forward

[Figure: time series 1 matched against time series 2 for each observation, once with look-forward and once with look-backward]

Page 10: Spark Summit EU talk by Larisa Sawyer

Temporal join with look-backward criteria

[Table "tweets" (columns: time, tweets), rows at 08:00 AM, 10:00 AM, 12:00 PM]

[Table "BRK.A" (columns: time, BRK.A), rows at 08:00 AM, 11:00 AM]

Page 11: Spark Summit EU talk by Larisa Sawyer

Important Legal Information


The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes an offer to sell or the solicitation of any offer to buy any security or other interest. We consider this information to be confidential and not for redistribution or dissemination.

Page 12: Spark Summit EU talk by Larisa Sawyer

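Temporal join with look-backward criteria

[Table "tweets" (columns: time, tweets), rows at 08:00 AM, 10:00 AM, 12:00 PM]

[Table "BRK.A" (columns: time, BRK.A), rows at 08:00 AM, 11:00 AM]

[Joined table (columns: time, tweets, BRK.A), rows at 08:00 AM, 10:00 AM, 12:00 PM]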
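The look-backward join on this slide can be sketched in plain Python. This is only an illustration of the idea, not Flint's implementation: the function name `look_backward_join`, the tweet texts, and the prices are all made up, and it assumes a row with exactly the same timestamp counts as a match.

```python
from bisect import bisect_right

def look_backward_join(left, right):
    """For each (time, value) row in `left`, attach the value of the most
    recent `right` row whose time is at or before it (None if there is none).
    Both inputs must be sorted by time."""
    right_times = [t for t, _ in right]
    joined = []
    for t, value in left:
        i = bisect_right(right_times, t)  # number of right rows with time <= t
        match = right[i - 1][1] if i > 0 else None
        joined.append((t, value, match))
    return joined

# Times are minutes since midnight; tweet texts and prices are hypothetical.
tweets = [(8 * 60, "tweet A"), (10 * 60, "tweet B"), (12 * 60, "tweet C")]
brk_a = [(8 * 60, 100.0), (11 * 60, 101.0)]

joined = look_backward_join(tweets, brk_a)
for row in joined:
    print(row)
```

Under these assumptions, the 10:00 AM tweet picks up the 08:00 AM price and the 12:00 PM tweet picks up the 11:00 AM price, so the joined table keeps one row per tweet.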


Page 16: Spark Summit EU talk by Larisa Sawyer

Temporal joins in practice

[Table "tweets" (columns: time, tweets), rows at 08:00 AM, 10:00 AM, 12:00 PM]

[Table "BRK.A" (columns: time, BRK.A), rows at 08:00 AM, 11:00 AM]

Page 17: Spark Summit EU talk by Larisa Sawyer

Time Series Scale

[Table "tweets" (columns: time, tweets), rows at 08:00 AM, 10:00 AM, 12:00 PM]

[Table "BRK.A" (columns: time, BRK.A), rows at 08:00 AM, 11:00 AM]

We need fast and scalable distributed temporal join

Page 18: Spark Summit EU talk by Larisa Sawyer

Existing solutions

• Existing packages don’t support temporal join or can’t handle large time series
  • Pandas / R / Matlab
    • Limited to single machine
  • Spark
    • Does scale, but all data is unordered
  • spark-ts
    • Expects univariate time series to fit on single machine
    • Splits by column
    • Supports only snapshot data

Page 19: Spark Summit EU talk by Larisa Sawyer

Flint: A new time series library for Spark

• Goal
  • Provide a collection of functions to manipulate and analyze time series at scale
  • Group, temporal join, summarize, aggregate …
• How
  • Build a time series aware data structure
    • TimeSeriesRDD extends RDD
  • Optimize using temporal locality
    • Reduce shuffling
    • Reduce memory pressure by streaming

Page 20: Spark Summit EU talk by Larisa Sawyer

What is a TimeSeriesRDD?

• TimeSeriesRDD vs RDD
  • Associates a time range with each partition
  • Tracks partition time ranges
  • Preserves temporal order

Page 21: Spark Summit EU talk by Larisa Sawyer

RDD

Raw Data

time      temperature
6:00 AM   60°F
6:01 AM   61°F
…         …
7:00 AM   70°F
7:01 AM   71°F
…         …
8:00 AM   80°F
8:01 AM   81°F
…         …

RDD

(6:00 AM, 60°F) (6:01 AM, 61°F)
… (7:00 AM, 70°F) (7:01 AM, 71°F)
(8:00 AM, 80°F) (8:01 AM, 81°F)
(6:58 AM, 64°F) (6:59 AM, 65°F)
… (7:34 AM, 74°F) (7:35 AM, 74°F)
… (7:58 AM, 76°F) (7:59 AM, 77°F)

Page 22: Spark Summit EU talk by Larisa Sawyer

TimeSeriesRDD

Raw Data

time      temperature
6:00 AM   60°F
6:01 AM   61°F
…         …
7:00 AM   70°F
7:01 AM   71°F
…         …
8:00 AM   80°F
8:01 AM   81°F
…         …

RDD

(6:00 AM, 60°F) (6:01 AM, 61°F)
… (7:00 AM, 70°F) (7:01 AM, 71°F)
(8:00 AM, 80°F) (8:01 AM, 81°F)
(6:58 AM, 64°F) (6:59 AM, 65°F)
… (7:34 AM, 74°F) (7:35 AM, 74°F)
… (7:58 AM, 76°F) (7:59 AM, 77°F)

TSRDD

[06:00 AM, 07:00 AM): (6:00 AM, 60°F) (6:01 AM, 61°F)
[07:00 AM, 8:00 AM): (7:00 AM, 70°F) (7:01 AM, 71°F)
[8:00 AM, ∞): (8:00 AM, 80°F) (8:01 AM, 81°F)
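As a rough illustration of what "a time range on each partition" means, the sketch below places rows into half-open time ranges like the ones shown for the TSRDD. It is plain Python rather than Spark, and `to_time_partitions` and `hm` are hypothetical helper names, not part of Flint.

```python
from bisect import bisect_right

def to_time_partitions(rows, boundaries):
    """Place (time, value) rows into half-open time ranges.

    `boundaries` holds the sorted start times of the partitions; the last
    partition is unbounded above. Rows before boundaries[0] are rejected.
    """
    partitions = [[] for _ in boundaries]
    for t, value in rows:
        i = bisect_right(boundaries, t) - 1  # index of the last boundary <= t
        if i < 0:
            raise ValueError(f"time {t} precedes the first partition")
        partitions[i].append((t, value))
    for part in partitions:
        part.sort(key=lambda row: row[0])  # keep temporal order inside a partition
    return partitions

def hm(hour, minute=0):
    """Minutes since midnight."""
    return hour * 60 + minute

# Unordered input, as in a plain RDD; temperatures in °F from the slide.
rows = [(hm(7, 1), 71), (hm(6), 60), (hm(8, 1), 81),
        (hm(7), 70), (hm(6, 1), 61), (hm(8), 80)]

# Partition ranges [6:00, 7:00), [7:00, 8:00), [8:00, ∞)
parts = to_time_partitions(rows, [hm(6), hm(7), hm(8)])
```

Once every partition knows its time range and is sorted internally, reading the partitions in order yields the whole series in temporal order.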

Page 23: Spark Summit EU talk by Larisa Sawyer

Group function

• A group function groups rows with exactly the same timestamps

time      city       temperature   group
1:00 PM   New York   70°F          group 1
1:00 PM   Brussels   60°F          group 1
2:00 PM   New York   71°F          group 2
2:00 PM   Brussels   61°F          group 2
3:00 PM   New York   72°F          group 3
3:00 PM   Brussels   62°F          group 3
4:00 PM   New York   73°F          group 4
4:00 PM   Brussels   63°F          group 4

Page 24: Spark Summit EU talk by Larisa Sawyer

Group function

• A group function groups rows with nearby timestamps

time      city       temperature
1:00 PM   New York   70°F
1:00 PM   Brussels   60°F
2:00 PM   New York   71°F
2:00 PM   Brussels   61°F
3:00 PM   New York   72°F
3:00 PM   Brussels   62°F
4:00 PM   New York   73°F
4:00 PM   Brussels   63°F

The rows are bracketed into group 1 and group 2.
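The two grouping behaviors can be sketched with one small Python function. The slides do not say how "nearby" is defined, so this sketch assumes fixed-width buckets anchored at the first timestamp. The function name `group_by_time` and its `interval` parameter are hypothetical.

```python
def group_by_time(rows, interval=None):
    """Group (time, city, temperature) rows by timestamp.

    interval=None: rows must share exactly the same timestamp.
    interval=n:    rows fall into buckets of width n, anchored at the first
                   row's timestamp (an assumption made for this sketch).
    Rows are expected in time order.
    """
    if not rows:
        return []
    origin = rows[0][0]
    groups = {}  # dicts keep insertion order, so groups stay in time order
    for row in rows:
        t = row[0]
        key = t if interval is None else origin + ((t - origin) // interval) * interval
        groups.setdefault(key, []).append(row)
    return list(groups.values())

# Hours on a 24-hour clock: 13 = 1:00 PM ... 16 = 4:00 PM; temperatures in °F.
rows = [
    (13, "New York", 70), (13, "Brussels", 60),
    (14, "New York", 71), (14, "Brussels", 61),
    (15, "New York", 72), (15, "Brussels", 62),
    (16, "New York", 73), (16, "Brussels", 63),
]

exact = group_by_time(rows)               # one group per timestamp
nearby = group_by_time(rows, interval=2)  # two-hour buckets
```

With exact matching, the eight rows form four groups of two. With two-hour buckets, they form two groups of four.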

Page 25: Spark Summit EU talk by Larisa Sawyer

Group in Spark

• Groups rows with exactly the same timestamps

[Figure: RDD with rows in the order 1:00PM, 2:00PM, 2:00PM, 1:00PM, 3:00PM, 3:00PM, 4:00PM, 4:00PM]

Page 26: Spark Summit EU talk by Larisa Sawyer

Group in Spark

• Data is shuffled and materialized on the workers
• Temporal order is not preserved
• Back to temporal order

[Figure: RDD (1:00PM, 2:00PM, 2:00PM, 1:00PM, 3:00PM, 3:00PM, 4:00PM, 4:00PM) → groupBy → RDD → sortBy → RDD]

Page 27: Spark Summit EU talk by Larisa Sawyer

Group in TimeSeriesRDD

• Data is grouped per partition locally as streams

[Figure: TimeSeriesRDD with rows at 1:00PM, 2:00PM, 3:00PM, and 4:00PM]
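Grouping "locally as streams" can be pictured with a lazy generator. The sketch assumes each partition is already sorted by time and that no timestamp is split across two partitions. It is plain Python with a hypothetical function name `stream_groups`, not Spark or Flint code.

```python
from itertools import groupby

def stream_groups(partitions):
    """Lazily yield (time, values) groups from time-ordered partitions.

    Each partition is an iterable of (time, value) rows sorted by time.
    A group is emitted as soon as the timestamp changes, so no shuffle
    and no full materialization of the data is needed.
    """
    for partition in partitions:
        for t, rows in groupby(partition, key=lambda row: row[0]):
            yield t, [value for _, value in rows]

# Hours on a 24-hour clock; the letter values are placeholders.
partitions = [
    [(13, "a"), (13, "b"), (14, "c"), (14, "d")],
    [(15, "e"), (15, "f"), (16, "g"), (16, "h")],
]

counts = [(t, len(values)) for t, values in stream_groups(partitions)]
```

Because rows with equal timestamps are adjacent, only the current group has to be held in memory at any time.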

Page 28: Spark Summit EU talk by Larisa Sawyer

Benchmark for group + count

• Running time of count after group
• 16 executors (10G memory and 4 cores per executor)
• Data read from HDFS

[Figure: bar chart of running time (0s to 100s) for input sizes 20M, 40M, 60M, 80M, and 100M; series: TimeseriesRDD, DataFrame, RDD; annotations: 50 - 100X and 5 - 10X]

Page 29: Spark Summit EU talk by Larisa Sawyer

Temporal join

• A temporal join function is defined by a matching criterion over time
• A typical matching criterion has two parameters
  • direction: look-backward or look-forward
  • window: how far to look backward or forward

[Figure: look-backward temporal join with its window]
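The two parameters can be read as defining, for each row, an interval of acceptable timestamps in the other series. A minimal sketch follows. The names `match_window` and `candidates` are hypothetical, and treating both interval endpoints as inclusive is an assumption.

```python
def match_window(t, direction, window):
    """Return the closed interval (lo, hi) of timestamps that a row at
    time `t` may match under the given direction and window size."""
    if window < 0:
        raise ValueError("window must be non-negative")
    if direction == "look-backward":
        return (t - window, t)
    if direction == "look-forward":
        return (t, t + window)
    raise ValueError(f"unknown direction: {direction!r}")

def candidates(t, other_times, direction, window):
    """All timestamps in `other_times` that fall inside the match window."""
    lo, hi = match_window(t, direction, window)
    return [u for u in other_times if lo <= u <= hi]

# Hours since midnight.
other = [1, 3, 5]
print(candidates(2, other, "look-backward", 1))
print(candidates(2, other, "look-forward", 1))
```

A join would then pick one candidate per row, for example the closest one. Which candidate is picked is a separate design decision that this sketch leaves open.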


Page 31: Spark Summit EU talk by Larisa Sawyer

Temporal join

• Temporal join with criteria look-back and window of 1 hour

[Figure: time series 1 (1:00AM, 2:00AM, 4:00AM, 5:00AM) and time series 2 (1:00AM, 3:00AM, 5:00AM)]

Page 32: Spark Summit EU talk by Larisa Sawyer

Temporal join

• Temporal join with criteria look-back and window of 1 hour
• How do we do temporal join in TimeSeriesRDD?

[Figure: two TimeSeriesRDDs, one with rows at 1:00AM, 2:00AM, 4:00AM, 5:00AM and one with rows at 1:00AM, 3:00AM, 5:00AM]

Page 33: Spark Summit EU talk by Larisa Sawyer

Temporal join in TimeSeriesRDD

• Temporal join with criteria look-back and window of 1 hour
• Partition time space into disjoint intervals

[Figure: two input TimeSeriesRDDs (1:00AM, 2:00AM, 4:00AM, 5:00AM and 1:00AM, 3:00AM, 5:00AM) and the joined TimeSeriesRDD]

Page 34: Spark Summit EU talk by Larisa Sawyer

Temporal join in TimeSeriesRDD

• Temporal join with criteria look-back and window of 1 hour
• Build dependency graph for the joined TimeSeriesRDD

[Figure: two input TimeSeriesRDDs (1:00AM, 2:00AM, 4:00AM, 5:00AM and 1:00AM, 3:00AM, 5:00AM) and the joined TimeSeriesRDD; partition ranges [1:00 AM, 4:00 AM) and [4:00 AM, 6:00 AM) are shown twice]
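One way to picture the dependency graph: for a look-backward join, a left partition covering [begin, end) can only match right rows in [begin - window, end), so it depends on every right partition overlapping that span. The sketch below computes those dependencies for the ranges on the slide. It is an assumption about the general idea, not Flint's actual algorithm, and `overlapping_partitions` is a hypothetical name.

```python
def overlapping_partitions(left_range, right_ranges, window):
    """Indices of right partitions that a left partition needs for a
    look-backward join with the given window.

    Ranges are half-open (begin, end) pairs. Rows in [begin, end) can match
    right rows as early as begin - window, so the needed span is
    [begin - window, end).
    """
    need_begin, need_end = left_range[0] - window, left_range[1]
    return [
        i for i, (begin, end) in enumerate(right_ranges)
        if begin < need_end and need_begin < end  # half-open overlap test
    ]

# Hours since midnight: [1:00 AM, 4:00 AM) and [4:00 AM, 6:00 AM).
ranges = [(1, 4), (4, 6)]
deps = {i: overlapping_partitions(r, ranges, window=1)
        for i, r in enumerate(ranges)}
print(deps)
```

With a one-hour window, the second left partition needs the tail of the first right partition as well as its own counterpart. That is how the 4:00AM row can be matched with the 3:00AM row.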

Page 35: Spark Summit EU talk by Larisa Sawyer

Temporal join in TimeSeriesRDD

• Temporal join with criteria look-back and window of 1 hour
• Join data as streams per partition

[Figure: two input TimeSeriesRDDs (1:00AM, 2:00AM, 4:00AM, 5:00AM and 1:00AM, 3:00AM, 5:00AM) and the joined TimeSeriesRDD; joined so far: 1:00AM with 1:00AM]

Page 36: Spark Summit EU talk by Larisa Sawyer

Temporal join in TimeSeriesRDD

• Temporal join with criteria look-back and window of 1 hour
• Join data as streams

[Figure: two input TimeSeriesRDDs (1:00AM, 2:00AM, 4:00AM, 5:00AM and 1:00AM, 3:00AM, 5:00AM) and the joined TimeSeriesRDD; joined so far: 1:00AM with 1:00AM, 2:00AM with 1:00AM]

Page 37: Spark Summit EU talk by Larisa Sawyer

Temporal join in TimeSeriesRDD

• Temporal join with criteria look-back and window of 1 hour
• Join data as streams

[Figure: two input TimeSeriesRDDs (1:00AM, 2:00AM, 4:00AM, 5:00AM and 1:00AM, 3:00AM, 5:00AM) and the joined TimeSeriesRDD; joined so far: 1:00AM with 1:00AM, 2:00AM with 1:00AM, 4:00AM with 3:00AM]

Page 38: Spark Summit EU talk by Larisa Sawyer

Temporal join in TimeSeriesRDD

• Temporal join with criteria look-back and window of 1 hour
• Join data as streams

[Figure: two input TimeSeriesRDDs (1:00AM, 2:00AM, 4:00AM, 5:00AM and 1:00AM, 3:00AM, 5:00AM) and the joined TimeSeriesRDD; joined: 1:00AM with 1:00AM, 2:00AM with 1:00AM, 4:00AM with 3:00AM, 5:00AM with 5:00AM]
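The stream-style join on these slides can be approximated with a single pass over two sorted inputs. This is a sketch of the general merge technique, not Flint's code. `stream_look_backward_join` is a hypothetical name, and it assumes the latest timestamp inside the window is the one to match, with both window ends inclusive.

```python
def stream_look_backward_join(left, right, window):
    """Yield (left_time, right_time) pairs, where right_time is the latest
    right timestamp in [left_time - window, left_time], or None.

    Both inputs are iterables of timestamps in ascending order. Each is
    consumed once, so only one pending right timestamp is buffered.
    """
    right_iter = iter(right)
    latest = None                     # latest right timestamp <= current left
    pending = next(right_iter, None)  # first right timestamp not yet passed
    for t in left:
        # Advance the right stream up to, but not beyond, the left timestamp.
        while pending is not None and pending <= t:
            latest = pending
            pending = next(right_iter, None)
        if latest is not None and t - latest <= window:
            yield (t, latest)
        else:
            yield (t, None)

# Hours since midnight, taken from the slide.
series_1 = [1, 2, 4, 5]  # 1:00AM, 2:00AM, 4:00AM, 5:00AM
series_2 = [1, 3, 5]     # 1:00AM, 3:00AM, 5:00AM

joined = list(stream_look_backward_join(series_1, series_2, window=1))
print(joined)
```

With a one-hour window this gives the same pairing as the final slide of the walkthrough: 1:00AM with 1:00AM, 2:00AM with 1:00AM, 4:00AM with 3:00AM, and 5:00AM with 5:00AM.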

Page 39: Spark Summit EU talk by Larisa Sawyer

Benchmark for temporal join + count

• Running time of count after temporal join
• 16 executors (10G memory and 4 cores per executor)
• Data read from HDFS

[Figure: bar chart of running time (0s to 100s) for input sizes 20M, 40M, 60M, 80M, and 100M; series: TimeseriesRDD, DataFrame, RDD; annotations: 20 - 50X and 5 - 10X]

Page 40: Spark Summit EU talk by Larisa Sawyer

Functions over TimeSeriesRDD

• Grouping functions
• Temporal joins such as look-forward, look-backward, etc.
• Summarizers such as average, variance, z-score, etc. over grouping functions
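To make "summarizers over grouping functions" concrete, here is a small Python sketch that computes the three named summaries for each group of rows sharing a timestamp. The slides do not define the z-score variant, so this version uses the population standard deviation and scores every value in its group. `summarize` is a hypothetical name, not a Flint function.

```python
from statistics import mean, pstdev, pvariance

def summarize(values):
    """Average, population variance, and per-value z-scores for one group."""
    mu = mean(values)
    sigma = pstdev(values)
    # A group with zero spread gets z-scores of 0.0 instead of dividing by zero.
    z_scores = [(v - mu) / sigma if sigma else 0.0 for v in values]
    return {"average": mu, "variance": pvariance(values), "z_scores": z_scores}

# Temperatures in °F for New York and Brussels, grouped by exact timestamp.
groups = {"1:00 PM": [70.0, 60.0], "2:00 PM": [71.0, 61.0]}
summary = {t: summarize(vals) for t, vals in groups.items()}
```

Each timestamp group is summarized independently, so the same function works whether the groups come from exact or nearby-timestamp grouping.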

Page 41: Spark Summit EU talk by Larisa Sawyer

Open Source

• True!
• https://github.com/twosigma/flint

Page 42: Spark Summit EU talk by Larisa Sawyer

What’s next?

• TimeSeriesDataframe / TimeSeriesDataset
• Speed up
• Richer APIs
• Python bindings
• Additional summarizers

Page 43: Spark Summit EU talk by Larisa Sawyer

Key contributors

• Christopher Aycock
• Yuri Bogomolov
• Jonathan Coveney
• Li Jin
• David Medina
• Julia Meinwald
• David Palaitis
• Larisa Sawyer
• Leif Walsh
• Wenbo Zhao

Page 44: Spark Summit EU talk by Larisa Sawyer

Flint: Time Series For Spark


A library to solve for general time series analysis operations at massive scale

Anne Hathaway has nothing to do with Berkshire Hathaway

Check it out in open source, and contribute

https://github.com/twosigma/flint


Page 45: Spark Summit EU talk by Larisa Sawyer

SPARK SUMMIT EUROPE 2016

THANK YOU