53
PySpark for Time Series Analysis David Palaitis Two Sigma Investments

New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Embed Size (px)

Citation preview

Page 1: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

PySpark for Time Series Analysis

David Palaitis Two Sigma Investments

Page 2: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

About Me

Page 3: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Important Legal InformationThe information presented here is offered for recruiting purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes an offer to sell or the solicitation of any offer to buy any security or other interest. We consider this information to be confidential and not for redistribution or dissemination. Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.

Page 4: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 5: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Time Series

IOT feeds

sensor data

economic data

An ordered sequence of values of a variable

Page 6: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Time Series Analysis

Page 7: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Time Series Analysis

Page 8: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Time Series Analysis

Page 9: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Time Series at Two Sigma

Millions of Time Series

Big and Small

(1GB – 1PB)

Narrow (10 columns) and Wide (1MM Columns)

Evenly and Unevenly

Spaced Observations

Page 10: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 11: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Let’s start from the beginning …

Page 12: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 13: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 14: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 15: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 16: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 17: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 18: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 19: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 20: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 21: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Examples!

Page 22: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

What’s Missing?

Page 23: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

You can’t even do “Word Count”

Page 24: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 25: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 26: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 27: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 28: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 29: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 30: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 31: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 32: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

“Word Count” !

Page 33: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

What’s missing? Time.

Page 34: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Windowed Aggregations

Page 35: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Temporal Joins

} window

Page 36: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 37: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 38: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 39: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 40: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

w is a window specification e.g. 500ms, 5s, 3 business days

RDD[(K,V)] -> RDD[(K,Seq[V])]

Page 41: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

reduceByWindow(f: (V, V) => V, w):

RDD[(K, W)] => RDD[(K, V)]

Page 42: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

reduceByWindow(f: (V, V) => V, w):

RDD[(K, V)] => RDD[(K, V)]

Page 43: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

https://github.com/twosigma/flint

Page 44: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Getting Started …

Page 45: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 46: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 47: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 48: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 49: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 50: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 51: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Page 52: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Looking ahead.

Page 53: New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis

Thank You.Find me after the talk to see Flint in action.