Temporal Operators for Spark Streaming Zhong Chen Microsoft


Page 1: Temporal operators for spark streaming

Temporal Operators for Spark Streaming

Zhong Chen, Microsoft

Page 2: Temporal operators for spark streaming

[Figure: pipeline diagram on Spark Streaming. Inputs: Raw Signals, Topology Data. Stages: normalize, reorder, aggregate, monitor. Outputs: Alerts, Recovery Actions. Label: "Temporal operators".]

val reordered = normalized.reorder(Policy.ReorderAdjustAndDrop(Seconds(60), Seconds(120)))

Page 3: Temporal operators for spark streaming


val aggregated = reordered.aggregate(
  TumblingWindow(5 minutes),
  groupBy(event.rack, event.site, event.region),
  Sum(event => event.isSuccess),
  Count(),
  Avg(event => event.latency))
val availabilityStats = aggregated.map{…}

Page 4: Temporal operators for spark streaming


val monitoringStats = availabilityStats.join(
  availabilityStats,
  left => JoinKey(left.topologyScopeValue),
  right => JoinKey(right.topologyScopeValue),
  (left, right) => left.availability < 0.99 && right.availability >= 0.99,
  TimeDiff(Seconds(-300), Seconds(0)))

Page 5: Temporal operators for spark streaming

Operators
• Filter, Projection
• Windowed Aggregates
  • Tumbling, Hopping, Session windows
• Temporal joins
  • Inner, left outer
• Temporal analytic functions
  • Lag, Last
• Reference data joins

Page 6: Temporal operators for spark streaming

Tumbling Windows

SELECT Topic, Count(*)
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, TumblingWindow(Duration(second, 5))

“Every 5 seconds give me the count of tweets by topic”

[Figure: timeline with ticks at 0, 5, 10, 15, 20; axis: Time (secs). Caption: A 5-second Tumbling Window]

A series of fixed-sized, non-overlapping and contiguous time intervals
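The window assignment behind this query can be sketched in plain Scala, without Spark. This is only an illustration: the Tweet type, the function names, and the choice of half-open windows [start, start + size) are assumptions, not taken from the slides.

```scala
object TumblingWindowSketch {
  // Hypothetical event type; createdAtSec is the event time in seconds.
  final case class Tweet(topic: String, createdAtSec: Long)

  // Start of the tumbling window containing the timestamp.
  // Windows are assumed to be [start, start + size), so every event
  // belongs to exactly one window.
  def windowStart(tsSec: Long, sizeSec: Long): Long =
    java.lang.Math.floorDiv(tsSec, sizeSec) * sizeSec

  // Count tweets per (window start, topic).
  def countByTopic(tweets: Seq[Tweet], sizeSec: Long): Map[(Long, String), Int] =
    tweets
      .groupBy(t => (windowStart(t.createdAtSec, sizeSec), t.topic))
      .map { case (key, group) => key -> group.size }
}
```

With a 5-second window, tweets at seconds 1 and 4 land in the window starting at 0, and a tweet at second 5 opens the next window.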

Page 7: Temporal operators for spark streaming

Hopping Windows

SELECT Topic, Count(*) AS TotalTweets
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, HoppingWindow(Duration(second, 10), Hop(second, 5))

“Every 5 seconds give me the count of tweets over the last 10 seconds”

[Figure: timeline from 0 to 30 seconds, axis: Time (secs), with events grouped into overlapping windows. Caption: A 10-second Hopping Window with a 5-second “Hop”]

Model scheduled overlapping time intervals
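Because hopping windows overlap, one event can belong to several windows. A minimal sketch of that assignment follows, assuming half-open windows [start, start + size) whose start times are multiples of the hop; the function name is hypothetical.

```scala
object HoppingWindowSketch {
  // Start times of every hopping window [start, start + size) that
  // contains the timestamp, newest window first.
  // Assumes 0 < hopSec <= sizeSec and window starts at multiples of hopSec.
  def windowStarts(tsSec: Long, sizeSec: Long, hopSec: Long): Seq[Long] = {
    val latest = java.lang.Math.floorDiv(tsSec, hopSec) * hopSec
    Iterator
      .iterate(latest)(_ - hopSec)
      .takeWhile(start => start > tsSec - sizeSec)
      .toList
  }
}
```

For a 10-second window with a 5-second hop, an event at second 12 falls into the windows starting at 10 and at 5. When the hop equals the size, the result is a single window, which is the tumbling case.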

Page 8: Temporal operators for spark streaming

Joining multiple streams

Twitter Stream: {“XO”, 4, “Win10”} {“Jo”, 0, “Surface”} {“Foo”, 4, “Bing”} {“Dip”, 2, “XBox”}

Twitter Stream (same stream, further down the timeline): {“XO”, 0, “Win10”} {“Dip”, 0, “Xbox”} {“Jo”, 4, “Surface”} {“Foo”, 0, “Bing”}

[Figure: events plotted along a time axis]

SELECT TS1.UserName, TS1.Topic
FROM TwitterStream TS1 TIMESTAMP BY CreatedAt
JOIN TwitterStream TS2 TIMESTAMP BY CreatedAt
ON TS1.UserName = TS2.UserName AND TS1.Topic = TS2.Topic
AND DateDiff(second, TS1, TS2) BETWEEN 1 AND 60
WHERE TS1.SentimentScore != TS2.SentimentScore

“List all users and the topics on which they switched their sentiment within a minute”
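The join condition can be written out as a small in-memory sketch. The Tweet type and function name are assumptions made for illustration, and the nested loop is for clarity only; a streaming engine would presumably key the state by user and topic and bound it by the 60-second range.

```scala
object TemporalJoinSketch {
  // Hypothetical event type mirroring the {user, sentiment, topic} tuples.
  final case class Tweet(user: String, sentiment: Int, topic: String, createdAtSec: Long)

  // Pairs (earlier, later) from the same stream with the same user and
  // topic, where the later tweet is 1 to 60 seconds after the earlier
  // one and the sentiment score differs.
  def sentimentSwitches(stream: Seq[Tweet]): Seq[(Tweet, Tweet)] =
    for {
      a <- stream
      b <- stream
      diff = b.createdAtSec - a.createdAtSec
      if a.user == b.user && a.topic == b.topic
      if diff >= 1 && diff <= 60
      if a.sentiment != b.sentiment
    } yield (a, b)
}
```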

Page 9: Temporal operators for spark streaming

Detecting absence of events

“Show me if a topic is not tweeted for 10 seconds since it was last tweeted”

SELECT TS1.CreatedAt, TS1.Topic, TS1.UserName
FROM TwitterStream TS1 TIMESTAMP BY CreatedAt
LEFT OUTER JOIN TwitterStream TS2 TIMESTAMP BY CreatedAt
ON TS1.Topic = TS2.Topic
AND DateDiff(second, TS1, TS2) BETWEEN 1 AND 10
WHERE TS2.Topic IS NULL

Twitter Stream: {“XO”, 4, “Win10”} {“WAA”, 2, “Microsoft”} {“AB”, 0, “Bing”} {“Dip”, 4, “Xbox”}

Twitter Stream (same stream, further down the timeline): {“Foo”, 0, “Win10”} {“Tim”, 2, “Microsoft”} {“AB”, 0, “Bing”}

[Figure: events plotted along a time axis]
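The left outer join with a NULL check amounts to keeping the events that have no match in the following 10 seconds. A minimal batch-style sketch, with a hypothetical Tweet type and function name:

```scala
object AbsenceSketch {
  // Hypothetical event type; only the fields used by the query.
  final case class Tweet(user: String, topic: String, createdAtSec: Long)

  // Tweets that are NOT followed by another tweet on the same topic
  // within the next 1 to 10 seconds (the rows where TS2 is NULL).
  def lastBeforeSilence(stream: Seq[Tweet]): Seq[Tweet] =
    stream.filterNot { a =>
      stream.exists { b =>
        val diff = b.createdAtSec - a.createdAtSec
        a.topic == b.topic && diff >= 1 && diff <= 10
      }
    }
}
```

In a streaming setting the answer for a given tweet is only known once time has advanced 10 seconds past it, which is why this kind of result has to wait for time to progress.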

Page 10: Temporal operators for spark streaming

Lag

“Compute the rate of growth per sensor”

SELECT sensorId, growth = reading - LAG(reading) OVER (PARTITION BY sensorId LIMIT DURATION(hour, 1))
FROM input TIMESTAMP BY Time

Sensor Reading: {“s2”, 70, 50} {“s3”, 71, 52} {“s1”, 72, 50} {“s1”, 72, 52}

[Figure: readings plotted along a time axis]
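A sketch of the Lag computation over a finite batch, assuming a hypothetical Reading type: for each reading it looks up the previous reading of the same sensor and yields no value when there is none within the duration limit.

```scala
object LagSketch {
  // Hypothetical event type: sensor id, reading value, event time in seconds.
  final case class Reading(sensorId: String, value: Double, timeSec: Long)

  // For each reading (in event-time order), growth is the difference to
  // the previous reading of the same sensor, or None when no previous
  // reading exists within limitSec.
  def growth(readings: Seq[Reading], limitSec: Long): Seq[(Reading, Option[Double])] = {
    val init = (Map.empty[String, Reading], Vector.empty[(Reading, Option[Double])])
    val (_, out) = readings.sortBy(_.timeSec).foldLeft(init) {
      case ((last, acc), r) =>
        val prev = last.get(r.sensorId).filter(p => r.timeSec - p.timeSec <= limitSec)
        (last.updated(r.sensorId, r), acc :+ (r -> prev.map(p => r.value - p.value)))
    }
    out
  }
}
```

The state kept per partition is just the last reading, which is what makes this operator cheap once events are in order.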

Page 11: Temporal operators for spark streaming

Data enrichment

“Select the users who are from US”

SELECT stream.userId, refdata.userCountry
FROM stream TIMESTAMP BY Time
JOIN refdata ON stream.userId = refdata.userId
WHERE refdata.country = 'US'
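A reference data join has no time condition, so it reduces to a lookup per event. A minimal sketch, assuming the reference data fits in an in-memory map from user id to country (the Event type and function name are hypothetical):

```scala
object EnrichmentSketch {
  // Hypothetical stream event.
  final case class Event(userId: String, timeSec: Long)

  // Inner join of the stream with reference data (userId -> country),
  // keeping only users from the requested country.
  def usersFrom(stream: Seq[Event], refdata: Map[String, String], country: String): Seq[(String, String)] =
    for {
      e <- stream
      c <- refdata.get(e.userId).toSeq // events without reference data are dropped
      if c == country
    } yield (e.userId, c)
}
```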

Page 12: Temporal operators for spark streaming

Design goals
• Use of Event Time
• Handling of out of order events
• Complete, correct, and repeatable

Page 13: Temporal operators for spark streaming

Implementation techniques
• In order processing (reorder then process)
• Out of order processing

Page 14: Temporal operators for spark streaming

Reorder then process
• Reorder using the high-water mark (HWM) of time from events
  • sensitive to partitioning
• Reorder using time from punctuation events
  • not sensitive to partitioning, but requires the client to generate the punctuation
• Challenges
  • Delayed result generation in some cases
  • Not optimal memory usage
  • Computation is not well amortized
  • Merge of logical partitions
  • Dry shard problem
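The first variant, reordering by a high-water mark taken from the events themselves, could look roughly like the sketch below. It is not the implementation from the talk: the Buffer type, the tolerance parameter, and the drop rule for events that arrive too late are assumptions chosen to keep the example small.

```scala
object ReorderSketch {
  final case class Event(timeSec: Long, payload: String)

  // hwm: largest event time seen so far; pending: events not yet released.
  final case class Buffer(hwm: Option[Long], pending: Vector[Event])

  // Feed one event; returns the new buffer and the events released by it.
  // Events are released, sorted by time, once they are at least
  // toleranceSec behind the HWM. Events older than that are dropped.
  def offer(buf: Buffer, e: Event, toleranceSec: Long): (Buffer, Vector[Event]) = {
    val tooLate = buf.hwm.exists(h => e.timeSec < h - toleranceSec)
    if (tooLate) (buf, Vector.empty)
    else {
      val hwm = buf.hwm.fold(e.timeSec)(h => math.max(h, e.timeSec))
      val (ready, waiting) = (buf.pending :+ e).partition(_.timeSec <= hwm - toleranceSec)
      (Buffer(Some(hwm), waiting), ready.sortBy(_.timeSec))
    }
  }

  // Run a whole sequence through the buffer and collect what is released.
  def run(events: Seq[Event], toleranceSec: Long): Vector[Event] =
    events.foldLeft((Buffer(None, Vector.empty), Vector.empty[Event])) {
      case ((buf, out), e) =>
        val (next, ready) = offer(buf, e, toleranceSec)
        (next, out ++ ready)
    }._2
}
```

Note that the newest events stay in the buffer until a later event advances the HWM. If a partition stops receiving events, nothing more is released, which is one way the delayed results and the dry shard problem listed above can show up.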

Page 15: Temporal operators for spark streaming

Out of order processing
• Results can be calculated and emitted right away
  • filter, projection, inner join
• Results are generated when a HWM is reached
  • Computation is done incrementally using a Reduce function (windowed aggregates)
• Computation is done when HWM is reached
  • NULL generation for left outer join
  • analytic function
  • session window
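For the windowed aggregate case, incremental computation with a reduce function can be sketched as follows. Events are folded into per-window partial state in arrival order, without sorting, and a window is emitted once the watermark passes its end. The names, the tumbling window shape, and the watermark rule (largest event time minus an allowed lateness) are assumptions for illustration.

```scala
object IncrementalAggregateSketch {
  // Partial aggregate for Sum and Count; merge is the reduce function.
  final case class Partial(sum: Double, count: Long) {
    def merge(o: Partial): Partial = Partial(sum + o.sum, count + o.count)
  }

  // watermark: assumed to be (largest event time seen) - latenessSec.
  // open: partial state per window start for windows not yet emitted.
  final case class State(watermark: Long, open: Map[Long, Partial])

  // Fold in one event and emit every window whose end is at or before
  // the new watermark. Events for already emitted windows are dropped.
  def step(s: State, timeSec: Long, value: Double, sizeSec: Long, latenessSec: Long): (State, Seq[(Long, Partial)]) = {
    val start = java.lang.Math.floorDiv(timeSec, sizeSec) * sizeSec
    if (start + sizeSec <= s.watermark) (s, Seq.empty)
    else {
      val one = Partial(value, 1L)
      val merged = s.open.get(start).fold(one)(_.merge(one))
      val wm = math.max(s.watermark, timeSec - latenessSec)
      val (closed, stillOpen) =
        s.open.updated(start, merged).partition { case (st, _) => st + sizeSec <= wm }
      (State(wm, stillOpen), closed.toSeq.sortBy(_._1))
    }
  }

  // Run (time, value) pairs in arrival order and collect emitted windows.
  def run(events: Seq[(Long, Double)], sizeSec: Long, latenessSec: Long): Seq[(Long, Partial)] = {
    val init = (State(Long.MinValue, Map.empty[Long, Partial]), Vector.empty[(Long, Partial)])
    events.foldLeft(init) { case ((s, out), (t, v)) =>
      val (next, closed) = step(s, t, v, sizeSec, latenessSec)
      (next, out ++ closed)
    }._2
  }
}
```

Memory here is proportional to the number of open windows rather than the number of buffered events, which is the main attraction compared with reordering first.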

Page 16: Temporal operators for spark streaming

Out of order processing
• Challenges
  • Implementation complexity
  • Need to fall back to in order processing in some cases
  • Not optimal memory usage in some cases
  • Still have dry shard problem