Temporal Operators for Spark Streaming
Zhong Chen, Microsoft
[Figure: monitoring pipeline on Spark Streaming. Stages: normalize, reorder, aggregate, monitor. Inputs: Raw Signals, Topology Data. Outputs: Alerts, Recovery Actions. Several stages are labelled "Temporal operators".]
val reordered = normalized.reorder(Policy.ReorderAdjustAndDrop(Seconds(60), Seconds(120)))
val aggregated = reordered.aggregate(
  TumblingWindow(5 minutes),
  groupBy(event.rack, event.site, event.region),
  Sum(event => event.isSuccess),
  Count(),
  Avg(event => event.latency))
val availabilityStats = aggregated.map{…}
val monitoringStats = availabilityStats.join(
  availabilityStats,
  left => JoinKey(left.topologyScopeValue),
  right => JoinKey(right.topologyScopeValue),
  (left, right) => left.availability < 0.99 && right.availability >= 0.99,
  TimeDiff(Seconds(-300), Seconds(0)))
Operators
• Filter, Projection
• Windowed Aggregates
  • Tumbling, Hopping, Session windows
• Temporal joins
  • Inner, left outer
• Temporal analytic functions
  • Lag, Last
• Reference data joins
Tumbling Windows
SELECT Topic, Count(*)
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, TumblingWindow(Duration(second, 5))
“Every 5 seconds give me the count of tweets by topic”
[Figure: time axis in seconds (0, 5, 10, 15, 20) showing a 5-second Tumbling Window]
A series of fixed-sized, non-overlapping and contiguous time intervals
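The definition above can be sketched in plain Scala collections, without Spark. This is only an illustration: the `Tweet` type, the `countByTopic` helper, and the choice of half-open windows `[start, start + size)` are assumptions, not part of the operators shown in the deck.

```scala
object TumblingWindowSketch {
  // Hypothetical event type: a tweet with an event-time timestamp in seconds.
  final case class Tweet(topic: String, createdAtSec: Long)

  // Each event falls into exactly one window [start, start + size).
  // The result maps (windowStart, topic) to the number of tweets.
  def countByTopic(events: Seq[Tweet], sizeSec: Long): Map[(Long, String), Int] =
    events
      .groupBy(e => (Math.floorDiv(e.createdAtSec, sizeSec) * sizeSec, e.topic))
      .map { case (key, group) => key -> group.size }

  def main(args: Array[String]): Unit = {
    val tweets = Seq(Tweet("Xbox", 1), Tweet("Xbox", 4), Tweet("Bing", 6), Tweet("Xbox", 7))
    countByTopic(tweets, 5).toSeq.sorted.foreach(println)
  }
}
```

Because the windows do not overlap, every tweet contributes to exactly one count.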
Hopping Windows
SELECT Topic, Count(*) AS TotalTweets
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, HoppingWindow(Duration(second, 10), Hop(second, 5))
“Every 5 seconds give me the count of tweets over the last 10 seconds”
[Figure: time axis in seconds (0 to 30) showing a 10-second Hopping Window with a 5-second “Hop”; consecutive windows overlap and share events]
Model scheduled overlapping time intervals
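A hopping window differs from a tumbling one in that a single event can belong to several windows. A minimal sketch, assuming half-open windows whose start times are multiples of the hop; `Tweet`, `windowStarts`, and `countByTopic` are hypothetical names:

```scala
object HoppingWindowSketch {
  // Hypothetical event type: a tweet with an event-time timestamp in seconds.
  final case class Tweet(topic: String, createdAtSec: Long)

  // Start times of every window [start, start + size) that contains t,
  // assuming window starts are aligned to multiples of the hop.
  def windowStarts(t: Long, sizeSec: Long, hopSec: Long): Seq[Long] = {
    val lastStart = Math.floorDiv(t, hopSec) * hopSec
    Iterator.iterate(lastStart)(_ - hopSec).takeWhile(_ > t - sizeSec).toList.reverse
  }

  // Counts tweets per (windowStart, topic); an event is counted once per window it falls in.
  def countByTopic(events: Seq[Tweet], sizeSec: Long, hopSec: Long): Map[(Long, String), Int] =
    events
      .flatMap(e => windowStarts(e.createdAtSec, sizeSec, hopSec).map(start => (start, e.topic)))
      .groupBy(identity)
      .map { case (key, group) => key -> group.size }
}
```

With a 10-second size and a 5-second hop, each event is counted in two windows, which is the overlap the figure illustrates.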
Joining multiple streams
Twitter Stream: {“XO”, 4, “Win10”} {“Jo”, 0, “Surface”} {“Foo”, 4, “Bing”} {“Dip”, 2, “XBox”}
Twitter Stream (same stream, further down the timeline): {“XO”, 0, “Win10”} {“Dip”, 0, “Xbox”} {“Jo”, 4, “Surface”} {“Foo”, 0, “Bing”}
SELECT TS1.UserName, TS1.Topic
FROM TwitterStream TS1 TIMESTAMP BY CreatedAt
JOIN TwitterStream TS2 TIMESTAMP BY CreatedAt
ON TS1.UserName = TS2.UserName AND TS1.Topic = TS2.Topic
AND DateDiff(second, TS1, TS2) BETWEEN 1 AND 60
WHERE TS1.SentimentScore != TS2.SentimentScore
“List all users and the topics on which they switched their sentiment within a minute”
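The self-join above can be read as a nested loop with a time bound. The sketch below works on a finite in-memory sequence, not on a stream; the `Tweet` fields mirror the sample events, and the `createdAtSec` timestamp and the `sentimentSwitches` helper are assumptions for illustration.

```scala
object TemporalJoinSketch {
  // Hypothetical event type modelled on the sample events: user, sentiment, topic, event time.
  final case class Tweet(userName: String, sentimentScore: Int, topic: String, createdAtSec: Long)

  // Pairs every tweet (ts1) with later tweets (ts2) by the same user on the same topic,
  // 1 to 60 seconds apart, and keeps the pairs whose sentiment differs.
  def sentimentSwitches(stream: Seq[Tweet]): Seq[(String, String)] =
    for {
      ts1 <- stream
      ts2 <- stream
      diff = ts2.createdAtSec - ts1.createdAtSec
      if ts1.userName == ts2.userName && ts1.topic == ts2.topic
      if diff >= 1 && diff <= 60
      if ts1.sentimentScore != ts2.sentimentScore
    } yield (ts1.userName, ts1.topic)
}
```

The nested loop is quadratic; a streaming implementation would only need to keep the last 60 seconds of events per join key.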
Detecting absence of events
“Show me if a topic is not tweeted for 10 seconds since it was last tweeted”
SELECT TS1.CreatedAt, TS1.Topic, TS1.UserName
FROM TwitterStream TS1 TIMESTAMP BY CreatedAt
LEFT OUTER JOIN TwitterStream TS2 TIMESTAMP BY CreatedAt
ON TS1.Topic = TS2.Topic
AND DateDiff(second, TS1, TS2) BETWEEN 1 AND 10
WHERE TS2.Topic IS NULL
Twitter Stream: {“XO”, 4, “Win10”} {“WAA”, 2, “Microsoft”} {“AB”, 0, “Bing”} {“Dip”, 4, “Xbox”}
Twitter Stream (same stream, further down the timeline): {“Foo”, 0, “Win10”} {“Tim”, 2, “Microsoft”} {“AB”, 0, “Bing”}
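The left outer join with a NULL check amounts to asking "does no matching later event exist?". A rough sketch over a finite sequence, with a hypothetical `Tweet` type and `notFollowedUp` helper:

```scala
object AbsenceSketch {
  // Hypothetical event type; createdAtSec is the event time in seconds.
  final case class Tweet(userName: String, topic: String, createdAtSec: Long)

  // Keeps the tweets whose topic is not tweeted again within the next withinSec seconds.
  def notFollowedUp(stream: Seq[Tweet], withinSec: Long): Seq[Tweet] =
    stream.filter { ts1 =>
      !stream.exists { ts2 =>
        val diff = ts2.createdAtSec - ts1.createdAtSec
        ts1.topic == ts2.topic && diff >= 1 && diff <= withinSec
      }
    }
}
```

On a live stream the answer for a given tweet is only final once event time has moved more than `withinSec` past it, so the result cannot be emitted right away. In this batch sketch the last tweet of each topic always qualifies.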
Lag
“Compute the rate of growth per sensor”
SELECT sensorId, growth = reading - LAG(reading) OVER (PARTITION BY sensorId LIMIT DURATION(hour, 1))
FROM input TIMESTAMP BY Time
Sensor Reading: {“s2”, 70, 50} {“s3”, 71, 52} {“s1”, 72, 50} {“s1”, 72, 52}
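The behaviour of LAG with a duration limit can be illustrated as follows. The `Reading` type and `growth` helper are assumptions; a missing previous reading is modelled as `None` where the query would yield NULL.

```scala
object LagSketch {
  // Hypothetical sensor reading with an event time in seconds.
  final case class Reading(sensorId: String, reading: Double, timeSec: Long)

  // For each reading (in event-time order), subtract the previous reading of the same
  // sensor, provided that previous reading is at most limitSec older.
  def growth(input: Seq[Reading], limitSec: Long): Seq[(String, Option[Double])] = {
    val ordered = input.sortBy(_.timeSec)
    val lastSeen = scala.collection.mutable.Map.empty[String, Reading]
    ordered.map { cur =>
      val prev = lastSeen.get(cur.sensorId).filter(p => cur.timeSec - p.timeSec <= limitSec)
      lastSeen(cur.sensorId) = cur
      (cur.sensorId, prev.map(p => cur.reading - p.reading))
    }
  }
}
```

Only one reading per sensor is retained, which is the state a partitioned LAG needs.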
Data enrichment
“Select the users who are from US”
SELECT stream.userId, refdata.userCountry
FROM stream TIMESTAMP BY Time
JOIN refdata ON stream.userId = refdata.userId
WHERE refdata.country = 'US'
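A reference data join has no time bound, so it reduces to a lookup. In this sketch the reference data is modelled as a static `Map` from user id to country, which is an assumption; `Event` and `usersFrom` are hypothetical names.

```scala
object EnrichmentSketch {
  // Hypothetical stream event.
  final case class Event(userId: String, timeSec: Long)

  // Inner join against a static lookup table (userId -> country), then filter by country.
  // Events without a reference entry are dropped.
  def usersFrom(stream: Seq[Event], refdata: Map[String, String], country: String): Seq[(String, String)] =
    stream.flatMap { e =>
      refdata.get(e.userId).filter(_ == country).map(c => (e.userId, c)).toList
    }
}
```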
Design goals
• Use of Event Time
• Handling of out of order events
• Complete, correct, and repeatable
Implementation techniques
• In order processing (reorder then process)
• Out of order processing
Reorder then process
• Reorder using the high-water mark (HWM) of time from events
  • sensitive to partitioning
• Reorder using time from punctuation events
  • not sensitive to partitioning, but requires the client to generate the punctuation
• Challenges
  • Delayed result generation in some cases
  • Not optimal memory usage
  • Computation is not well amortized
  • Merge of logical partitions
  • Dry shard problem
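The first variant, reordering by a high-water mark taken from the events themselves, might look roughly like this for a single partition. This is a sketch under stated assumptions, not the deck's implementation: `ReorderBuffer`, its `toleranceSec` parameter, and the choice to drop events that arrive too late are all hypothetical.

```scala
object ReorderSketch {
  // Hypothetical event with an event time in seconds.
  final case class Event(timeSec: Long, payload: String)

  // Buffers events and releases them in event-time order once the high-water mark
  // (largest timestamp seen so far) is at least toleranceSec ahead of them.
  final class ReorderBuffer(toleranceSec: Long) {
    private var hwm: Option[Long] = None
    // Min-heap on event time.
    private val pending =
      scala.collection.mutable.PriorityQueue.empty[Event](Ordering.by[Event, Long](_.timeSec).reverse)

    // Returns the events that became safe to emit, oldest first.
    def offer(e: Event): List[Event] = {
      val tooLate = hwm.exists(h => e.timeSec < h - toleranceSec)
      if (tooLate) Nil // assumption: late events are dropped in this sketch
      else {
        val h = hwm.fold(e.timeSec)(old => math.max(old, e.timeSec))
        hwm = Some(h)
        pending.enqueue(e)
        val out = List.newBuilder[Event]
        while (pending.nonEmpty && pending.head.timeSec <= h - toleranceSec) out += pending.dequeue()
        out.result()
      }
    }
  }
}
```

The sketch makes two of the listed challenges visible: nothing is released until a later event advances the HWM, so results are delayed, and a partition that stops receiving events never releases what it has buffered.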
Out of order processing
• Results can be calculated and emitted right away
  • filter, projection, inner join
• Results are generated when a HWM is reached
  • Computation is done incrementally using a Reduce function (windowed aggregates)
• Computation is done when HWM is reached
  • NULL generation for left outer join
  • analytic function
  • session window
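The second case, incremental computation with results emitted once the HWM is reached, can be sketched for a windowed sum. `WindowedSum` is a hypothetical class, and the HWM is assumed to be supplied by the caller; how it is derived is outside this sketch.

```scala
object IncrementalAggregateSketch {
  // Hypothetical event with an event time in seconds and a numeric value.
  final case class Event(timeSec: Long, value: Long)

  // Keeps one running sum per tumbling window; events may arrive in any order.
  final class WindowedSum(sizeSec: Long) {
    // windowStart -> running sum, ordered by window start.
    private val partial = scala.collection.mutable.TreeMap.empty[Long, Long]

    // Reduce step: fold the event into its window's partial result immediately.
    def add(e: Event): Unit = {
      val start = Math.floorDiv(e.timeSec, sizeSec) * sizeSec
      partial(start) = partial.getOrElse(start, 0L) + e.value
    }

    // Emits and discards every window that ends at or before the high-water mark.
    def advance(hwmSec: Long): List[(Long, Long)] = {
      val closed = partial.takeWhile { case (start, _) => start + sizeSec <= hwmSec }.toList
      closed.foreach { case (start, _) => partial.remove(start) }
      closed
    }
  }
}
```

No events are buffered: only one partial result per open window is kept, and the work is spread across arrivals instead of being done when the window closes.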
Out of order processing
• Challenges
  • Implementation complexity
  • Need to fall back to in order processing in some cases
  • Not optimal memory usage in some cases
  • Still have dry shard problem