Sailesh Krishnamurthy, Michael Franklin, Jeff Davis, Daniel Farina, Pasha Golovko, Alan Li, Neil Thombre. June 10, 2010, SIGMOD, Indianapolis. Continuous Analytics Over Discontinuous Streams. Founded in 2005; roots in TelegraphCQ project from UC Berkeley; HQ in Foster City, CA.
Continuous Analytics Over Discontinuous Streams
Sailesh Krishnamurthy, Michael Franklin, Jeff Davis, Daniel Farina, Pasha Golovko, Alan Li, Neil Thombre
June 10, 2010, SIGMOD, Indianapolis
• Founded in 2005
• Roots in TelegraphCQ project from UC Berkeley
• HQ in Foster City, CA
• Focus on “Continuous Analytics”
• Fortune 100 and web-based Big Data customers
Stream Query Processing (Traditional View)
[Figure labels: Source Data; Data Records / “Events”; CQ Processor; Real-Time Analysis; Update Display]
SQL Execution On Streaming Data
• A stream is an unbounded sequence of records
• A table is a set of records
• Window operators convert streams to tables
• SQL queries apply to tables
Window Operator
• Each window produces a set of records (a table)
• Semantics:
  • Repeatedly apply generic SQL to the results of window operators
  • Results are continuously appended to the output stream
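The semantics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the engine described in the talk: the names `tumbling_windows` and `run_query_per_window`, the table name `w`, the non-overlapping 5-second windows, and the use of SQLite as the "generic SQL" processor are all hypothetical choices made for the sketch.

```python
import sqlite3

def tumbling_windows(records, width):
    """Window operator (assumed non-overlapping): turn a stream of
    (ts, value) records into a list of tables, one per `width` seconds."""
    windows = {}
    for ts, value in records:
        start = (ts // width) * width  # assumed boundary rule: [start, start + width)
        windows.setdefault(start, []).append((ts, value))
    return [windows[start] for start in sorted(windows)]

def run_query_per_window(records, width, sql):
    """Apply the same SQL to every window's table and append the
    results to one output stream."""
    output_stream = []
    for table in tumbling_windows(records, width):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE w (ts INTEGER, value INTEGER)")
        conn.executemany("INSERT INTO w VALUES (?, ?)", table)
        output_stream.extend(conn.execute(sql).fetchall())
        conn.close()
    return output_stream

# Made-up input stream of (timestamp, value) records.
records = [(0, 10), (1, 20), (4, 5), (5, 7), (9, 3), (11, 1)]
results = run_query_per_window(records, 5, "SELECT COUNT(*), SUM(value) FROM w")
print(results)
```

Each window becomes an ordinary table, so the query itself needs no stream-specific syntax; only the window operator knows about time.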
Example: SQL Queries over Streams
SELECT I.Advertiser, SUM(I.price*I.volume)
FROM Impressions I <VISIBLE '5 sec' ADVANCE '3 sec'>, Campaigns C
WHERE I.campaign_id = C.campaign_id AND C.type = 'CPM'
GROUP BY I.Advertiser
VISIBLE '5 sec': “I want to look at 5 seconds’ worth of impressions”
ADVANCE '3 sec': “I want results every 3 seconds”
Every 3 seconds, compute the revenue by advertiser based on impression data, over a 5-second “sliding window”
[Figure labels: Impression Data Stream; Window; Result(s); Window Operator Clause]
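To make the VISIBLE/ADVANCE behaviour of the query above concrete, here is a small Python sketch. It assumes each window closes at a multiple of ADVANCE and covers the half-open interval (close - VISIBLE, close]; the slides do not state the exact boundary rule, and the function name `sliding_revenue` and the sample data are made up for illustration.

```python
from collections import defaultdict

def sliding_revenue(impressions, campaigns, visible, advance, end_time):
    """impressions: list of (ts, advertiser, campaign_id, price, volume).
    campaigns: dict campaign_id -> type.
    Returns one (close_time, {advertiser: revenue}) result per window."""
    results = []
    close = advance
    while close <= end_time:
        revenue = defaultdict(int)
        for ts, advertiser, campaign_id, price, volume in impressions:
            in_window = close - visible < ts <= close  # assumed boundary rule
            if in_window and campaigns.get(campaign_id) == "CPM":
                revenue[advertiser] += price * volume
        results.append((close, dict(revenue)))
        close += advance
    return results

# Made-up data: campaign 2 is not CPM, so its impressions are filtered out.
campaigns = {1: "CPM", 2: "CPC"}
impressions = [
    (1, "acme", 1, 2, 10),
    (2, "acme", 2, 1, 100),
    (4, "zeta", 1, 3, 5),
    (6, "acme", 1, 1, 4),
]
results = sliding_revenue(impressions, campaigns, visible=5, advance=3, end_time=6)
print(results)
```

Because VISIBLE is larger than ADVANCE, consecutive windows overlap, so an impression can contribute to more than one result.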
Assumptions About Streams
• Continuous sequences
• Arriving mostly in order
[Figure: a stream of numbered tuples arriving mostly in order]
The Reality
• Data arriving minutes, hours, or days late
• Multiple streams out of sync, with gaps, …
[Figure: numbered tuples arriving out of order, with gaps]
Traditional (in Order) Solution #1: “Slack”
Tuple # | Time Stamp | 3-Second Slack Buffer | OUTPUT
1       | 1          | 1                     |
2       | 2          | 1,2                   |
3       | 3          | 1,2,3                 |
4       | 2          | 1,2,2,3               |
5       | 6          | 6                     | 1,2,2,3
6       | 5          | 5,6                   |
7       | 1          | 5,6                   |
8       | 9          | 9                     | 5,6
9       | 8          | 8,9                   |
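The Slack example above can be replayed with a short Python sketch. The release rule used here (emit every buffered tuple whose timestamp is at most the newest timestamp minus the slack, and drop any arrival at or below the point already released) is inferred from the example, not quoted from the slides; `slack_reorder` is a hypothetical name.

```python
def slack_reorder(timestamps, slack):
    """Reorder a mostly-in-order stream using a slack buffer.
    Returns (released batches, dropped arrivals, tuples still buffered)."""
    buffer = []
    newest = float("-inf")
    released_upto = float("-inf")
    batches, dropped = [], []
    for ts in timestamps:
        if ts <= released_upto:
            # Arrived later than the slack allows: permanently dropped.
            dropped.append(ts)
            continue
        buffer.append(ts)
        buffer.sort()
        newest = max(newest, ts)
        released_upto = max(released_upto, newest - slack)
        ready = [t for t in buffer if t <= released_upto]
        if ready:
            batches.append(ready)
            buffer = [t for t in buffer if t > released_upto]
    return batches, dropped, buffer

# Time stamps in arrival order, as in the example.
batches, dropped, pending = slack_reorder([1, 2, 3, 2, 6, 5, 1, 9, 8], slack=3)
print(batches, dropped, pending)
```

The late tuple with time stamp 1 (tuple 7) is the one that gets dropped, and tuples 8 and 9 are still waiting in the buffer when the input ends.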
Slack
• Pros
  • Simple
  • Handles “jitter” (slightly out-of-order arrival)
• Cons
  • Introduces delay
  • Permanently drops arrivals later than buffer
  • Unbounded buffer size
  • Permanently drops arrivals if there are lulls in multiple input streams
Traditional (in Order) Solution #2: “Drift”
Source 1 | Source 2 | 2-Second Drift Buffer | OUTPUT
(A,1)    | (a,2)    |                       | (A,1)
(B,2)    | (b,3)    |                       | (a,2), (B,2)
(C,3)    | (c,4)    |                       | (b,3), (C,3)
(G,4)    | (d,5)    |                       | (c,4), (G,4)
(D,6)    |          |                       | (d,5)
(E,7)    |          | (D,6), (E,7)          |
(R,8)    |          | (E,7), (R,8)          | (D,6)
(F,9)    | (x,5)    | (R,8), (F,9)          | (E,7)
         | (z,10)   | (z,10)                | (R,8), (F,9)
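The Drift example above can be reproduced with the following Python sketch. The rule it uses is an inference from the example rather than a definition given in the slides: a tuple is released once every source has progressed to its timestamp, or once it is more than the drift behind the newest timestamp seen, whichever comes first. The name `drift_merge` and the source keys `S1` and `S2` are hypothetical.

```python
def drift_merge(rows, drift):
    """rows: list of steps, each a dict source -> (label, ts) of arrivals.
    Returns (tuples released at each step, dropped arrivals)."""
    progress = {}                      # latest timestamp seen per source
    buffer = []                        # pending (ts, arrival order, label)
    released_upto = float("-inf")
    outputs, dropped = [], []
    seq = 0
    for arrivals in rows:
        for source, (label, ts) in arrivals.items():
            if ts <= released_upto:
                # Arrived after the drift window expired: dropped.
                dropped.append((label, ts))
                continue
            progress[source] = max(progress.get(source, ts), ts)
            buffer.append((ts, seq, label))
            seq += 1
        newest = max(progress.values())
        slowest = min(progress.values())
        released_upto = max(released_upto, slowest, newest - drift)
        ready = sorted(e for e in buffer if e[0] <= released_upto)
        buffer = [e for e in buffer if e[0] > released_upto]
        outputs.append([(label, ts) for ts, _, label in ready])
    return outputs, dropped

rows = [
    {"S1": ("A", 1), "S2": ("a", 2)},
    {"S1": ("B", 2), "S2": ("b", 3)},
    {"S1": ("C", 3), "S2": ("c", 4)},
    {"S1": ("G", 4), "S2": ("d", 5)},
    {"S1": ("D", 6)},
    {"S1": ("E", 7)},
    {"S1": ("R", 8)},
    {"S1": ("F", 9), "S2": ("x", 5)},
    {"S2": ("z", 10)},
]
outputs, dropped = drift_merge(rows, drift=2)
print(outputs, dropped)
```

While Source 2 is silent, output only advances because of the 2-second drift bound; when (x,5) finally arrives, that bound has already passed 5, so the tuple is lost.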
Drift
• Pros
  • Simple
  • Handles multiple streams with short “lulls” in arrival
• Cons
  • Doesn’t handle streams with dramatically different arrival rates
  • Permanently drops data that arrives after drift window has expired
Traditional Solution #3: Order-agnostic Operators
• Slack and Drift aim to order streams before presenting them to order-sensitive operators
• Many operators don’t care about order
SELECT count(*), cq_close(*) ts
FROM S <slices '5 seconds'>
Out of Order Processing: Count Example
Tuple # | Time Stamp      | Count State | OUTPUT
1       | 1               | 1           |
2       | 3               | 2           |
3       | 2               | 3           |
4       | 4               | 4           |
5       | 5 (heart-beat)  |             | (4,t=5)
6       | 6               | 1           |
7       | 2               | 1           |
8       | 9               | 2           |
9       | 7               | 3           |
10      | 3               | 3           |
11      | 10 (heart-beat) |             | (3,t=10)
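A minimal Python sketch of the count example above, assuming that a heart-beat at time t closes every 5-second slice ending at or before t and that data for an already closed slice is discarded. The event encoding and the name `count_with_heartbeats` are hypothetical; the slice-end rule (timestamps 1 to 5 belong to the slice closing at 5) is inferred from the example.

```python
import math

def count_with_heartbeats(events, width=5):
    """events: list of ("data", ts) or ("heartbeat", ts) in arrival order.
    Returns (list of (count, slice_end) outputs, dropped timestamps)."""
    counts = {}        # slice_end -> running count; arrival order is irrelevant
    closed_upto = 0    # latest heart-beat seen
    output, dropped = [], []
    for kind, ts in events:
        if kind == "heartbeat":
            for end in sorted(e for e in counts if e <= ts):
                output.append((counts.pop(end), end))
            closed_upto = ts
        else:
            end = math.ceil(ts / width) * width
            if end <= closed_upto:
                dropped.append(ts)  # slice already closed: data is lost
            else:
                counts[end] = counts.get(end, 0) + 1
    return output, dropped

events = [
    ("data", 1), ("data", 3), ("data", 2), ("data", 4), ("heartbeat", 5),
    ("data", 6), ("data", 2), ("data", 9), ("data", 7), ("data", 3),
    ("heartbeat", 10),
]
output, dropped = count_with_heartbeats(events)
print(output, dropped)
```

Tuples 2 and 3 arrive out of order but before the first heart-beat, so they are counted without any buffering; tuples 7 and 10 arrive after it and are not.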
Order-agnostic Operators
• Pros
  • No buffering
  • No extra delays
  • Handles out-of-order tuples that make it before heart-beat
• Cons
  • Some operators do care about order
  • Permanently drops data that arrives after heartbeat
  • Note: Lost data also impacts bigger “roll up queries” e.g. <slices 15 seconds> with sharing
So, how to handle very late data and discontinuous streams?
“Stream-Relational” Architecture [CIDR 09]
[Figure labels: Integration Framework; Shared Stream Query Processor; Persistent Data Store; SQL Interface; Raw Data; Aggregates; JDBC / JMS; XML; Flat files; ETL tools; SOAP APIs; Data Warehouse; App Logic / UDFs; Other TruCQ’s]
Order-Independent Processing: Overview
• Answers that have already been delivered can only be compensated
• Need to preserve all arriving data
• Queries return answers based on all relevant data that has arrived:
  • CQ’s: Continuous Queries
  • SQ’s: SQL queries on archived streams & answers
• Approach: Leverage benefits of SQL(!):
  • Data-Parallel processing w/on-demand consolidation
  • Powerful “View” mechanisms
• Basically, create parallel partitions for late data
• Rewrite queries as views over partial results
Out of Order Processing: Count Example
Tuple # | Data TS | Control | Count State Partitions | OUTPUT
1       | 1       |         | 1                      |
2       | 3       |         | 2                      |
3       | 2       |         | 3                      |
4       | 4       |         | 4                      |
5       | 2       |         | 5                      |
6       | 1       |         | 6                      |
7       |         | 5       |                        | (6,t=5)
8       | 6       |         | 1                      |
9       | 2       |         | 1 1                    |
10      | 9       |         | 2 1                    |
11      | 7       |         | 3 1                    |
Out of Order Processing: Count Example
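OUTPUT so far: (6,t=5)
Tuple # | Data TS | Control | Count State Partitions | OUTPUT
11      | 7       |         | 3 1                    |
12      | 3       |         | 3 2                    |
13      |         | 10      | 2                      | (3,t=10)
14      | 12      |         | 1 2                    |
15      | 8       |         | 1 1                    | (2,t=5)
16      | 4       |         | 1 1                    |
17      | 3       |         | 1 2                    |
18      | 9       |         | 2 2                    |
19      |         | 15      | 2 2                    | (1,t=15)
20      |         | flush-2 | 2                      | (2,t=10)
21      |         | flush-3 |                        | (2,t=5)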
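The count example above can be replayed with a Python sketch of late-data partitions. The cascade rule below is inferred from the example and is not taken from the paper: partition 1 handles in-order data; a tuple that is late for a partition moves on to the next one; a late partition that receives a tuple for a newer slice first emits its partial count and then moves forward. The event encoding and the name `process` are hypothetical.

```python
import math

def process(events, width=5):
    """events: ("data", ts), ("heartbeat", ts) or ("flush", partition_number).
    Returns the list of (count, slice_end) partial results in output order."""
    output = []
    closed_upto = 0   # latest heart-beat seen by partition 1
    main = {}         # partition 1: slice_end -> count
    late = []         # partitions 2, 3, ...: [slice_end, count] or None
    for kind, value in events:
        if kind == "heartbeat":
            for end in sorted(e for e in main if e <= value):
                output.append((main.pop(end), end))
            closed_upto = value
        elif kind == "flush":
            idx = value - 2
            if idx < len(late) and late[idx] is not None:
                end, count = late[idx]
                output.append((count, end))
                late[idx] = None
        else:
            end = math.ceil(value / width) * width
            if end > closed_upto:
                main[end] = main.get(end, 0) + 1
                continue
            # Late data: cascade through the late partitions.
            for i in range(len(late) + 1):
                if i == len(late):
                    late.append(None)
                part = late[i]
                if part is None:
                    late[i] = [end, 1]
                    break
                if part[0] == end:
                    part[1] += 1
                    break
                if part[0] < end:
                    output.append((part[1], part[0]))  # emit partial count
                    late[i] = [end, 1]
                    break
                # part[0] > end: too late for this partition, try the next
    return output

events = (
    [("data", t) for t in (1, 3, 2, 4, 2, 1)] + [("heartbeat", 5)]
    + [("data", t) for t in (6, 2, 9, 7, 3)] + [("heartbeat", 10)]
    + [("data", t) for t in (12, 8, 4, 3, 9)] + [("heartbeat", 15)]
    + [("flush", 2), ("flush", 3)]
)
output = process(events)
print(output)
```

No arriving tuple is discarded: every tuple ends up in exactly one partial count, and partial counts for the same slice are added up later.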
Out of Order Processing: Count Example
OUTPUT: (6,t=5) (3,t=10) (2,t=5) (1,t=15) (2,t=10) (2,t=5)
• Treat output as “Partial State Records” (PSRs)
• Rewrite queries using views over PSRs
  • i.e., consolidate On-Demand
  • Paper goes into substantial detail on how rewrites work
• <Slices 5 second>
  • Same answer as Order-Insensitive
• <Slices 15 second> as roll-up
  • Answer contains all data
• Subsequent SQs over archived results and raw data contain all data too!
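On-demand consolidation over the PSRs listed above can be illustrated with ordinary SQL views. The sketch below uses SQLite from Python purely as a stand-in; the table name `psr`, the view names `slices_5` and `slices_15`, and the column names are hypothetical, and the actual rewrites are described in the paper, not here.

```python
import sqlite3

# Partial State Records from the example: (count, slice end time).
psrs = [(6, 5), (3, 10), (2, 5), (1, 15), (2, 10), (2, 5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE psr (cnt INTEGER, t INTEGER)")
conn.executemany("INSERT INTO psr VALUES (?, ?)", psrs)

# 5-second slices: add up all partial counts that share a slice end time.
conn.execute(
    "CREATE VIEW slices_5 AS "
    "SELECT SUM(cnt) AS total, t FROM psr GROUP BY t"
)
# 15-second roll-up: map each 5-second slice end to its 15-second slice end
# (integer division), then add up the partial counts.
conn.execute(
    "CREATE VIEW slices_15 AS "
    "SELECT SUM(cnt) AS total, ((t + 14) / 15) * 15 AS t15 "
    "FROM psr GROUP BY t15"
)

five = conn.execute("SELECT total, t FROM slices_5 ORDER BY t").fetchall()
fifteen = conn.execute("SELECT total, t15 FROM slices_15 ORDER BY t15").fetchall()
print(five, fifteen)
conn.close()
```

Because consolidation happens only when a view is queried, late PSRs that arrive after an earlier read are picked up by the next read with no change to the stored records.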
Handles Very Late Data, Plus You Get…
• Parallel Processing – Multicore and Cluster
[Figure labels: Client (×4); U (×2); D (×5); High-bandwidth Network Interconnect]
D = Distributed Processing Node
U = Unified Processing Node
Other Details in the Paper
• Beyond late data and parallelism, the approach is also key to supporting:
  • Fault Tolerance using replication
  • High-Availability via fast restart
  • “Nostalgic” continuous queries that start in the past and catch up to the present
  • Fast concurrent creation of archives for new CQs
• Algorithmic/Systems details on
  • Integration with overall system architecture
  • Interaction with Transaction Mechanism
  • Need for Background Reducer task
  • Hybrid Plans for non-parallelizable parts of queries
Conclusions
• Early Stream Processing Systems were based on simplistic assumptions about ordering
• Truviso’s 3.2 engine incorporates a new mechanism so no data is permanently dropped
• Approach leverages strengths of SQL
  • Data-parallel processing models
  • Sophisticated and efficient view functionality
• Key is On-Demand Consolidation
  • Of course, you can only do it if you have an integrated stream-relational system
For more info: [email protected] or [email protected]