Continuous Analytics Over Discontinuous Streams


Sailesh Krishnamurthy, Michael Franklin,

Jeff Davis, Daniel Farina, Pasha Golovko, Alan Li, Neil Thombre

June 10, 2010, SIGMOD, Indianapolis

• Founded in 2005
• Roots in TelegraphCQ project from UC Berkeley
• HQ in Foster City, CA
• Focus on “Continuous Analytics”
• Fortune 100 and web-based Big Data customers


Stream Query Processing (Traditional View)

[Figure: source data, data records / “events”, CQ processor, real-time analysis, update display]


SQL Execution On Streaming Data

• A stream is an unbounded sequence of records
• A table is a set of records
• Window operators convert streams to tables
• SQL queries apply to tables

Window Operator

• Each window produces a set of records (a table)
• Semantics:
  • Repeatedly apply generic SQL to the results of window operators
  • Results are continuously appended to the output stream


Example: SQL Queries over Streams

SELECT I.Advertiser, SUM(I.price*I.volume)
FROM Impressions I <VISIBLE '5 sec' ADVANCE '3 sec'>, Campaigns C
WHERE I.campaign_id = C.campaign_id AND C.type = 'CPM'
GROUP BY I.Advertiser

“I want to look at 5 seconds worth of impressions”

“I want results every 3 seconds”

Every 3 seconds, compute the revenue by advertiser based on impression data, over a 5 second “sliding window”

[Figure: the impression data stream divided into windows, each window producing result(s); callout label: “Window Operator Clause”]
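The window semantics in this example can be illustrated with a small Python sketch. It is not the engine's implementation: the function name `sliding_revenue`, the tuple layout of an impression, and the choice that a window closing at time t covers timestamps in (t - 5, t] are all assumptions made for illustration.

```python
from collections import defaultdict

def sliding_revenue(impressions, campaigns, visible=5, advance=3):
    """Return [(window_end, {advertiser: revenue}), ...], one entry per ADVANCE step.

    impressions: list of (ts, advertiser, campaign_id, price, volume) tuples
    campaigns:   dict mapping campaign_id -> campaign type
    Assumption: the window that closes at time t covers timestamps in (t - visible, t].
    """
    if not impressions:
        return []
    last_ts = max(row[0] for row in impressions)
    results = []
    t = advance
    while True:
        revenue = defaultdict(float)
        for ts, advertiser, campaign_id, price, volume in impressions:
            in_window = t - visible < ts <= t
            # Join against Campaigns and keep only CPM campaigns (the WHERE clause).
            if in_window and campaigns.get(campaign_id) == "CPM":
                revenue[advertiser] += price * volume  # SUM(...) GROUP BY advertiser
        results.append((t, dict(revenue)))
        if t >= last_ts:
            break
        t += advance
    return results
```

Each returned entry plays the role of one result table appended to the output stream. Because the window is 5 seconds wide but advances by 3, an impression can contribute to two consecutive results.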

Assumptions About Streams


• Continuous sequences
• Arriving mostly in order

[Figure: a sequence of numbered tuples arriving slightly out of order]

The Reality

• Data arriving minutes, hours, or days late
• Multiple streams out of sync, with gaps, …

[Figure: several streams of numbered tuples arriving far out of order, with gaps and missing values]

Traditional (in Order) Solution #1: “Slack”


Tuple #   Time Stamp   3-Second Slack Buffer   OUTPUT
1         1            1                       -
2         2            1,2                     -
3         3            1,2,3                   -
4         2            1,2,2,3                 -
5         6            6                       1,2,2,3
6         5            5,6                     -
7         1            5,6                     -
8         9            9                       5,6
9         8            8,9                     -
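The slack example can be replayed with a short Python sketch. The function name `slack_reorder` and the exact boundary rule (a tuple is released or dropped once its timestamp is at least `slack` seconds behind the newest timestamp seen) are assumptions chosen to match the example, not a description of any particular engine.

```python
import heapq

def slack_reorder(timestamps, slack=3):
    """Replay a stream of timestamps (in arrival order) through a slack buffer.

    Returns (emitted, dropped, still_buffered).
    """
    buffer = []               # min-heap of buffered timestamps
    emitted, dropped = [], []
    max_ts = None             # newest timestamp seen so far
    for ts in timestamps:
        if max_ts is not None and ts <= max_ts - slack:
            dropped.append(ts)          # arrived later than the buffer allows
            continue
        heapq.heappush(buffer, ts)
        max_ts = ts if max_ts is None else max(max_ts, ts)
        # Release, in timestamp order, everything that is now `slack` seconds old.
        while buffer and buffer[0] <= max_ts - slack:
            emitted.append(heapq.heappop(buffer))
    return emitted, dropped, sorted(buffer)
```

Feeding it the arrival order 1, 2, 3, 2, 6, 5, 1, 9, 8 emits 1, 2, 2, 3, 5, 6, leaves 8 and 9 in the buffer, and drops the tuple with timestamp 1 that arrives seventh. Output is delayed until a sufficiently newer tuple shows up, which is the "introduces delay" cost of slack.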

Slack


• Pros
  • Simple
  • Handles “jitter” (slightly out of order arrival)
• Cons
  • Introduces delay
  • Permanently drops arrivals later than buffer
  • Unbounded buffer size
  • Permanently drops arrivals if lulls in multiple input streams

Traditional (in Order) Solution #2: “Drift”


Source 1   Source 2   2-Second Drift Buffer   OUTPUT
(A,1)      (a,2)      -                       (A,1)
(B,2)      (b,3)      -                       (a,2), (B,2)
(C,3)      (c,4)      -                       (b,3), (C,3)
(G,4)      (d,5)      -                       (c,4), (G,4)
(D,6)      -          -                       (d,5)
(E,7)      -          (D,6), (E,7)            -
(R,8)      -          (E,7), (R,8)            (D,6)
(F,9)      (x,5)      (R,8), (F,9)            (E,7)
-          (z,10)     (z,10)                  (R,8), (F,9)

Drift


• Pros
  • Simple
  • Handles multiple streams with short “lulls” in arrival
• Cons
  • Doesn’t handle streams with dramatically different arrival rates
  • Permanently drops data that arrives after drift window has expired

Traditional Solution #3: Order-agnostic Operators


• Slack and Drift aim to order streams before presenting them to order-sensitive operators

• Many operators don’t care about order

SELECT count(*), cq_close(*) ts
FROM S <slices '5 seconds'>

Out of Order Processing: Count Example


Tuple #   Time Stamp        Count State   OUTPUT
1         1                 1             -
2         3                 2             -
3         2                 3             -
4         4                 4             -
5         5 (Heart-Beat)    -             (4,t=5)
6         6                 1             -
7         2                 1             -
8         9                 2             -
9         7                 3             -
10        3                 3             -
11        10 (Heart-Beat)   -             (3,t=10)
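The count example can be reproduced with the minimal Python sketch below. The event encoding (`"data"` and `"heartbeat"` tuples) and the function name `count_per_slice` are illustrative assumptions. The sketch also assumes, as in the example, that no tuple for a later slice arrives before the heart-beat that closes the current one.

```python
def count_per_slice(events):
    """Order-agnostic COUNT over slices closed by heart-beats.

    events: list of ("data", ts) or ("heartbeat", t) in arrival order.
    Returns (output, dropped), where output holds (count, t) pairs.
    """
    count = 0
    closed_up_to = 0          # timestamp of the most recent heart-beat
    output, dropped = [], []
    for kind, ts in events:
        if kind == "heartbeat":
            output.append((count, ts))   # emit the slice result, like (4,t=5)
            count = 0
            closed_up_to = ts
        elif ts <= closed_up_to:
            dropped.append(ts)           # slice already closed: data is lost
        else:
            count += 1                   # arrival order inside the slice is irrelevant
    return output, dropped
```

No buffering or reordering is needed because COUNT does not depend on order within a slice. The two tuples that arrive after their slice was closed (timestamps 2 and 3) never reach any result.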

Order-agnostic Operators


• Pros
  • No buffering
  • No extra delays
  • Handles out-of-order tuples that make it before heart-beat
• Cons
  • Some operators do care about order
  • Permanently drops data that arrives after heartbeat
  • Note: Lost data also impacts bigger “roll up queries” e.g. <slices 15 seconds> with sharing

So, how to handle very late data and discontinuous streams?


“Stream-Relational” Architecture [CIDR 09]

[Figure: architecture diagram with the following labels: Integration Framework; Shared Stream Query Processor; Persistent Data Store; SQL Interface; Raw Data; Aggregates; JDBC / JMS; XML; Flat files; ETL tools; SOAP APIs; Data Warehouse; App Logic / UDFs; Other TruCQ’s]


Order-Independent Processing: Overview

• Answers that have already been delivered can only be compensated
• Need to preserve all arriving data
• Queries return answers based on all relevant data that has arrived:
  • CQ’s: Continuous Queries
  • SQ’s: SQL queries on archived streams & answers
• Approach: Leverage benefits of SQL(!):
  • Data-Parallel processing w/on-demand consolidation
  • Powerful “View” mechanisms
• Basically, create parallel partitions for late data
• Rewrite queries as views over partial results



Tuple #   Data TS   Control   Count State Partitions   OUTPUT
1         1         -         1                        -
2         3         -         2                        -
3         2         -         3                        -
4         4         -         4                        -
5         2         -         5                        -
6         1         -         6                        -
7         -         5         -                        (6,t=5)
8         6         -         1                        -
9         2         -         1 1                      -
10        9         -         2 1                      -
11        7         -         3 1                      -



OUTPUT so far: (6,t=5)

Tuple #   Data TS   Control   Count State Partitions   OUTPUT
11        7         -         3 1                      -
12        3         -         3 2                      -
13        -         10        2                        (3,t=10)
14        12        -         1 2                      -
15        8         -         1 1                      (2,t=5)
16        4         -         1 1                      -
17        3         -         1 2                      -
18        9         -         2 2                      -
19        -         15        2 2                      (1,t=15)
20        -         flush-2   2                        (2,t=10)
21        -         flush-3   -                        (2,t=5)
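One way to read the worked example is that every partition is itself an in-order counter, and a tuple goes to the first partition for which it is not late. The Python sketch below is a reconstruction under that reading, not the algorithm from the paper. The event encoding, the function names, the single `"flush"` event standing in for flush-2 and flush-3, and the rule that a timestamp belongs to the slice ending at the next multiple of 5 are all assumptions.

```python
import math

SLICE = 5  # seconds, as in <slices '5 seconds'>

def window_end(ts):
    # Assumption: a tuple with timestamp ts belongs to the slice (end - SLICE, end].
    return math.ceil(ts / SLICE) * SLICE

def run(events):
    """Replay ("data", ts), ("heartbeat", t) and ("flush", None) events.

    Returns the partial state records (count, window_end) in emission order.
    """
    # Each partition is [current window end, count]; parts[0] is the in-order one.
    parts = [[SLICE, 0]]
    psrs = []
    for kind, ts in events:
        if kind == "heartbeat":
            end, count = parts[0]
            if count:
                psrs.append((count, end))
            parts[0] = [ts + SLICE, 0]       # in-order partition moves to the next slice
        elif kind == "flush":
            for end, count in parts[1:]:     # emit whatever the late partitions hold
                psrs.append((count, end))
            del parts[1:]
        else:
            w = window_end(ts)
            for i, p in enumerate(parts):
                if w == p[0]:                # tuple fits this partition's open slice
                    p[1] += 1
                    break
                if i > 0 and w > p[0]:       # late partition advances: emit its PSR
                    psrs.append((p[1], p[0]))
                    p[0], p[1] = w, 1
                    break
            else:
                parts.append([w, 1])         # late everywhere: open a new partition
    return psrs
```

Replaying the 21 events of the example with this sketch yields the same six records in the same order as the OUTPUT column, and every arriving tuple is counted in exactly one of them.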



OUTPUT: (6,t=5) (3,t=10) (2,t=5) (1,t=15) (2,t=10) (2,t=5)

• Treat output as “Partial State Records” (PSRs)
• Rewrite queries using views over PSRs
  • i.e., consolidate On-Demand
  • Paper goes into substantial detail on how rewrites work
• <Slices 5 second>
  • Same answer as Order-Insensitive
• <Slices 15 second> as roll-up
  • Answer contains all data
• Subsequent SQs over archived results and raw data contain all data too!
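On-demand consolidation of the partial state records listed on this slide amounts to summing the partial counts per window. In the system described here that is expressed as a SQL view over the PSRs; the Python function below is only a stand-in to show the arithmetic, and its name and `width` parameter are assumptions.

```python
from collections import defaultdict

def consolidate(psrs, width=5):
    """Sum partial counts per window of `width` seconds.

    psrs: list of (count, window_end) records produced for 5-second slices.
    A PSR is assigned to the coarser window whose end is the next multiple
    of `width` at or after the PSR's own window end.
    """
    totals = defaultdict(int)
    for count, end in psrs:
        coarse_end = -(-end // width) * width   # ceiling to a multiple of width
        totals[coarse_end] += count
    return dict(sorted(totals.items()))

# The six partial state records from the example output.
PSRS = [(6, 5), (3, 10), (2, 5), (1, 15), (2, 10), (2, 5)]
```

With the records above, `consolidate(PSRS)` gives 10 tuples for t=5, 5 for t=10 and 1 for t=15, and `consolidate(PSRS, width=15)` gives a single roll-up count of 16 for t=15. Both answers include the late tuples, because their counts were kept as separate partial records rather than dropped.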

Handles Very Late Data, Plus You Get…


• Parallel Processing – Multicore and Cluster

[Figure: cluster diagram showing four clients, a high-bandwidth network interconnect, two U nodes and five D nodes. D = Distributed Processing Node, U = Unified Processing Node]

Other Details in the Paper


• Beyond late data and parallelism, the approach is also key to supporting:
  • Fault Tolerance using replication
  • High-Availability via fast restart
  • “Nostalgic” continuous queries that start in the past and catch up to the present
  • Fast concurrent creation of archives for new CQs
• Algorithmic/Systems details on
  • Integration with overall system architecture
  • Interaction with Transaction Mechanism
  • Need for Background Reducer task
  • Hybrid Plans for non-parallelizable parts of queries

Conclusions


• Early Stream Processing Systems were based on simplistic assumptions about ordering

• Truviso’s 3.2 engine incorporates a new mechanism so no data is permanently dropped

• Approach leverages strengths of SQL
  • Data-parallel processing models
  • Sophisticated and efficient view functionality
• Key is On-Demand Consolidation
  • Of course, you can only do it if you have an integrated stream-relational system

For more info: [email protected] or [email protected]
