Aurora Group 19 : Chu Xuân Tình Trần Nhật Tuấn Huỳnh Thái Tâm Lec: Associate Professor Dr.techn. Dang Tran Khanh A new model and architecture for data stream management


Aurora

Group 19 :

Chu Xuân Tình

Trần Nhật Tuấn

Huỳnh Thái Tâm

Lec:

Associate Professor Dr.techn. Dang Tran Khanh

A new model and architecture for data stream management

Outline

Introduction

The Aurora stream query algebra

Run-time architecture


Aurora-system architecture

Aurora: a new model and architecture for data stream management, a new system to manage data streams for monitoring applications.

The fact that a software system must process and react to continual inputs from many sources (e.g., sensors) rather than from human operators requires

Aurora - a new DBMS currently under construction at Brandeis University, Brown University, and M.I.T.


Currently used DB systems

Classical DBMS:
• Passive repository storing data (HADP: human-active, DBMS-passive model)
• Only the current state of the data is important
• Data is synchronized; queries have exact answers (no support for approximation)

Monitoring applications are difficult to implement in a traditional DBMS:
• The basic computation model is wrong: DBMSs have a HADP model, while monitoring applications often require a DAHP model
• Triggers and alerters are second-class citizens
• Required data is hard to obtain from historical time series
• Developing dedicated middleware is expensive

Conclusion: these systems are ill suited to applications that alert humans when an abnormal situation occurs (these expect a DAHP model: DBMS-active, human-passive)


Aurora – main assumptions

Data comes from various, uniquely identified data sources (data streams)
Each incoming tuple is timestamped
Aurora is expected to process incoming streams
Tuples are transferred through a loop-free, directed graph
Outputs from the system are presented to applications
Aurora maintains historical storage


Aurora system overview

Any box can filter a stream (select operation)

A box can compute stream aggregates by applying an aggregate function across a window of values in the stream

The output of any box can be an input for several other boxes (split operation)

Each box can gather tuples from many inputs (union operation)


Aurora query model

[Figure: Aurora query network. Labels: boxes b1 to b7; connection points with Storage S1, S2 and S3; a persistence annotation "Keep 2 hr"; three query paths (continuous query, view, ad-hoc query) leading to applications (Appl); a QoS spec on each output]

Each connection point (CP) and view should have a persistence specification (e.g. "keep data for 2 hr")

Each output is associated with a QoS specification (this helps to allocate processing elements along the path)


Queries in Aurora

Continuous queries
• The query continuously processes tuples
• Output tuples are delivered to an application

Ad-hoc queries
• The system processes data and delivers answers starting from the earliest time stored in the connection point
• The semantics are the same as for a continuous query that started execution at tnow – (persistence specification)
• The query continues until it is explicitly terminated

Views
• Similar to materialized or partially materialized views in classical DB systems
• An application may connect to the end of this path whenever there is a need


Queries in Aurora

Connection points
• Support dynamic modification of the network
• Support data caching (persistence specification), which is helpful for ad-hoc queries
• A connection point with no upstream node can be used as a stored data set (as in a classical DBMS)
• Tuples from a connection point can be pushed through the system (e.g. when the connection point is "materialized" and the stored tuples are passed as a stream to the downstream nodes)
• Alternatively, a downstream node can pull the data (helpful in the execution of filtering or joining operations)


Application Domains

Online Auctions
Network Traffic Management
Habitat Monitoring
Military Logistics
Immersive Environments
Road Traffic Monitoring
System Monitoring


SQuAl

The Aurora [S]tream [Qu]ery [Al]gebra

7 operators:
• Order-agnostic (Filter, Map, Union)
• Order-sensitive (BSort, Aggregate, Join, Resample)

Model:
• A stream is an append-only sequence of tuples with uniform type
• A stream type has the form (TS, A1,…, An)
• Stream tuples have the form (ts, v1,…, vn)
• Ai: application-specific data fields
• ts: timestamp


Order-agnostic operators

Input tuples have the form:
t = (TS = ts, A1 = v1,…, Ak = vk)

3 operators:

Filter:
• similar to relational selection
• filters on multiple predicates
• routes tuples according to which predicates they satisfy

Map:
• similar to relational projection
• applies arbitrary functions to tuples (including user-defined functions)

Union:
• merges 2 or more streams of common schema


Filter

Acts much like a case statement
Can be used to route input tuples to alternative streams
Form:
Filter(P1,…,Pm)(S)
• Pi: predicates over tuples on the input stream S
Its output consists of m + 1 streams (one per predicate, plus one for tuples that satisfy none of them)
Output tuples have the same schema and values as input tuples, including their QoS timestamp
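The routing behaviour of Filter can be sketched in Python. This is only an illustration, not Aurora's implementation: the name `filter_box`, the dict-based tuple representation and the sample data are assumptions made for the example, and the sketch assumes that a tuple goes to the stream of the first predicate it satisfies, as in a case statement.

```python
from typing import Callable, Dict, List

# Hypothetical tuple representation: field name -> value (including "TS").
StreamTuple = Dict[str, object]


def filter_box(predicates: List[Callable[[StreamTuple], bool]],
               stream: List[StreamTuple]) -> List[List[StreamTuple]]:
    """Route each tuple to the output of the first predicate it satisfies.

    Returns m + 1 output streams for m predicates; the last one collects
    the tuples that satisfy no predicate. Tuples pass through unchanged.
    """
    outputs: List[List[StreamTuple]] = [[] for _ in range(len(predicates) + 1)]
    for t in stream:
        for i, p in enumerate(predicates):
            if p(t):
                outputs[i].append(t)
                break
        else:  # no predicate matched
            outputs[-1].append(t)
    return outputs


# Illustrative data: route temperature readings to "hot", "cold" or "rest".
readings = [
    {"TS": 1, "temp": 35},
    {"TS": 2, "temp": 12},
    {"TS": 3, "temp": -4},
]
hot, cold, rest = filter_box(
    [lambda t: t["temp"] > 30, lambda t: t["temp"] < 0], readings
)
```

With two predicates the call yields three streams, and every input tuple appears in exactly one of them.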


Map

Is a generalized projection operator
Form:
Map(B1 = F1,…, Bm = Fm)(S)
• Bi: name of an attribute
• Fi: function over tuples on the input stream S
The output tuple for each input tuple t has the form:
(TS = t.TS, B1 = F1(t),…, Bm = Fm(t))
The resulting stream can have a different schema than the input stream, but the timestamps of input tuples are preserved in the corresponding output tuples
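A small Python sketch of the Map form above, under the assumption that tuples are dicts with a "TS" field. The name `map_box` and the temperature example are hypothetical and only illustrate how each output attribute Bi is computed by its function Fi while the timestamp is carried over.

```python
from typing import Callable, Dict, List

# Hypothetical tuple representation: field name -> value (including "TS").
StreamTuple = Dict[str, object]


def map_box(functions: Dict[str, Callable[[StreamTuple], object]],
            stream: List[StreamTuple]) -> List[StreamTuple]:
    """Apply Map(B1 = F1, ..., Bm = Fm) to every tuple of the stream.

    `functions` maps each output attribute name Bi to its function Fi.
    The timestamp of the input tuple is preserved in the output tuple.
    """
    out: List[StreamTuple] = []
    for t in stream:
        result: StreamTuple = {"TS": t["TS"]}
        for name, f in functions.items():
            result[name] = f(t)
        out.append(result)
    return out


# Illustrative data: the output schema differs from the input schema.
readings = [{"TS": 1, "celsius": 20.0}, {"TS": 2, "celsius": 25.0}]
converted = map_box(
    {"fahrenheit": lambda t: t["celsius"] * 9 / 5 + 32}, readings
)
```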


Union

Is used to merge 2 or more streams into a single output stream
Form:
Union(S1,…,Sn)
• Si: streams with a common schema
Union can output tuples in any order
Output tuples have the same schema and values as input tuples, including their QoS timestamps


Order-sensitive operators

Require order specification arguments
Order specification: describes the tuple arrival order the operator expects
Order specifications have the form:
Order(On A, Slack n, GroupBy B1,…,Bm)
• A, Bi: attributes
• n: non-negative integer

4 operators:

BSort:
• is an approximate sort operator with semantics equivalent to a bounded-pass bubble sort

Aggregate:
• applies a window function to sliding windows over its input stream

Join:
• is a binary operator that resembles a band join
• is applied to infinite streams

Resample:
• is an interpolation operator used to align streams


BSort

Is an approximate sort operator
Form:
BSort(Assuming O)(S)
• O = Order(On A, Slack n, GroupBy B1,…,Bm) is a specification of the assumed ordering over the output stream
Performs a buffer-based approximate sort
Equivalent to n passes of a bubble sort
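One way to picture the buffer-based approximate sort is the Python sketch below. It assumes a buffer of n + 1 tuples from which the tuple with the smallest value of A is emitted each time the buffer fills; the name `bsort`, the dict-based tuples, the final flush of a finite input and the omission of GroupBy are simplifications made for this example.

```python
import heapq
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical tuple representation: field name -> value.
StreamTuple = Dict[str, int]


def bsort(stream: Iterable[StreamTuple], on: str, slack: int) -> Iterator[StreamTuple]:
    """Approximate sort on attribute `on` using a buffer of slack + 1 tuples.

    Each time the buffer fills, the tuple with the minimal value of `on`
    is evicted and emitted. GroupBy is ignored in this sketch.
    """
    buffer: List[Tuple[int, int, StreamTuple]] = []
    for seq, t in enumerate(stream):
        # seq breaks ties so that the dicts themselves are never compared.
        heapq.heappush(buffer, (t[on], seq, t))
        if len(buffer) > slack:  # the buffer now holds slack + 1 tuples
            yield heapq.heappop(buffer)[2]
    # A real stream never ends; for a finite demo input, flush the rest.
    while buffer:
        yield heapq.heappop(buffer)[2]


# Illustrative data: a slightly disordered stream sorted with Slack 2.
stream = [{"A": a} for a in [1, 3, 1, 2, 4, 4, 8, 3, 4, 4]]
result = [t["A"] for t in bsort(stream, on="A", slack=2)]
# result == [1, 1, 2, 3, 4, 3, 4, 4, 4, 8]
```

The late 3 is still emitted out of order in this run: by the time it arrives, a 4 has already left the buffer, which is why the sort is only approximate.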


Aggregate

Applies "window functions" to sliding windows over its input stream
Form:
Aggregate(F, Assuming O, Size s, Advance i)(S)
• F: "window function" (SQL-style aggregate operation or Postgres-style user-defined function)
• O = Order(On A, Slack n, GroupBy B1,…,Bm) is an order specification over input stream S
• s: size of the window (measured in terms of values of A)
• i: integer or predicate that specifies how to advance the window when it slides
Output tuples have the form:
(TS = ts, A = a, B1 = u1,…, Bm = um) ++ (F(W))
• W: "window" of tuples from the input stream with values of A between a and a + s – 1
• ts: the smallest timestamp associated with tuples in W
• ++: denotes concatenation of 2 tuples
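The window arithmetic can be illustrated with a simplified Python sketch. It assumes a finite input that is already sorted on A (Slack 0), no GroupBy attributes and an integer Advance; the name `aggregate`, the output field name "result" and the sample data are hypothetical. Only windows that the input fully covers are emitted.

```python
from typing import Callable, Dict, List

# Hypothetical tuple representation: field name -> numeric value.
StreamTuple = Dict[str, int]


def aggregate(f: Callable[[List[StreamTuple]], int], size: int, advance: int,
              stream: List[StreamTuple], on: str = "A") -> List[StreamTuple]:
    """Sliding-window aggregate over a finite stream sorted on `on`.

    Each window W holds the tuples with a <= t[on] <= a + size - 1.
    The output tuple carries the smallest timestamp in W, the window
    start a, and F(W) under the (hypothetical) field name "result".
    """
    if advance <= 0:
        raise ValueError("advance must be a positive integer")
    out: List[StreamTuple] = []
    if not stream:
        return out
    a = stream[0][on]
    last = stream[-1][on]
    while a + size - 1 <= last:  # only windows fully covered by the input
        window = [t for t in stream if a <= t[on] <= a + size - 1]
        if window:
            out.append({
                "TS": min(t["TS"] for t in window),
                on: a,
                "result": f(window),
            })
        a += advance
    return out


# Illustrative data: sum of v over windows of Size 3 with Advance 2.
stream = [
    {"TS": 10, "A": 1, "v": 5},
    {"TS": 11, "A": 2, "v": 1},
    {"TS": 12, "A": 3, "v": 4},
    {"TS": 13, "A": 4, "v": 2},
    {"TS": 14, "A": 5, "v": 8},
    {"TS": 15, "A": 6, "v": 3},
]
sums = aggregate(lambda w: sum(t["v"] for t in w), size=3, advance=2, stream=stream)
# Windows cover A in [1, 3] and [3, 5]; the window starting at 5 is incomplete.
```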


Aggregate

Slack = 1 or more
Blocking: waiting for lost or late tuples to arrive in order to finish window calculations
Optional Timeout argument:
• Aggregate(F, Assuming O, Size s, Advance i, Timeout t)


Join

Is a binary join operator
Form:
Join(P, Size s, Left Assuming O1, Right Assuming O2)(S1, S2)
• P: predicate over pairs of tuples from input streams S1 and S2
• s: integer
• O1: order specification on some numeric or time-based attribute of S1 (A)
• O2: order specification on some numeric or time-based attribute of S2 (B)
For every in-order tuple t in S1 and u in S2, the concatenation of t and u (t ++ u) is output if:
• |t.A – u.B| ≤ s
• P holds of t and u
The QoS timestamp for the output tuple is the minimum timestamp of t and u
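The join condition can be sketched in Python over two finite lists. This is a simplification for illustration only: it ignores the order specifications and slack, uses a nested loop, and assumes that the two inputs have no field names in common other than "TS". The name `band_join` and the sample data are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical tuple representation: field name -> value (including "TS").
StreamTuple = Dict[str, object]


def band_join(p: Callable[[StreamTuple, StreamTuple], bool], size: int,
              left: List[StreamTuple], right: List[StreamTuple],
              a: str = "A", b: str = "B") -> List[StreamTuple]:
    """Output t ++ u for every pair with |t.A - u.B| <= size where P holds.

    The output timestamp is the minimum of the two input timestamps.
    """
    out: List[StreamTuple] = []
    for t in left:
        for u in right:
            if abs(t[a] - u[b]) <= size and p(t, u):
                joined = {k: v for k, v in t.items() if k != "TS"}
                joined.update({k: v for k, v in u.items() if k != "TS"})
                joined["TS"] = min(t["TS"], u["TS"])
                out.append(joined)
    return out


# Illustrative data: both right tuples fall in the band of the first left
# tuple, but the predicate P accepts only the one with y == "p".
left = [{"TS": 1, "A": 10, "x": "a"}, {"TS": 4, "A": 20, "x": "b"}]
right = [{"TS": 2, "B": 11, "y": "p"}, {"TS": 3, "B": 12, "y": "q"}]
pairs = band_join(lambda t, u: u["y"] == "p", 2, left, right)
```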


Resample

Is an asymmetric, semijoin-like synchronization operator
Can be used to align pairs of streams
Form:
Resample(F, Size s, Left Assuming O1, Right Assuming O2)(S1, S2)
• F: "window function" over S1
• s: integer
• O1: order specification on some numeric or time-based attribute of S1 (A)
• O2: order specification on some numeric or time-based attribute of S2 (B)
For every tuple t from S1, output tuple:
(B1 : u.B1,…, Bm : u.Bm, A : t.A) ++ F(W(t))
• W(t) = {u ∈ S2 | u in order wrt O2 in S2 ∧ |t.A − u.B| ≤ s}


Run-time architecture

[Figure: Aurora run-time architecture. Labels: Inputs, Router, Outputs, Storage Manager (queues Q1, Q2, …, Qi, …, Qj, …, Qn; Buffer Manager; Persistent Storage), Scheduler, Box Processors, QoS Monitor, Load Shedder]

1. Router: routes tuples in the system; forwards them either to outputs or to the storage manager
2. Storage manager: responsible for maintaining the box queues and managing the buffer
3. Scheduler: decides which box will be processed. The scheduler pays special attention to reducing operator scheduling and invocation overheads. In particular, the scheduler batches (i.e., groups) multiple tuples and operators and executes each batch at once.
4. Box processor: executes the appropriate operation and forwards the output to the router
5. QoS monitor: observes outputs and activates the load shedder
6. Load shedder: sheds load until the performance reaches an acceptable level

Quality of Service (QoS)

QoS, in general, is a multidimensional function of several attributes of an Aurora system:
• Response times (production of output tuples)
• Tuple drops
• Values produced (importance of produced values)

The administrator specifies QoS graphs for each output based on one or more of these attributes

Other types of QoS functions can be defined too


QoS graphs

Graphs are expected to be normalized
Graphs should allow a properly sized network to operate with all outputs in a 'good zone'
Graphs should be convex (the value-based graph is an exception)

[Figure: three QoS graphs, each with a vertical axis from 0 to 1 and horizontal axes Delay, % tuples delivered and Output value; a "good zone" is marked]


Aurora Storage Manager (ASM) – Queue management

There is one queue at the output of each box; this queue is shared by all successor boxes

Queues are stored in memory and on disk
Queues may change length

[Figure: Queue organization. Labels: b1, b2, time, processed tuples]


Scheduling in Aurora

The scheduler (and Aurora) aims to reduce the overall tuple execution cost

It exploits two nonlinearities in tuple processing:

Interbox nonlinearity:
• Minimize tuple thrashing (if buffer space is not sufficient, tuples have to be shuttled between memory and disk)
• Avoid copying data from output to buffer (the ASM can be bypassed when one box is scheduled right after another)

Intrabox nonlinearity:
• The cost of tuple processing may decrease as the number of available tuples in the queue increases


Scheduling in Aurora

Aurora's approach: (1) have boxes queue as many tuples as possible, (2) process them at once (train scheduling), and (3) pass them to subsequent boxes without going to disk (superbox scheduling)

Two goals: (1) minimize the number of I/O operations and (2) minimize the number of box calls per tuple


Scheduler performance

[Figure: chart of time (ms, axis from 0 to 300) showing execution costs and scheduling overhead for tuple-at-a-time, train and superbox scheduling]


Priority assignment in the scheduler

The latency of each output tuple is the sum of the tuple's processing delay and its waiting delay (the latter is primarily a function of scheduling)

The goal of the scheduler: to assign priorities to box outputs so as to maximize the overall QoS

The scheduler's approach is divided into two aspects:
• state-based analysis, which assigns priorities to outputs and picks for scheduling the output with the highest utility
• feedback-based analysis, which observes the overall system and increases the priorities of outputs that are not doing well (based on the QoS graphs)

1. State-based: in this approach, the utility of an output is determined by computing how much QoS will be sacrificed if the execution of the output is deferred
2. Feedback-based: this approach continuously observes the performance of the system and dynamically reassigns priorities to outputs, increasing the priorities of those that are not doing well and decreasing the priorities of the applications that are already in their good zones

Load shedding

Reaction to overload
Drop is a system-level operator that randomly drops tuples from a stream at a specified rate

1. Load shedding by dropping tuples
2. Load shedding by filtering tuples

Load shedding has two potential problems: (1) overall system utility might be degraded more than necessary and (2) application semantics might be arbitrarily affected. In order to alleviate these problems, Aurora relies on QoS information to guide the load-shedding process.

Load shedding

Load shedding by dropping tuples

Reduces the amount of Aurora processing by dropping randomly selected tuples at strategic points in the network

We first identify the output with the smallest negative slope in its corresponding QoS graph.
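The random dropping itself can be sketched in a few lines of Python. This is an assumption-laden illustration rather than Aurora's Drop operator: the name `drop`, the list-based stream and the optional random generator argument are hypothetical, and the sketch does not decide where in the network the drop is placed.

```python
import random
from typing import List, Optional


def drop(stream: List[dict], rate: float,
         rng: Optional[random.Random] = None) -> List[dict]:
    """Randomly drop tuples from a stream at the given rate (0.0 to 1.0).

    Each tuple is dropped independently with probability `rate`;
    surviving tuples keep their original order and content.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    rng = rng or random.Random()
    # rng.random() is in [0.0, 1.0), so rate 0 keeps all and rate 1 drops all.
    return [t for t in stream if rng.random() >= rate]


# Illustrative data: shed roughly 30% of a stream of 1000 tuples.
stream = [{"TS": i, "value": i * i} for i in range(1000)]
survivors = drop(stream, rate=0.3, rng=random.Random(42))
```

Passing a seeded generator makes a run repeatable, which is convenient when comparing the effect of different drop rates.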

Load shedding

Load shedding by filtering tuples
Idea: remove less important tuples rather than randomly chosen ones
It uses value-based QoS information
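As a rough illustration of the idea, the Python sketch below sheds the tuples whose values have the lowest utility. It is not Aurora's mechanism: the name `shed_by_value`, the batch-oriented interface, the `shed_count` parameter and the example utility function (standing in for a value-based QoS graph) are all assumptions made for this example.

```python
from typing import Callable, Dict, List

# Hypothetical tuple representation: field name -> numeric value.
StreamTuple = Dict[str, float]


def shed_by_value(stream: List[StreamTuple],
                  utility: Callable[[float], float],
                  shed_count: int,
                  field: str = "value") -> List[StreamTuple]:
    """Drop the `shed_count` tuples whose values have the lowest utility.

    `utility` plays the role of a value-based QoS graph: it maps an
    output value to its importance. Surviving tuples keep their order.
    """
    # sorted() is stable, so ties are broken by arrival order.
    ranked = sorted(range(len(stream)), key=lambda i: utility(stream[i][field]))
    dropped = set(ranked[:shed_count])
    return [t for i, t in enumerate(stream) if i not in dropped]


# Illustrative utility: values near 50 matter most, extremes matter least.
def example_utility(v: float) -> float:
    return 1 - abs(v - 50) / 50


stream = [{"value": v} for v in [5, 50, 95, 40, 60]]
kept = shed_by_value(stream, example_utility, shed_count=2)
# The two extreme values (5 and 95) have the lowest utility and are shed.
```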


Questions

1. Which of the following operators output tuples that have the same schema and values as input tuples?
a. Aggregate
b. BSort (x)
c. Filter (x)
d. Join
e. Map
f. Resample
g. Union (x)


Questions

2. What does Aurora's primary run-time architecture include?
a. Router
b. Storage manager (x)
c. Scheduler (x)
d. Box processor
e. QoS monitor (x)
f. Resample
g. Load shedder (x)


Three broad application types

Aurora addresses three broad application types in a single, unique framework:
1. Real-time monitoring applications continuously monitor the present state of the world and are, thus, interested in the most current data as it arrives from the environment. In these applications, there is little or no need (or time) to store such data.
2. Archival applications are typically interested in the past. They are primarily concerned with processing large amounts of finite data stored in a time-series repository.
3. Spanning applications involve both the present and past states of the world, requiring combining and comparing incoming live data and stored historical data. These applications are the most demanding as there is a need to balance real-time requirements with efficient processing of large amounts of disk-resident data.