Dissemination of Dynamic Data: Semantics, Algorithms, and Performance

Shetal Shah, IITB

Page 1

Shetal Shah, IITB

Dissemination of Dynamic Data: Semantics, Algorithms, and Performance

Page 2

More and more of the information we consume is dynamically constructed…

[Figure: a web page assembled dynamically from components: an Ad Component, several Headline Components, a Personalized Component, and a Navigation Component]

Page 3

Buying a camera? Track auctions…

Page 4

Dynamic Data

- Data gathered by (wireless sensor) networks
  - Sensors that monitor light, humidity, pressure, and heat
  - Network traffic passing via switches
- Sports scores
  - Score changes by 5 points
- Financials
  - Rice price changes by Rs. 10 compared to the previous day
  - Total value of stock portfolio exceeds $10,000

Rapid and unpredictable changes; time critical, value critical; used in on-line decision making.

Page 5

Continual Queries

A CQ is a standing query coupled with a trigger/select condition

CQ stock_monitor:
  SELECT stock_price FROM quotes
  WHEN stock_price - prev_stock_price > $0.5

CQ RFP_tracker:
  SELECT project_name, contact_info FROM RFP_DB
  WHERE skill_set_required ⋐ available_skills

Not every change at a source leads to a change in the result of the query
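The trigger behaviour can be sketched in a few lines; this is an illustrative Python sketch of the stock_monitor CQ above (all names are hypothetical, not code from the talk):

```python
# Illustrative sketch of the stock_monitor continual query: a standing
# select whose result is re-reported only when the trigger condition fires.

def stock_monitor(prev_reported, new_price, threshold=0.5):
    """Trigger: fire when the price moved by more than `threshold`
    since the last reported result."""
    return abs(new_price - prev_reported) > threshold

prices = [10.0, 10.2, 10.8, 11.0]   # stream of changes at the source
reported = [prices[0]]              # initial query result
for p in prices[1:]:
    if stock_monitor(reported[-1], p):
        reported.append(p)          # only this change reaches the client

# Three changes at the source, but only one new query result:
assert reported == [10.0, 10.8]
```

This makes the point of the slide concrete: the source changed three times, but the query result changed once.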

Page 6

Generic Architecture

[Figure: data sources (servers, sensors) connected through a network to proxies/caches/data aggregators, which in turn serve end-hosts (wired and mobile hosts) over a network]

Page 7

Where should the queries execute?

- At clients: can't optimize across clients and links.
- At the source (where changes take place):
  - Advantages: minimum number of refresh messages, high fidelity.
  - Main challenge: scalability; multiple sources are hard to handle.
- At Data Aggregators (DAs/proxies), placed at the edge of the network:
  - Advantages: allows scalability through consolidation; handles multiple data sources.
  - Main challenge: need mechanisms for maintaining data consistency at DAs.

Page 8

Coherency of Dynamic Data

- Strong coherency: the client and the source are always in sync with each other.
- Strong coherency is expensive! Relax it to c-coherency:
  - Time domain (t-coherency): the client is never out of sync with the source by more than t time units.
    e.g., traffic data not stale by more than a minute.
  - Value domain (v-coherency): the difference between the data values at the client and the source is bounded by v at all times.
    e.g., only interested in temperature changes larger than 1 degree.
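The two relaxations amount to simple predicates; a minimal sketch (function and parameter names are illustrative):

```python
# Minimal sketch of the two coherency relaxations: t-coherency bounds
# staleness in time, v-coherency bounds divergence in value.

def t_coherent(now, last_sync_time, t):
    """t-coherency: the client is never out of sync by more than t time units."""
    return now - last_sync_time <= t

def v_coherent(source_value, client_value, v):
    """v-coherency: |source - client| is bounded by v at all times."""
    return abs(source_value - client_value) <= v

# Traffic data must not be stale by more than a minute (t = 60 s):
ok_time = t_coherent(now=120.0, last_sync_time=90.0, t=60.0)
# Only temperature changes larger than 1 degree matter (v = 1.0):
ok_value = v_coherent(source_value=25.5, client_value=24.75, v=1.0)
assert ok_time and ok_value
assert not v_coherent(source_value=25.5, client_value=23.0, v=1.0)
```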

Page 9

Coherency Requirement (c)

e.g., temperature, max incoherency = 1 degree

|U(t) − S(t)| ≤ c

where S(t) is the data value at the source and U(t) is the data value at the client at time t.

Page 10

[Figure: data/query value at the server and at the client plotted against time T, with the coherency bounds around the client's value; excursions outside the bounds are violations]

Page 11

Source pushes interesting changes

+ Achieves v-coherency
+ Keeps network overhead minimum
- Poor scalability (has to maintain state and keep connections open)

Source → DA → User (push at each hop)

Page 12

Pull interesting changes

Pull after Time To Live (TTL) / Time To Next Refresh (TTR / TNR)

+ Can be implemented using the HTTP protocol
+ Stateless, and hence generally scalable with respect to state space and computation
- Need to estimate when a change of interest will happen
- Heavy polling for stringent coherency requirements or highly dynamic data
- Network overheads higher than for Push

User pulls from the server repository.
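Estimating when the next change of interest will happen is the crux of pull. One simple rate-based heuristic (the formula and clamp bounds here are illustrative assumptions, not the talk's algorithm) is:

```python
# Hedged sketch of adaptive TTR estimation for pull: poll sooner when the
# data changes fast relative to the coherency requirement c.

def next_ttr(old_value, new_value, elapsed, c, ttr_min=1.0, ttr_max=64.0):
    """Estimate the time until the value likely drifts by more than c."""
    rate = abs(new_value - old_value) / elapsed   # observed change per second
    if rate == 0.0:
        return ttr_max                            # nothing changed: back off
    # time for a drift of c at the observed rate, clamped to sane bounds
    return max(ttr_min, min(ttr_max, c / rate))

# The value moved 2.0 in 8 s (rate 0.25/s); with c = 4.0 we poll in 16 s.
ttr = next_ttr(10.0, 12.0, 8.0, 4.0)
assert ttr == 16.0
```

This also exposes the slide's complaint directly: a tight c or a fast-changing value drives the TTR toward its minimum, i.e., heavy polling.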

Page 13

Complementary Properties

Overheads (scalability): communication, computation, state space.

Algorithm | Comm. | Comp. | State space | Resiliency | Temporal coherency (fidelity)
----------|-------|-------|-------------|------------|------------------------------
Push      | Low   | High  | High        | Low        | High
Pull      | High  | Low   | Low         | High       | Low (for tight coherency), High (for loose coherency)

Page 14

Dynamic Content Distribution Networks

Goal: to create a scalable content dissemination network (CDN) for streaming/dynamic data.

Metric: fidelity, the % of time the coherency requirement is met.

Page 15

Dissemination Network: Example

Data set: p, q, r. Max clients: 2.

[Figure: the source serves repositories A, B, C, D; A holds p: 0.2, q: 0.2, r: 0.2 and B holds p: 0.4, r: 0.3, q: 0.3]

Page 16

Challenges – I

- Given the data and coherency needs of repositories, how should repositories cooperate to satisfy these needs?
- How should repositories refresh the data such that the coherency requirements of dependents are satisfied?
- How can the repository network be made resilient to failures? [VLDB02, VLDB03, IEEE TKDE]

Page 17

Challenges - II

- Given the data and the coherency available at repositories in the network, how should clients be assigned to repositories?
- Given the data and coherency needs of clients in the network, what data should reside in each repository, and at what coherency?
- If the client requirements keep changing, how and when should the repositories be reorganized? [RTSS 2004, VLDB 2005]

Page 18

Dynamics along three axes

- Data is dynamic, i.e., data changes rapidly and unpredictably.
- The data items that a client is interested in also change dynamically.
- The network is dynamic: nodes come and go.

Page 19

Data Dissemination

Page 20

Data Dissemination

- Different users have different coherency requirements for the same data item.
- The coherency requirement at a repository should be at least as stringent as that of its dependents.
- Repositories disseminate only changes of interest.

[Figure: the source serves repositories A (p: 0.2, q: 0.2, r: 0.2) and B (p: 0.4, r: 0.3, q: 0.3), with C, D, and a client requesting q at 0.4 served through a repository holding q at 0.3]

Page 21

Data dissemination -- must be done with care

Source → Repository P (c_P = 0.3) → Repository Q (c_Q = 0.5)

[Figure: the source value moves 1 → 1.2 → 1.4 → 1.5 → 1.7. P (c_P = 0.3) receives 1, 1.4, 1.7. If P naively filters with Q's requirement, Q receives only 1 and 1.7; while the source is at 1.5, Q still has 1, violating c_Q = 0.5.]

We should prevent missed updates!

Page 22

Source Based Dissemination Algorithm

For each data item, the source maintains:
- the unique coherency requirements of repositories, and
- the last update sent for each such coherency.

For every change, the source:
- finds the maximum coherency for which the change must be disseminated,
- tags the change with that coherency, and
- disseminates (changed data, tag).
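A compact sketch of this tagging scheme, using the running example's requirements c_P = 0.3 and c_Q = 0.5 (illustrative Python, and the delivery rule "a repository needing coherency c applies updates tagged ≥ c" is my reading of the scheme):

```python
# Sketch of the source-based scheme: the source tracks, per data item, each
# distinct coherency requirement and the last value sent at that coherency.
# For a new value it finds the loosest (maximum) violated coherency, tags
# the update with it, and disseminates (value, tag).

class Source:
    def __init__(self, coherencies):
        self.last_sent = {c: None for c in coherencies}

    def on_change(self, x):
        """Return (x, tag) if the change matters to someone, else None."""
        violated = [c for c, last in self.last_sent.items()
                    if last is None or abs(x - last) >= c]
        if not violated:
            return None
        for c in violated:            # remember what was last sent per coherency
            self.last_sent[c] = x
        return (x, max(violated))     # tag = loosest violated requirement

# Requirements from the example: c_P = 0.3, c_Q = 0.5.
src = Source([0.3, 0.5])
sent = [src.on_change(v) for v in [1.0, 1.2, 1.4, 1.5, 1.7]]
assert sent == [(1.0, 0.5), None, (1.4, 0.3), (1.5, 0.5), (1.7, 0.3)]
```

With that delivery rule, Q (c_Q = 0.5) receives 1.0 and 1.5, including the 1.5 that naive forwarding missed, while P (c_P = 0.3) receives every disseminated update.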

Page 23

Source Based Dissemination Algorithm

Source → Repository P (c_P = 0.3) → Repository Q (c_Q = 0.5)

[Figure: the same trace 1 → 1.2 → 1.4 → 1.5 → 1.7. The source tags 1.5 with coherency 0.5, so it reaches Q through P: P receives 1, 1.4, 1.5, 1.7 and Q receives 1, 1.5; no missed update.]

Page 24

Repository-Based Dissemination Algorithm

A repository P sends a change of interest to its dependent Q if

|x_P − x_Q| ≥ c_Q − c_P

where x_P is P's current value of the item and x_Q is the last value P sent to Q.
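Under the condition |x_P − x_Q| ≥ c_Q − c_P (my reconstruction of the slide's garbled formula), the repository-side check is one line; an illustrative sketch on the running example:

```python
# Sketch of the repository-based check: P forwards its current value x_p to
# dependent Q when it differs from the last value sent to Q (x_q) by at
# least c_Q - c_P. Names are illustrative.

def should_forward(x_p, x_q, c_p, c_q):
    return abs(x_p - x_q) >= c_q - c_p

# Running example: c_P = 0.3, c_Q = 0.5, so the margin is 0.2.
# P received 1.0, 1.4, 1.7 from the source; Q starts with 1.0.
assert should_forward(1.4, 1.0, 0.3, 0.5)   # |0.4| >= 0.2: forward 1.4
assert should_forward(1.7, 1.4, 0.3, 0.5)   # |0.3| >= 0.2: forward 1.7
```

The tighter margin c_Q − c_P (rather than c_Q alone) is what prevents the missed-update problem of the naive filter.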

Page 25

Repository-Based Dissemination Algorithm

Source → Repository P (c_P = 0.3) → Repository Q (c_Q = 0.5)

[Figure: the same trace 1 → 1.2 → 1.4 → 1.5 → 1.7. P receives 1, 1.4, 1.7 and forwards a value when it differs from Q's last by at least c_Q − c_P = 0.2, so Q receives 1, 1.4, 1.7.]

Page 26

Building the content distribution network

Choose parents for repositories such that the overall fidelity observed by the repositories is high: reduce communication and computational delays.

Page 27

If parents are not chosen judiciously

It may result in:
- uneven distribution of load on repositories,
- an increase in the number of messages in the system,
- an increase in loss of fidelity!

[Figure: the source serving repositories A (p: 0.2, q: 0.2, r: 0.2), B (p: 0.4, r: 0.3, q: 0.3), C, and D]

Page 28

DiTA

Repository N needs data item x.
- If the source has available push connections, or the source is the only node in the dissemination tree for x, N is made a child of the source.
- Else, N is inserted in the most suitable subtree, where:
  - N's ancestors have more stringent coherency requirements than N, and
  - N is closest to the root.

Page 29

Most Suitable Subtree?

- l: the smallest level in the subtree with a coherency requirement less stringent than N's.
- d: the communication delay from the root of the subtree to N.
- The subtree with the smallest l × d is the most suitable.

Essentially, minimize communication and computational delays!
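The selection rule itself is a one-liner; a sketch with illustrative candidate tuples (the l and d values below are made up for the example):

```python
# Sketch of the "most suitable subtree" rule: among candidate subtrees,
# pick the one minimizing l * d, where l is the smallest level at which N
# can attach under ancestors with more stringent coherency requirements
# and d is the communication delay from the subtree root to N.

def most_suitable(candidates):
    """candidates: (subtree_name, l, d) tuples; return the best subtree."""
    return min(candidates, key=lambda t: t[1] * t[2])[0]

# Illustrative candidates: attach at level 2 with 20 ms delay, or level 3
# with 10 ms delay; the latter wins (3 * 10 = 30 < 2 * 20 = 40).
best = most_suitable([("under A", 2, 20), ("under B", 3, 10)])
assert best == "under B"
```

Note how the product trades tree depth (computational delay at each hop) against link delay, which is exactly the stated goal.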

Page 30

Example

- Initially the network consists of just the source.
- A and B request service of q with coherency requirement 0.2; both become children of the source.
- C then requests service of q with coherency requirement 0.1.

[Figure: the dissemination tree after each step, showing C (q: 0.1) and A (q: 0.2)]

Page 31

Example

D requests service of q with coherency requirement 0.2.

[Figure: the dissemination tree before insertion, with nodes serving q at coherencies 0.1, 0.2, 0.3, 0.4, and 0.5 at various levels]

Page 32

Example

D requests service of q with coherency requirement 0.2.

[Figure: the same tree after insertion; D (q: 0.2) attaches in the most suitable subtree]

Page 33

Resiliency

Page 34

Handling Failures in the Network

- Need to detect permanent/transient failures in the network and to recover from them.
- Resiliency is obtained by adding redundancy.
- Without redundancy, failures → loss in fidelity.
- Adding redundancy can increase cost → possible loss of fidelity!
- Handle failures such that the cost of adding resiliency is low!

Page 35

Passive/Active Failure Handling

- Passive failure detection: the parent sends "I'm alive" messages at the end of every time interval.
  But what should the time interval be?
- Active failure handling: always be prepared for failures. For example, 2 repositories can serve the same data item at the same coherency to a child.
  This means lots of work → greater loss in fidelity.

Page 36

Middle Path

- A backup parent B is found for each data item that the repository needs.
- Let repository R want data item x with coherency c from parent P.
- B serves R with coherency k × c.
- At what coherency should B serve R?

[Figure: P serves R at coherency c; backup parent B serves R at k × c]

Page 37

If a parent fails

- Detection: the child gets two consecutive updates from the backup parent with no updates from the parent.
- Recovery: the backup parent is asked to serve at coherency c till we get an update from the parent.

[Figure: P (serving at c) fails; backup parent B, normally serving at k × c, takes over at c]
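The child-side detection and recovery logic can be sketched as a tiny state machine (illustrative names, with k = 2 as an example value):

```python
# Sketch of the failure handling above: backup parent B normally serves at
# k * c; after two consecutive updates from B with no update from parent P,
# the child assumes P failed and asks B to serve at c until P is heard again.

class Child:
    def __init__(self, c, k=2):
        self.c, self.k = c, k
        self.backup_coherency = k * c   # B's normal, relaxed requirement
        self.silent_count = 0           # B-updates since last P-update

    def on_update(self, sender):
        if sender == "P":
            self.silent_count = 0
            self.backup_coherency = self.k * self.c   # P alive: relax B again
        else:                           # update from backup parent B
            self.silent_count += 1
            if self.silent_count >= 2:  # detection rule from the slide
                self.backup_coherency = self.c        # recovery: B serves at c
        return self.backup_coherency

child = Child(c=0.2, k=2)
trace = [child.on_update(s) for s in ["P", "B", "B", "B", "P"]]
assert trace == [0.4, 0.4, 0.2, 0.2, 0.4]
```

The trace shows the middle path at work: B carries the relaxed load almost always, and tightens to c only while P is presumed down.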

Page 38

Adding Resiliency to DiTA

- A sibling of P is chosen as the backup parent of R.
- If P fails, the backup parent serves R with coherency c; the change is local.
- If P has no siblings, a sibling of the nearest ancestor is chosen.
- Else, the source is made the backup parent.

[Figure: R served by parent P at c, with P's sibling A as backup parent serving at k × c]

Page 39

Markov Analysis for k

Assumptions:
- Data changes as a random walk along the line.
- The probability of an increase is the same as that of a decrease.
- No assumptions are made about the unit of change or the time taken for a change.

Expected number of misses for any k ≤ 2k² − 2; for k = 2, expected number of misses ≤ 6.

Page 40

Experimental Methodology

- Physical network: 4 servers, 600 routers, 100 repositories
- Communication delay: 20-30 ms
- Computation delay: 3-5 ms
- Real stock traces: 100-1000
- Time duration of observations: 10,000 s
- Tight coherency range: 0.01 to 0.05; loose coherency range: 0.5 to 0.99

Page 41

Failure and Recovery Modelling

Failures and recovery are modeled based on trends observed in practice: the analysis of link failures in an IP backbone by G. Iannaccone et al., Internet Measurement Workshop 2002.

Recovery times: 50% < 1 min; 40% between 1 min and 20 min; 10% > 20 min.

[Figure: trend for time between failures]

Page 42

In the Presence of Failures, Varying Recovery Times

Addition of resiliency does improve fidelity.

Page 43

In the Presence of Failures, Varying Data Items

[Figure: fidelity under increasing load (number of data items)]

Fidelity improves with the addition of resiliency even for a large number of data items.

Page 44

In the Absence of Failures

[Figure: fidelity under increasing load (number of data items)]

Often, fidelity improves with the addition of resiliency, even in the absence of failures!

Page 45

Beyond Resiliency

- Scheduling
- Assigning clients to repositories
- Balancing load in the network
- Handling queries

Page 46

Acknowledgements

Allister Bernard & Vivek Sharma, S. Dharmarajan, Shweta Agarwal, T. Siva, Prof. C. Ravishankar, Prof. Sohoni and Prof. Rangaraj, Prof. S. Sudarshan, Prof. Krithi Ramamritham