p2p, Fall 05
Topics in Database Systems: Data Management in Peer-to-Peer Systems
Search & Replication in Unstructured P2P
Overview
Centralized
Constantly updated directory hosted at central locations (does not scale well, update bottleneck, single point of failure)
Decentralized but structured
The overlay topology is highly controlled and files (or metadata/index) are not placed at random nodes but at specified locations
Decentralized and Unstructured
Peers connect in an ad-hoc fashion
The location of document/metadata is not controlled by the system
No guarantee of success for a search
No bounds on search time
Overview
Blind Search and Variations
No information about the location of items
Informed Search
Maintain (localized) index information
Local and Routing Indexes
Trade-off: cost of maintaining the indexes (when joining/leaving/updating) vs cost for search
Blind Search
Flood-based
each node contacts its neighbors, which in turn contact their neighbors, until the item is located
Exponential search time
No guarantees
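The flooding scheme can be sketched as a breadth-first flood with a TTL and duplicate suppression (an illustrative Python sketch, not part of the original slides; the graph representation, has_item predicate, and message counting are my own simplifications):

```python
from collections import deque

def flood_search(graph, start, has_item, ttl):
    """BFS flood: forward the query to all neighbors until the item is
    found or the TTL expires; the visited set suppresses duplicates."""
    visited = {start}
    frontier = deque([(start, ttl)])
    messages = 0
    while frontier:
        node, t = frontier.popleft()
        if has_item(node):
            return node, messages       # found: report hit and cost
        if t == 0:
            continue                    # TTL exhausted on this branch
        for nb in graph[node]:
            messages += 1               # every forwarded copy costs a message
            if nb not in visited:
                visited.add(nb)
                frontier.append((nb, t - 1))
    return None, messages               # no guarantee of success
```

The visited set stands in for the duplicate detection discussed below; real Gnutella nodes typically detect duplicates by message identifier instead.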
Blind Search: Issues
BFS vs DFS: BFS gives better response time but more messages
Iterative vs Recursive (return path)
TTL (time to live) parameter
Cycles (duplicate messages)
Connectivity
Power-Law Topologies: the i-th most connected node has ω/i^a neighbors (ω a constant)
Gnutella: Summary
• Completely decentralized
• Hit rates are high
• High fault tolerance
• Adapts well and dynamically to changing peer populations
• Protocol causes high network traffic (e.g., 3.5 Mbps). For example:
– with C = 4 connections per peer and TTL = 7, 1 ping packet can cause 2 · Σ_{i=0}^{TTL} C · (C − 1)^i = 26,240 packets
• No estimates on the duration of queries can be given
• No probability for successful queries can be given
• Topology is unknown, so algorithms cannot exploit it
• Free riding is a problem
• Reputation of peers is not addressed
• Simple and robust
Summary and Comparison of Approaches
System – Paradigm – Search Type – Search Cost (messages) – Autonomy
Gnutella – breadth-first search on graph – string comparison – 2 · Σ_{i=0}^{TTL} C · (C − 1)^i – very high
FreeNet – depth-first search on graph – string comparison – O(log n)? – very high
Chord – implicit binary search trees – equality – O(log n) – restricted
CAN – d-dimensional space – equality – O(d · n^(1/d)) – high
P-Grid – binary prefix trees – prefix – O(log n) – high
More on Search
Search Options
– Query expressiveness (type of queries)
– Comprehensiveness (all results, or just the first (or k) results)
– Topology
– Data placement
– Message routing
Comparison
Gnutella vs CAN (vs others?)
Dimensions: expressiveness, comprehensiveness, autonomy, efficiency, robustness
Topology: power law (Gnutella) vs grid (CAN)
Data Placement: arbitrary (Gnutella) vs hashing (CAN)
Message Routing: flooding (Gnutella) vs directed (CAN)
• Client-Server performs well
– But not always feasible
• Ideal performance is often not the key issue!
• Things that flood-based systems do well
– Scaling
– Decentralization of visibility and liability
– Finding popular stuff (e.g., caching)
– Fancy local queries
• Things that flood-based systems do poorly
– Finding unpopular stuff
– Fancy distributed queries
– Guarantees about anything (answer quality, privacy, etc.)
Blind Search Variations
Expanding Ring or Iterative Deepening:
Start BFS with a small TTL and repeat the BFS at increasing depths if the first BFS fails
Works well when there is some stop condition and a “small” flood will satisfy the query
Otherwise, it can create even bigger loads than standard flooding
Appropriate when hot objects are replicated more widely than cold objects
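Expanding ring can be sketched as repeated bounded floods (illustrative; the bounded-flood helper and the TTL schedule are my own choices):

```python
from collections import deque

def bounded_flood(graph, start, has_item, ttl):
    # plain BFS flood limited to `ttl` hops; returns (hit_node, messages)
    visited, frontier, msgs = {start}, deque([(start, ttl)]), 0
    while frontier:
        node, t = frontier.popleft()
        if has_item(node):
            return node, msgs
        if t == 0:
            continue
        for nb in graph[node]:
            msgs += 1
            if nb not in visited:
                visited.add(nb)
                frontier.append((nb, t - 1))
    return None, msgs

def expanding_ring(graph, start, has_item, ttls=(1, 2, 4, 7)):
    """Repeat the flood at increasing TTLs; stop at the first success.
    Cheap when the item is nearby, costly when every ring fails."""
    total = 0
    for ttl in ttls:
        hit, msgs = bounded_flood(graph, start, has_item, ttl)
        total += msgs
        if hit is not None:
            return hit, total
    return None, total
```

Note how a failed small ring's cost is paid again by the next, larger ring: this is the "even bigger load" case when no stop condition is met early.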
Modified-BFS:
Forward the query to only a fraction of the neighbors (a random subset)
Blind Search Methods
Random Walks:
The node that poses the query sends out k query messages to an equal number of randomly chosen neighbors
Each message follows its own path; at each step, the current node randomly chooses one neighbor to forward it to
Each path – a walker
Two methods to terminate each walker: TTL-based, or the checking method (the walkers periodically check with the query source whether the stop condition has been met)
It reduces the number of messages to k x TTL in the worst case
Some kind of local load-balancing
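A sketch of a k-walker random walk (simplified: walkers advance in lock-step and the stop condition is checked at every step rather than every 4th; graph and parameters are made up):

```python
import random

def k_walker_search(graph, start, has_item, k=4, ttl=1024, seed=0):
    """k random walkers advance in lock-step; at each step a walker
    moves to one uniformly chosen neighbor. Terminates as soon as any
    walker stands on a node holding the item, or when the TTL expires."""
    rng = random.Random(seed)
    walkers = [start] * k
    for _ in range(ttl):
        for i, node in enumerate(walkers):
            if has_item(node):
                return node
            walkers[i] = rng.choice(graph[node])
    return None
```

Worst-case message cost is k × TTL, as the slides state, rather than the exponential growth of flooding.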
Blind Search Methods
Random Walks
In addition, the protocol can bias its walks towards high-degree nodes (choose the highest-degree neighbor)
Topics in Database Systems: Data Management in Peer-to-Peer Systems
Q. Lv et al, “Search and Replication in Unstructured Peer-to-Peer Networks”, ICS’02
Search and Replication in Unstructured Peer-to-Peer Networks
Type of replication depends on the search strategy used
(i) A number of blind-search variations of flooding
(ii) A number of (metadata) replication strategies
Evaluation Method: Study how they work for a number of different topologies and query distributions
Methodology
Performance of search depends on
Network topology: graph formed by the p2p overlay network
Query distribution: the distribution of query frequencies for individual files
Replication: number of nodes that have a particular file
Assumption: fixed network topology and fixed query distribution
Results still hold, if one assumes that the time to complete a search is short compared to the time of change in network topology and in query distribution
Network Topology
(1) Power-Law Random Graph
A 9239-node random graph
Node degrees follow a power law distribution
when ranked from the most connected to the least, the i-th ranked node has ω/i^a neighbors, where ω is a constant
Once the node degrees are chosen, the nodes are connected randomly
Network Topology
(2) Normal Random Graph
A 9836-node random graph
Network Topology
(3) Gnutella Graph (Gnutella)
A 4736-node graph obtained in Oct 2000
Node degrees roughly follow a two-segment power law distribution
Network Topology
(4) Two-Dimensional Grid (Grid)
A two dimensional 100x100 grid
Query Distribution
Assume m objects
Let qi be the relative popularity of the i-th object (in terms of queries issued for it)
Values are normalized Σ i=1, m qi = 1
(1) Uniform: All objects are equally popular
qi = 1/m
(2) Zipf-like
qi ∝ 1 / i^α
Query Distribution & Replication
When the replication is uniform, the query distribution is irrelevant (since all objects are replicated by the same amount, search times are equivalent for both hot and cold items)
When the query distribution is uniform, all three replication distributions are equivalent (uniform!)
Thus, there are three relevant query-distribution/replication combinations
(1) Uniform/Uniform
(2) Zipf-like/Proportional
(3) Zipf-like/Square-root
Metrics
Pr(success): probability of finding the queried object before the search terminates
#hops: delay in finding an object as measured in number of hops
Metrics
#msgs per node: Overhead of an algorithm as measured in average number of search messages each node in the p2p has to process
#nodes visited
Percentage of message duplication
Peak #msgs: the number of messages that the busiest node has to process (to identify hot spots)
These are per-query measures
For an aggregated performance measure, each per-query measure is weighted by the probability of the corresponding query
Limitation of Flooding
There are many duplicate messages (due to cycles) particularly in high connectivity graphs
Multiple copies of a query are sent to a node by multiple neighbors
Avoiding cycles, decreases the number of links
Duplicated messages can be detected and not forwarded - BUT, the number of duplicate messages can still be excessive and worsens as TTL increases
Choice of TTL: too low, and the node may not find the object even if it exists; too high, and it burdens the network unnecessarily
Limitation of Flooding: Comparison of the topologies
Power-law and Gnutella-style graphs are particularly bad with flooding
Highly connected nodes mean more duplicate messages, because many nodes' neighbors overlap
The random graph is best, because in a truly random graph the duplication ratio (the likelihood that the next node has already received the query) is the same as the fraction of nodes visited so far, as long as that fraction is small
The random graph also gives a better load distribution among nodes
Random Walks
Experiments show that
16 to 64 walkers give good results
checking once every 4th step strikes a good balance between the overhead of the checking messages and the benefits of checking
Keeping state (when the same query reaches a node again, the node randomly chooses a different neighbor to forward it to)
Improves Random and Grid by reducing up to 30% the message overhead and up to 30% the number of hops
Small improvements for Gnutella and PLRG
Random Walks
When compared to flooding:
The 32-walker random walk reduces message overhead by roughly two orders of magnitude for all queries across all network topologies at the expense of a slight increase in the number of hops (increasing from 2-6 to 4-15)
When compared to expanding ring,
The 32-walker random walk outperforms expanding ring as well, particularly in PLRG and Gnutella graphs
Principles of Search
Adaptive termination is very important
Expanding ring or the checking method
Message duplication should be minimized
Preferably, each query should visit a node just once
Granularity of the coverage should be small
Each additional step should not significantly increase the number of nodes visited
Replication
Types of Replication
Two types of replication
Metadata/Index: replicate index entries
Data/Document replication: replicate the actual data (e.g., music files)
Types of Replication
Caching vs Replication
Cache: Store data retrieved from a previous request (client-initiated)
Replication: More proactive, a copy of a data item may be stored at a node even if the node has not requested it
Reasons for Replication
Reasons for replication
Performance
load balancing
locality: place copies close to the requestor
geographic locality (more choices for the next step in search)
reduce number of hops
Availability
In case of failures
Peer departures
Besides storage, a cost is associated with replication: consistency maintenance
Replication makes reads faster at the expense of slower writes
• No proactive replication (Gnutella)
– Hosts store and serve only what they requested
– A copy can be found only by probing a host with a copy
• Proactive replication of “keys” (= metadata + pointer) for search efficiency (FastTrack, DHTs)
• Proactive replication of “copies” – for search and download efficiency, anonymity (Freenet)
Issues
Which items (data/metadata) to replicate
Based on popularity
In traditional distributed systems, also rate of read/write
cost benefit: the ratio read-savings / write-increase
Where to replicate (allocation schema)
More Later
Issues
How/When to update
Both data items and metadata
“Database-Flavored” Replication Control Protocols
Let's assume the existence of a data item x with copies x1, x2, …, xn
x: logical data item
xi’s: physical data items
A replication control protocol is responsible for mapping each read/write on a logical data item (R(x)/W(x)) to a set of read/writes on a (possibly) proper subset of the physical data item copies of x
One Copy Serializability
Correctness
A DBMS for a replicated database should behave like a DBMS managing a one-copy (i.e., non-replicated) database insofar as users can tell
One-copy serializable (1SR)
The schedule of transactions on a replicated database must be equivalent to a serial execution of those transactions on a one-copy database
One-copy schedule: replace operations on data copies with operations on data items
ROWA
Read One/Write All (ROWA)
A replication control protocol that maps each read to only one copy of the item and each write to a set of writes on all physical data item copies.
If even one of the copies is unavailable, an update transaction cannot terminate
Write-All-Available
Write-all-available
A replication control protocol that maps each read to only one copy of the item and each write to a set of writes on all available physical data item copies.
Quorum-Based Voting
A transaction must collect a read quorum Vr to read, and a write quorum Vw to write, a data item
If a given data item has a total of V votes, the quorums have to obey the following rules:
1. Vr + Vw > V
2. Vw > V/2
Rule 1 ensures that a data item is not read or written by two transactions concurrently (R/W)
Rule 2 ensures that two write operations from two transactions cannot occur concurrently on the same data item (W/W)
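The two rules can be captured in a small checker (a hypothetical helper; the function name and examples are mine):

```python
def valid_quorums(vr, vw, v):
    """Rule 1: Vr + Vw > V (a read quorum and a write quorum always
    intersect, so read/write conflicts are detected).
    Rule 2: Vw > V/2 (two write quorums always intersect, so
    write/write conflicts are detected)."""
    return vr + vw > v and 2 * vw > v

# ROWA is the special case Vr = 1, Vw = V
```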
Distributing Writes
Immediate writes
Deferred writes
Access only one copy of the data item; delay the distribution of writes to other sites until the transaction has terminated and is ready to commit
The transaction maintains an intention list of deferred updates
After the transaction terminates, it sends the appropriate portion of the intention list to each site that contains replicated copies
Trade-offs: aborts cost less, but commitment may be delayed and the detection of conflicts is postponed
Primary or master copy
Updates at a single copy per item
Eager vs Lazy Replication
Eager replication: keeps all replicas synchronized by updating all replicas in a single transaction
Lazy replication: asynchronously propagate replica updates to other nodes after the replicating transaction commits
In p2p, lazy replication (or soft state)
Update Propagation
Who initiates the update:
Push: by the server holding the item (copy) that changes
Pull: by the client holding the copy
When
Periodic
Immediate
Lazy: when an inconsistency is detected
Threshold-based: on freshness (e.g., number of updates or actual time) or on value
Expiration-Time: Items expire (become invalid) after that time (most often used in p2p)
Stateless or stateful (the “item owners” know which nodes hold copies of the item)
Replication & Structured P2P
CHORD
Invariant to guarantee correctness of lookups:
Keep successor nodes up to date
Method: Maintain a successor list of the “r” nearest successors on the Chord ring
Why? Availability
How to keep it consistent: lazily, through periodic stabilization
This is metadata replication, or redundancy
CHORD
Method: Replicate data associated with a key at the k nodes succeeding the key
Why? Availability
Data replication
CAN
Multiple realities
With r realities, each node is assigned r coordinate zones, one in every reality, and holds r independent neighbor sets
Replicate the hash table at each reality
Availability: a lookup fails only if the replicas at all r realities fail
Performance: better search; forward the query to the neighbor with coordinates closest to the destination
Metadata replication
CAN
Overloading coordinate zones
Multiple nodes may share a zone
The hash table may be replicated among zones
Higher availability
Performance: choices in the number of neighbors, can select nodes closer in latency
Cost for Consistency
Metadata replication
CAN
Multiple Hash Functions
Use k different hash functions to map a single key onto k points in the coordinate space
Availability: fail only if all k replicas are unavailable
Performance: choose to send the query to the node closest in the coordinate space, or send it to all k nodes in parallel (k parallel searches)
Cost for Consistency
Query traffic (if parallel searches)
Metadata replication
CAN
Hot-spot Replication
A node that finds it is being overloaded by requests for a particular data key can replicate this key at each of its neighboring nodes
Then, with a certain probability, each neighbor can choose to either satisfy the request or forward it
Performance: load balancing
Metadata replication
CAN
Caching
Each node maintains a cache of the data keys it recently accessed
Before forwarding a request, it first checks whether the requested key is in its cache, and if so, it can satisfy the request without forwarding it any further
Number of cache entries per key grows in direct proportion to its popularity
Metadata replication
Replication Theory: Replica Allocation Policies
Question: how to use replication to improve search efficiency in unstructured networks with a proactive replication mechanism?
Replication: Allocation Scheme
How many copies of each object so that the search overhead for the object is minimized, assuming that the total amount of storage for objects in the network is fixed
Replication Theory
Assume m objects and n nodes
Each object i is replicated on ri distinct nodes, and the total number of objects stored is R, that is
Σ i=1, m ri = R
Also, pi = ri/R
Assume that object i is requested with relative rates qi, we normalize it by setting
Σ i=1, m qi = 1
For convenience, assume 1 << ri ≤ n and that q1 ≥ q2 ≥ … ≥ qm
Replication Theory
Assume that searches go on until a copy is found
Searches consist of
randomly probing sites until the desired object is found: search at each step draws a node uniformly at random and asks for a copy
Search Example (figure): one query finds a copy after 2 probes, another after 4 probes
Replication Theory
The probability Pr(k) that the object is found at the k-th probe is given by
Pr(k) = Pr(not found in the previous k−1 probes) × Pr(found at the k-th probe) = (1 − ri/n)^(k−1) × ri/n
k (the search size: the step at which the item is found) is a random variable with geometric distribution and θ = ri/n =>
expectation n/ri
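The geometric distribution can be checked numerically (an illustrative sketch; the values of n and r are made up):

```python
def pr_found_at(k, r, n):
    """Pr(the item is first found at the k-th probe): geometric
    distribution with success probability theta = r/n
    (r replicas among n nodes, probes drawn uniformly at random)."""
    theta = r / n
    return (1 - theta) ** (k - 1) * theta

# the expectation of a geometric distribution is 1/theta = n/r
n, r = 10000, 100
approx_expectation = sum(k * pr_found_at(k, r, n) for k in range(1, 2000))
# truncating at 2000 probes loses only a vanishing tail, so the sum
# lands very close to n/r = 100
```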
Replication Theory
Ai: Expectation (average search size) for object i is the inverse of the fraction of sites that have replicas of the object
Ai = n/ri
The average search size A of all the objects (average number of nodes probed per object query)
A = Σi qi Ai = n Σi qi/ri
Minimize: A = n Σi qi/ri
Replication Theory
If we have no limit on ri, replicate everything everywhere
Then, the average search size
Ai = n/ri = 1
Search becomes trivial
How to allocate these R replicas among the m objects: how many replicas per object
Assume a limit on R and that the average number of replicas per site ρ = R/n is fixed
Uniform Replication
Create the same number of replicas for each object
ri = R/m
Average search size for uniform replication
Ai = n/ri = m/ρ
Auniform = Σi qi m/ρ = m/ρ (= m n/R)
Which is independent of the query distribution
It makes sense to allocate more copies to objects that are frequently queried, this should reduce the search size for the more popular objects
Proportional Replication
Create a number of replicas for each object proportional to the query rate
ri = R qi
Uniform and Proportional Replication
Summary:
• Uniform Allocation: pi = 1/m
– Simple, resources are divided equally
• Proportional Allocation: pi = qi
– “Fair”, resources per item proportional to demand
– Reflects current P2P practices
Example: 3 items, q1 = 1/2, q2 = 1/3, q3 = 1/6 (figure compares the uniform and proportional allocations)
Proportional Replication
Number of replicas for each object:
ri = R qi
Average search size for proportional replication
Ai = n/ri = n/(R qi)
Aproportional = Σi qi n/(R qi) = Σi n/R = m/ρ = Auniform
again independent of the query distribution
Why? Objects whose query rates are greater than average (> 1/m) do better with proportional, and the others do better with uniform
The weighted average balances out to be the same
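A numeric check, with made-up parameters, that uniform and proportional allocation yield the same average search size m·n/R:

```python
n, m, R = 10_000, 100, 1_000          # nodes, objects, total replicas (made up)
q = [1 / (i + 1) for i in range(m)]   # Zipf-like query rates
s = sum(q)
q = [x / s for x in q]                # normalize: sum(q) == 1

def avg_search_size(r):
    # A = n * sum_i q_i / r_i  (expected probes, averaged over queries)
    return n * sum(qi / ri for qi, ri in zip(q, r))

r_uniform = [R / m] * m               # r_i = R/m
r_prop = [R * qi for qi in q]         # r_i = R * q_i

A_uniform = avg_search_size(r_uniform)
A_prop = avg_search_size(r_prop)
# both equal m * n / R = 1000, independent of the query distribution
```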
So what is the optimal way to allocate replicas so that A is minimized?
Space of Possible Allocations
As the query rate decreases, how should the ratio of allocated replicas pi+1/pi behave relative to the query-rate ratio qi+1/qi?
Reasonable: pi+1/pi ≤ 1
(= 1 for uniform)
Space of Possible Allocations
Definition: Allocation p1, p2, p3, …, pm is “in-between” Uniform and Proportional if
for 1 < i < m, qi+1/qi < pi+1/pi < 1
(the ratio is = 1 for uniform and = qi+1/qi for proportional; we want to favor popular items, but not too much)
Theorem 1: All (strictly) in-between strategies are (strictly) better than Uniform and Proportional
Theorem 2: p is worse than Uniform/Proportional if for all i, pi+1/pi > 1 (more popular gets less), OR
for all i, qi+1/qi > pi+1/pi (less popular gets less than its “fair share”)
Proportional and Uniform are the worst “reasonable” strategies
(Figure: the space of allocations on 2 items, with axes q2/q1 and p2/p1. Allocations where the more popular item gets less, or gets more than its proportional share, are worse than proportional/uniform; allocations in between, including SR, are better. Uniform and Proportional mark the boundaries.)
So, what is the best strategy?
Square-Root Replication
Find ri that minimizes A,
A = Σi qi Ai = n Σi qi/ri
This is done for ri = λ √qi where λ = R/Σi √qi
Then the average search size is
Aoptimal = 1/ρ (Σi √qi)²
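Extending the same made-up-parameter sketch to the square-root allocation; that it beats uniform follows from Cauchy–Schwarz, (Σi √qi)² ≤ m:

```python
import math

n, m, R = 10_000, 100, 1_000          # made-up parameters
q = [1 / (i + 1) for i in range(m)]   # Zipf-like query rates
s = sum(q)
q = [x / s for x in q]

lam = R / sum(math.sqrt(x) for x in q)
r_sqrt = [lam * math.sqrt(x) for x in q]     # r_i = lambda * sqrt(q_i)

rho = R / n
A_sqrt = n * sum(qi / ri for qi, ri in zip(q, r_sqrt))
A_optimal = (1 / rho) * sum(math.sqrt(x) for x in q) ** 2
A_uniform = m / rho
# A_sqrt matches the closed form and is strictly below A_uniform
```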
How much can we gain by using SR?
(Figure: the ratio Auniform/ASR for Zipf-like query rates qi ∝ 1/i^w)
Other Metrics: Discussion
Utilization rate, the rate of requests that a replica of an object i receives
Ui = R qi/ri
For uniform replication, all objects have the same average search size, but replicas have utilization rates proportional to their query rates
Proportional replication achieves perfect load balancing with all replicas having the same utilization rate, but average search sizes vary with more popular objects having smaller average search sizes than less popular ones
Replication: Summary
Pareto Distribution (for the queries)
Pareto principle: 80-20 rule
80% of the wealth owned by 20% of the population
Zipf: what is the size of the r-th ranked element
Pareto: how many elements have size > r
Replication (summary)
Each object i is replicated on ri nodes and the total number of objects stored is R, that is
Σ i=1, m ri = R
(1) Uniform: All objects are replicated at the same number of nodes
ri = R/m
(2) Proportional: The replication of an object is proportional to the query probability of the object
ri ∝ qi
(3) Square-root: The replication of an object i is proportional to the square root of its query probability qi
ri ∝ √qi
What is the search size of a query ?
Soluble queries: number of probes until answer is found.
Insoluble queries: maximum search size
Query is soluble if there are sufficiently many copies of the item.
Query is insoluble if item is rare or non existent.
Assumption that there is at least one copy per object
What is the optimal strategy?
• SR is best for soluble queries
• Uniform minimizes the cost of insoluble queries
• OPT is a hybrid of Uniform and SR, tuned to balance the cost of soluble and insoluble queries
(Figure: Uniform vs SR for 10^4 items, Zipf-like w = 1.5, under all-soluble, 85%-soluble, and all-insoluble workloads)
We now know what we need.
How do we get there?
Replication Algorithms
Desired properties of the algorithm:
• Fully distributed, where peers communicate through random probes; minimal bookkeeping; and no more communication than what is needed for search
• Converges to/obtains the SR allocation when query rates remain steady
Uniform and Proportional are “easy”:
– Uniform: when an item is created, replicate its key at a fixed number of hosts
– Proportional: for each query, replicate the key at a fixed number of hosts (need to know or estimate the query rate)
Replication - Implementation
Two strategies are popular
Owner Replication
When a search is successful, the object is stored at the requestor node only (used in Gnutella)
Path Replication
When a search succeeds, the object is stored at all nodes along the path from the requestor node to the provider node (used in Freenet)
Following the reverse path back to the requestor
Achieving Square-Root Replication
How can we achieve square-root replication in practice?
Assume that each query keeps track of the search size
Each time a query is finished the object is copied to a number of sites proportional to the number of probes
On average, object i will be replicated c n/ri times each time a query is issued (for some constant c)
It can be shown that this gives square root
Replication - Conclusion
Thus, for Square-root replication
an object should be replicated at a number of nodes that is proportional to the number of probes that the search required
Replication - Implementation
If a p2p system uses k-walkers,
the number of nodes between the requestor and the provider node is 1/k of the total nodes visited (number of probes)
Then, path replication should result in square-root replication
Problem: tends to place replicas at nodes that lie along the same topological path
Replication - Implementation
Random Replication
When a search succeeds, we count the number of nodes on the path between the requestor and the provider
Say p
Then, randomly pick p of the nodes that the k walkers visited to replicate the object
Harder to implement
Achieving Square-Root Replication
What about replica deletion?
Steady state: the replica creation rate equals the deletion rate
The lifetime of replicas must be independent of object identity or query rate
FIFO or random deletion is OK
LRU or LFU is not (their replica lifetimes depend on the query rate)
Replication: Evaluation
Study the three replication strategies in the Random graph network topology
Simulation Details
• Place the m distinct objects randomly into the network
• Query generator generates queries according to a Poisson process at 5 queries/sec
• Zipf-distribution of queries among the m objects (with a = 1.2)
• For each query, the initiator is chosen randomly
• Then a 32-walker random walk with state keeping and checking every 4 steps
• Each site stores at most objAllow (40) objects
• Random Deletion
• Warm-up period of 10,000 secs
• Snapshots every 2,000 query chunks
Replication: Evaluation
For each replication strategy
What kind of replication ratio distribution does the strategy generate?
What is the average number of messages per node in a system using the strategy?
What is the distribution of the number of hops in a system using the strategy?
Evaluation: Replication Ratio
Both path and random replication generate replication ratios quite close to the square root of the query rates
Evaluation: Messages
Path replication and random replication reduce the overall message traffic by a factor of 3 to 4
Evaluation: Hops
Much of the traffic reduction comes from reducing the number of hops
Path and random, better than owner
For example, the fraction of queries that finish within 4 hops: 71% for owner, 86% for path, 91% for random replication
Summary
• Random search/replication model: probes to “random” hosts
• Proportional allocation – current practice
• Uniform allocation – best for insoluble queries
• Soluble queries:
– Proportional and Uniform allocations are two extremes with the same average performance
– Square-Root allocation minimizes the Average Search Size
• OPT (all queries) lies between SR and Uniform
• SR/OPT allocation can be realized by simple algorithms
Replication & Unstructured P2P
epidemic algorithms
Replication Policy
– How many copies
– Where (owner, path, random path)
Update Policy
– Synchronous vs asynchronous
– Master copy
Methods for spreading updates:
Push: updates originate from the site where the update appeared and must reach the sites that hold copies
Pull: the sites holding copies contact the master site
Expiration times
Epidemics for spreading updates
Update at a single site
Randomized algorithms for distributing updates and driving replicas towards consistency
Ensure that the effect of every update is eventually reflected to all replicas:
Sites become fully consistent only when all updating activity has stopped and the system has become quiescent
Analogous to epidemics
A. Demers et al, Epidemic Algorithms for Replicated Database Maintenance, SOSP 87
Methods for spreading updates:
Direct mail: each new update is immediately mailed from its originating site to all other sites
Timely & reasonably efficient
Not all sites know all other sites
Mails may be lost
Anti-entropy: every site regularly chooses another site at random and by exchanging content resolves any differences between them
Extremely reliable but requires exchanging content and resolving updates
Propagates updates much more slowly than direct mail
Methods for spreading updates:
Rumor mongering:
Sites are initially “ignorant”; when a site receives a new update it becomes a “hot rumor”
While a site holds a hot rumor, it periodically chooses another site at random and ensures that the other site has seen the update
When a site has tried to share a hot rumor with too many sites that have already seen it, the site stops treating the rumor as hot and retains the update without propagating it further
Rumor cycles can be more frequent than anti-entropy cycles, because they require fewer resources at each site, but there is a chance that an update will not reach all sites
Anti-entropy and rumor spreading are examples of epidemic algorithms
Three types of sites:
Infective: a site that holds an update that it is willing to share
Susceptible: A site that has not yet received an update
Removed: A site that has received an update but is no longer willing to share
Anti-entropy: simple epidemic where all sites are always either infective or susceptible
A set S of n sites, each storing a copy of a database
The database copy at site s ∈ S is a time-varying partial function
s.ValueOf: K → {u: V × t: T}
where K is the set of keys, V the set of values, and T the set of timestamps (totally ordered by <)
V contains the element NIL
s.ValueOf[k] = (NIL, t): the item with key k was deleted from the database by time t
For simplicity, assume just one item:
s.ValueOf ∈ {u: V × t: T}
thus, an ordered pair consisting of a value and a timestamp
The first component may be NIL, indicating that the item was deleted by the time indicated by the second component
The goal of the update distribution process is to drive the system towards
∀ s, s' ∈ S: s.ValueOf = s'.ValueOf
Operation invoked to update the database:
Update[u: V] ≡ s.ValueOf ← (u, Now[])
Direct Mail
At the site s where an update occurs:
for each s' ∈ S: PostMail[to: s', msg: (“Update”, s.ValueOf)]
Each site s' receiving an update message (“Update”, (u, t)):
if s'.ValueOf.t < t then s'.ValueOf ← (u, t)
The complete set S must be known to s (stateful server)
PostMail messages are queued so that the server is not delayed (asynchronous), but messages may be lost when queues overflow or their destinations are inaccessible for a long time
n (number of sites) messages per update
Traffic proportional to n and the average distance between sites
Anti-Entropy
At each site s, periodically execute:
for some s' ∈ S: ResolveDifference[s, s']
Three ways to execute ResolveDifference:
Push (sender/server-driven): if s.ValueOf.t > s'.ValueOf.t then s'.ValueOf ← s.ValueOf
Pull (receiver/client-driven): if s.ValueOf.t < s'.ValueOf.t then s.ValueOf ← s'.ValueOf
Push-pull: apply both rules, so the newer value wins in either direction
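A toy pull-based anti-entropy round, assuming one item per site stored as a (value, timestamp) pair (a hypothetical sketch, not the paper's code; sites, rounds, and seeding are made up):

```python
import random

def anti_entropy_pull(values, rounds, seed=1):
    """Each round, every site pulls from one random peer and adopts the
    pair with the newer timestamp. `values` maps site -> (value, t)."""
    rng = random.Random(seed)
    sites = list(values)
    for _ in range(rounds):
        for s in sites:
            peer = rng.choice([p for p in sites if p != s])
            if values[peer][1] > values[s][1]:
                values[s] = values[peer]
    return values

# one site starts with the update; all sites converge to it
state = {i: (0, 0) for i in range(20)}
state[0] = ("new", 1)
anti_entropy_pull(state, rounds=50)
```

The number of rounds needed grows only logarithmically with the population size, matching the claim on the next slide.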
Anti-Entropy
Assume that
Site s’ is chosen uniformly at random from the set S
Each site executes the anti-entropy algorithm once per period
It can be proved that
An update will eventually infect the entire population
Starting from a single infected site, this can be achieved in time proportional to the log of the population size
Anti-Entropy
Let pi be the probability of a site remaining susceptible after the i-th cycle of anti-entropy
For pull:
A site remains susceptible after the (i+1)-th cycle if (a) it was susceptible after the i-th cycle and (b) it contacted a susceptible site in the (i+1)-th cycle
pi+1 = (pi)^2
For push:
A site remains susceptible after the (i+1)-th cycle if (a) it was susceptible after the i-th cycle and (b) no infectious site chose to contact it in the (i+1)-th cycle
pi+1 = pi (1 – 1/n)^{n(1-pi)}
(1 – 1/n: probability the site is not contacted by one given infectious node; n(1-pi): number of infectious nodes at cycle i)
Pull is preferable to push
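Iterating the two recurrences shows numerically why pull wins (a small sketch; the starting point p0 = 1 − 1/n assumes a single initially infected site):

```python
def susceptible_fraction(n, cycles, mode):
    """Iterate p_{i+1} = p_i^2 (pull) or
    p_{i+1} = p_i (1 - 1/n)^{n(1 - p_i)} (push), from p_0 = 1 - 1/n."""
    p = 1.0 - 1.0 / n
    for _ in range(cycles):
        if mode == "pull":
            p = p * p                                    # quadratic convergence
        else:
            p = p * (1.0 - 1.0 / n) ** (n * (1.0 - p))   # slower decay
    return p

# after the same number of cycles, pull leaves far fewer susceptibles
assert susceptible_fraction(1000, 15, "pull") < susceptible_fraction(1000, 15, "push")
```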
106p2p, Fall 05
Anti-Entropy
compare the whole database instance sent over the network
Use checksums
what about recent updates known only in a few sites
+ a list of recent updates (now − timestamp < threshold τ)
First exchange and apply the recent updates, update the checksums, and then compare the checksums; the choice of τ is critical
Alternatively, maintain an inverted list of updates ordered by timestamp
Perform anti-entropy by exchanging updates in reverse timestamp order until the checksums agree
+ send only the updates; – when to stop?
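A sketch of the checksum-plus-recent-list exchange (the τ value, the md5 checksum, and the dict-of-(value, timestamp) layout are all assumptions for illustration):

```python
import hashlib
import json

THRESHOLD = 30.0  # tau: an update younger than this counts as "recent" (assumed)

def checksum(db, now):
    """Checksum over the non-recent part of the database only."""
    old = {k: v for k, (v, t) in db.items() if now - t >= THRESHOLD}
    return hashlib.md5(json.dumps(old, sort_keys=True).encode()).hexdigest()

def exchange(db_a, db_b, now):
    """Swap the recent-update lists first, then compare checksums;
    only on a mismatch would whole databases be shipped."""
    recent_a = {k: vt for k, vt in db_a.items() if now - vt[1] < THRESHOLD}
    recent_b = {k: vt for k, vt in db_b.items() if now - vt[1] < THRESHOLD}
    for k, (v, t) in recent_a.items():       # apply a's recents at b
        if k not in db_b or db_b[k][1] < t:
            db_b[k] = (v, t)
    for k, (v, t) in recent_b.items():       # and b's recents at a
        if k not in db_a or db_a[k][1] < t:
            db_a[k] = (v, t)
    return checksum(db_a, now) == checksum(db_b, now)
```

If τ is chosen well, the checksums agree almost always after the recent lists are applied, so the full databases never cross the network.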
107p2p, Fall 05
Complex Epidemics: Rumor Spreading
Initial State: n individuals initially inactive (susceptible)
Rumor planting&spreading:
We plant a rumor with one person who becomes active (infective), phoning other people at random and sharing the rumor
Every person bearing the rumor also becomes active and likewise shares the rumor
When an active individual makes an unnecessary phone call (the recipient already knows the rumor), then with probability 1/k the active individual loses interest in sharing the rumor (becomes removed)
We would like to know:
How fast the system converges to an inactive state (no one is infective)
The percentage of people that know the rumor when the inactive state is reached
108p2p, Fall 05
Complex Epidemics: Rumor Spreading
Let s, i, r be the fractions of individuals that are susceptible, infective and removed
s + i + r = 1
ds/dt = − s i
di/dt = s i − (1/k)(1 − s) i
Solving for the residue at steady state: s = e^{−(k+1)(1−s)}
The residue decreases exponentially with k, at the cost of more unnecessary phone calls:
For k = 1, 20% miss the rumor
For k = 2, only 6% miss it
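The residue equation can be checked numerically by fixed-point iteration (the iteration count is arbitrary; near the nontrivial root the map is a contraction, so the iteration converges):

```python
import math

def residue(k, iters=200):
    """Solve s = exp(-(k+1)(1-s)) for the nontrivial root below 1
    by fixed-point iteration starting from s = 0."""
    s = 0.0
    for _ in range(iters):
        s = math.exp(-(k + 1) * (1.0 - s))
    return s

s1 = residue(1)   # roughly 0.20: about 20% miss the rumor for k = 1
s2 = residue(2)   # roughly 0.06: about 6% miss it for k = 2
```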
109p2p, Fall 05
Criteria to characterize epidemics:
Residue
The value of s when i is zero: the remaining susceptibles when the epidemic finishes
Traffic
m = Total update traffic / Number of sites
Delay
Average delay (tavg): the difference between the time of the initial injection of an update and its arrival at a given site, averaged over all sites
Delay until the last site (tlast): the delay until the reception of the update by the last site that will receive it during the epidemic
110p2p, Fall 05
Blind vs. Feedback
Feedback variation: a sender loses interest only if the recipient knows the rumor
Blind variation: a sender loses interest with probability 1/k regardless of the recipient
Counter vs. Coin
Instead of losing interest with probability 1/k, use a counter so that we lose interest only after k unnecessary contacts
For these simple variations of rumor spreading, s = e^{−m}, where m is the traffic:
There are nm updates sent in total
The probability that a single site misses all of them is (1 − 1/n)^{nm} ≈ e^{−m}
Counters and feedback improve the delay, with counters playing the more significant role
111p2p, Fall 05
Push vs. Pull
Pull converges faster
If there are numerous independent updates, a pull request is likely to find a source with a non-empty rumor list
If the database is quiescent,
the push phase ceases to introduce traffic overhead,
while the pull continues to inject useless requests for updates
Among the simple variations of rumor spreading, counter, feedback and pull work best
112p2p, Fall 05
Minimization
Use push and pull together; if both sites know the update, only the site with the smaller counter is incremented
Connection Limit
A site can be the recipient of more than one push in a cycle, and for pull, a site can service an unlimited number of requests
What if we set a connection limit?
Push gets better (traffic is reduced: since the spread grows exponentially, most traffic occurs at the end)
Pull gets worse
113p2p, Fall 05
Hunting
If a connection is rejected, then the choosing site can “hunt” for alternate sites
Then push and pull behave similarly
114p2p, Fall 05
Complex Epidemic and Anti-entropy
Anti-entropy can be run infrequently to back up a complex epidemic, so that every update eventually reaches (or is superseded at) every site
What happens when an update is discovered during anti-entropy? Use rumor mongering (e.g., make it a hot rumor) or direct mail
115p2p, Fall 05
Deletion and Death Certificates
Replace deleted items with death certificates which carry timestamps and spread like ordinary data
When old copies of deleted items meet death certificates, the old items are removed.
But when to delete death certificates?
116p2p, Fall 05
Dormant Death Certificates
Define some threshold (but some items may be “resurrected”, i.e., re-appear)
If the death certificate is older than the expected time required to propagate it to all sites, then the existence of an obsolete copy of the corresponding data item is unlikely
Delete very old certificates at most sites, retaining “dormant” copies at only a few sites (like antibodies)
Use two thresholds, t1 and t2
+ a list of r retention sites names with each death certificate (chosen at random when the death certificate is created)
Once t1 is reached, all servers except those in the retention list delete the death certificate
Dormant death certificates are deleted when t1 + t2 is reached
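The two-threshold scheme can be sketched as follows (the values of t1, t2 and the retention-list size r are assumptions for illustration):

```python
import random

T1, T2 = 100.0, 1000.0   # assumed thresholds t1 and t2
R = 2                     # assumed number of retention sites r

class DeathCertificate:
    def __init__(self, key, created, all_sites):
        self.key, self.created = key, created
        # retention sites are chosen at random when the certificate is created
        self.retention = set(random.sample(all_sites, R))

    def keep_at(self, site, now):
        """Should `site` still store this certificate at time `now`?"""
        age = now - self.created
        if age < T1:
            return True                       # active at every site
        if age < T1 + T2:
            return site in self.retention     # dormant: retention sites only
        return False                          # t1 + t2 reached: delete everywhere
```

A dormant copy that later meets an obsolete data item would be reactivated and spread again like a fresh update.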
117p2p, Fall 05
Anti-Entropy with Dormant Death Certificates
Whenever a dormant death certificate encounters an obsolete data item, it must be “activated”
118p2p, Fall 05
Spatial Distribution
How to choose partners?
Consider spatial distributions in which the choice tends to favor nearby servers
119p2p, Fall 05
Spatial Distribution
The cost of sending an update to a nearby site is much lower than the cost of sending the update to a distant site
Favor nearby neighbors
Trade off between: Average traffic per link and Convergence times
Example: a linear network with nearest-neighbor connections only gives O(1) average traffic per link and O(n) convergence time, vs uniform random connections: O(n) and O(log n)
Determine the probability of connecting to a site at distance d
For spreading updates on a line, use a d^{−2} distribution: the probability of connecting to a site at distance d is proportional to d^{−2}
In general, each site s independently chooses connections according to a distribution that is a function of Qs(d), where Qs(d) is the cumulative number of sites at distance d or less from s
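Sampling a partner on a line with probability proportional to d^{−2} might look like this (a sketch; `random.choices` performs the weighted draw):

```python
import random

def pick_partner(me, n):
    """Pick an anti-entropy partner on a line of n sites
    with P(partner at distance d) proportional to 1/d**2."""
    others = [s for s in range(n) if s != me]
    weights = [1.0 / abs(s - me) ** 2 for s in others]
    return random.choices(others, weights=weights, k=1)[0]
```

With this weighting, nearest neighbors are chosen far more often than distant sites, yet every site has nonzero probability of being picked, which is what convergence requires.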
120p2p, Fall 05
Spatial Distribution and Anti-Entropy
Extensive simulation on the actual topology with a number of different spatial distributions
A different class of distributions less sensitive to sudden increases of Qs(d)
Let each site s build a list of the other sites sorted by their distances from s
Select anti-entropy exchange partners from the sorted list according to a function f(i), where i is its position on the list
(averaging the probabilities of selecting equidistant sites)
Non-uniform distributions induce less load on critical links
121p2p, Fall 05
Spatial Distribution and Rumors
Anti-entropy converges with probability 1 for a spatial distribution such that for every pair (s’, s) of sites there is a nonzero probability that s will choose to exchange data with s’
However, rumor mongering is less robust against changes in spatial distributions and network topology
As the spatial distribution is made less uniform, we can increase the value of k to compensate
122p2p, Fall 05
Replication II:
A Push&Pull Algorithm
Updates in Highly Unreliable, Replicated Peer-to-Peer Systems [Datta, Hauswirth, Aberer, ICDCS03]
123p2p, Fall 05
Replication in P2P systems
P-Grid
CAN
Unstructured P2P (sub-) network of replicas
How to update them?
124p2p, Fall 05
Problems in real-world P2P systems
• All replicas need to be informed of updates.
• Peers have low online probabilities and a quorum cannot be assumed.
• Eventual consistency is sufficient.
• Updates are relatively infrequent compared to queries.
• Metrics: Communication overhead, latency and percentage of replicas getting the update
125p2p, Fall 05
Problems in real-world P2P systems (continued)
• Replication factor is substantially higher than what is assumed for distributed databases.
• Connectivity among replicas is high.
• Connectivity graph is random.
126p2p, Fall 05
Updates in replicated P2P systems
The P2P system’s search algorithm will find a random online replica responsible for the key being searched.
The replicas need to be consistent (ideally)
Probabilistic guarantee: Best effort!
Assumption: each peer knows a subset of all the replicas for an item
(figure: some replicas are online, others offline)
127p2p, Fall 05
Update Propagation combines
A push phase is initiated by the originator of the update, which pushes the new update to a subset of responsible peers it knows; these in turn propagate it to responsible peers they know, etc. (similar to flooding with TTL)
A pull phase is initiated by a peer that needs to update its copy. For example, because (a) it was offline (disconnected) or (b) has received a pull request but is not sure that it has the most up-to-date copy
Push and pull are consecutive, but may overlap in time
128p2p, Fall 05
Algorithms
Push:
If replica p gets Push(U, V, RF, t) for a new (U, V) pair:
Define Rp = a random subset (of size R·fr) of the replicas known to p
With probability PF(t): send Push(U, V, RF ∪ Rp, t+1) to Rp \ RF
RF: partial list of peers that have already received the update; R: number of replicas; fr: fraction of the total replicas to which peers initially decide to forward the update (fan-out)
Each message carries the list of peers to which the update has already been sent
Parameters:
TTL counter t
PF(t): probability (locally determined at each peer) to forward the update
|Rp|: size of the random subset (fan-out)
Message fields: item U, version V, TTL counter t (similar to counter-based rumor mongering; propagation stops when the TTL expires)
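A sketch of the push handler (the function names, the `send` callback, and the version-based "new pair" test are illustrative assumptions; the paper's R·fr and PF(t) map to `fanout` and `pf`):

```python
import random

def make_push_handler(peer_replicas, fanout, pf, send):
    """Build the Push(U, V, R_f, t) handler for one replica.
    peer_replicas: replicas this peer knows; fanout: |R_p|;
    pf(t): forwarding probability; send(target, msg): transport callback.
    r_f must be a set/frozenset of already-informed peers."""
    seen = {}   # item -> newest version seen at this replica

    def on_push(item, version, r_f, t):
        if seen.get(item, -1) >= version:
            return                           # not a new (U, V) pair: ignore
        seen[item] = version
        r_p = set(random.sample(peer_replicas,
                                min(fanout, len(peer_replicas))))
        if random.random() < pf(t):          # forward with probability PF(t)
            for target in r_p - r_f:         # skip peers already on the list
                send(target, (item, version, r_f | r_p, t + 1))
    return on_push
```

The forwarded list `r_f | r_p` is what suppresses sequential duplicates, while `pf` and `fanout` limit parallel ones.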
129p2p, Fall 05
Selective Push
(figures: two propagation trees illustrating where extra update messages arise)
Avoid parallel redundant updates: messages are propagated only with probability PF < 1 and to a fraction of the neighbors
Avoid sequential redundant updates: partial lists of informed neighbors are transmitted with the message
130p2p, Fall 05
Algorithms
Strategy: push the update to online peers as soon as possible, so that later all online peers have the update (possibly pulled) with high probability
Pull:
If p comes online, or has received no Push for time T:
Contact online replicas
Pull updates based on version vectors
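The pull phase reduces to a version comparison against the online replicas a peer can contact (a sketch using plain item → version dicts in place of full version vectors):

```python
def pull(my_versions, contact_replicas):
    """On coming online (or after a silence of length T), fetch every item
    whose version at some contacted online replica is newer than ours.
    my_versions: dict item -> version; contact_replicas: list of such dicts."""
    fetched = {}
    for remote in contact_replicas:
        for item, version in remote.items():
            if (version > my_versions.get(item, -1)
                    and version > fetched.get(item, -1)):
                fetched[item] = version      # keep the newest version seen
    my_versions.update(fetched)
    return fetched
```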
131p2p, Fall 05
Scenario 1: Dynamic topology
(figure: a nine-node replica network whose connectivity changes over time)
132p2p, Fall 05
Scenario 2: Duplicate messages
(figure: update propagation over a nine-node replica network)
Legend: necessary messages, avoidable duplicates, unavoidable (?) duplicates
133p2p, Fall 05
Results: Impact of varying fanout
How many peers learn about the update?
A limited fan-out (fr) is sufficient to spread the update, since flooding grows exponentially. A large fan-out causes unnecessary duplicate messages.
134p2p, Fall 05
Results: Impact of probability of peer staying online in consecutive push rounds
Sigma (σ): the probability of online peers staying online in consecutive push rounds
135p2p, Fall 05
Results: Impact of varying probability of pushing
Reduce the probability of forwarding updates as the number of push rounds increases
136p2p, Fall 05
CUP: Controlled Update Propagation in Peer-to-Peer Networks [RoussopoulosBaker02]
PCX: Path Caching with Expiration
Cache index entries at intermediary nodes that lie on the path taken by a search query
Cached entries typically have expiration times
Not addressed: which items need to be updated, or whether the interest in updating particular entries has died out
CUP: Controlled Update Propagation
Asynchronously builds caches of index entries while answering search queries
+ Propagates updates of index entries to maintain these caches (pushes updates)
137p2p, Fall 05
CUP: Controlled Update Propagation in Peer-to-Peer Networks [RoussopoulosBaker02]
Every node maintains two logical channels per neighbor:
a query channel: used to forward search queries
an update channel: used to forward query responses asynchronously to a neighbor and to update index entries that are cached at the neighbor (to proactively push updates)
Queries travel to the node holding the item
Updates travel along the reverse path taken by a query
Query coalescing: if a node receives two or more queries for an item, it pushes only one instance
Just one update channel (the node does not keep a separate open connection per request): all responses go through the update channel, and interest bits tell the node to which neighbors to push each response
138p2p, Fall 05
CUP: Controlled Update Propagation in Peer-to-Peer Networks [RoussopoulosBaker02]
Each node decides individually:
When to receive updates
through registering its interest, plus an incentive-based policy to determine when to cut off incoming updates
When to propagate updates
139p2p, Fall 05
CUP: Controlled Update Propagation in Peer-to-Peer Networks [RoussopoulosBaker02]
For each key K, node n stores
a flag that indicates whether the node is waiting to receive an update for K in response to a query
an interest vector: each bit corresponds to a neighbor and is set or clear depending on whether the neighbor is or is not interested in receiving updates for K
a popularity measure or request frequency of each non-local key K for which it receives queries
The measure is used to re-evaluate whether it is beneficial to continue caching and receiving updates for K
140p2p, Fall 05
CUP: Controlled Update Propagation in Peer-to-Peer Networks [RoussopoulosBaker02]
For each key, the authority node that owns the key is the root of the CUP tree
Updates originate at the root of the tree and travel downstream to interested nodes
Types of updates: deletes, refresh, append
Example: A is the root for K3
Applicable to both structured and unstructured systems
In structured systems, the query path is well-defined with a bounded number of hops
141p2p, Fall 05
CUP: Controlled Update Propagation in Peer-to-Peer Networks [RoussopoulosBaker02]
Handling Queries for K:
1. Fresh entries for key K are cached:
use them to push the response to the querying neighbor
2. Key K is not in the cache:
K is added and marked as pending (to coalesce potential bursts of queries)
3. All cached entries for K have expired:
push the query onward
Handling Updates for K:
An update of K is forwarded only to neighbors that have registered interest in K
Also, an adaptive control mechanism to regulate the rate of pushed updates
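The three query cases and the interest bits can be sketched as one handler (the cache-entry layout and the function names are assumptions for illustration, not the paper's API):

```python
def handle_query(cache, key, neighbor, now, forward):
    """CUP query handling at one node.
    cache: key -> {"value", "expires", "pending", "interest"};
    forward(key): push the query toward the authority node for key."""
    entry = cache.get(key)
    if entry is not None and not entry["pending"] and entry["expires"] > now:
        entry["interest"].add(neighbor)        # case 1: fresh entry cached
        return entry["value"]
    if entry is None:                          # case 2: key not in cache
        cache[key] = {"value": None, "expires": 0.0,
                      "pending": True, "interest": {neighbor}}
        forward(key)
    else:                                      # case 3: entries expired/pending
        entry["interest"].add(neighbor)
        if not entry["pending"]:
            entry["pending"] = True            # mark pending: coalesce bursts
            forward(key)
    return None
```

Note how a burst of queries for the same key results in a single forwarded query, while every querying neighbor still gets its interest bit set for the eventual pushed response.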
142p2p, Fall 05
CUP: Controlled Update Propagation in Peer-to-Peer Networks [RoussopoulosBaker02]
Adaptive control mechanism to regulate the rate of pushed updates
Each node N has a capacity U for pushing updates that varies with its workload, network bandwidth and/or network connectivity
N divides U among its outgoing update channels such that each channel gets a share that is proportional to the length of its queue
Entries in the queue may be re-ordered
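The proportional split of the capacity U is simple to state in code (a sketch; a real node would also re-order entries within each queue, e.g., by popularity):

```python
def divide_capacity(capacity, queue_lengths):
    """Split the update-push capacity U among outgoing update channels
    in proportion to each channel's queue length (longer queue, larger share)."""
    total = sum(queue_lengths)
    if total == 0:
        return [0] * len(queue_lengths)     # nothing queued: nothing to push
    return [capacity * q / total for q in queue_lengths]

shares = divide_capacity(100, [3, 1, 0])    # -> [75.0, 25.0, 0.0]
```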