Data Management in Large-scale P2P Systems
Patrick Valduriez, Esther Pacitti
Atlas group, INRIA and LINA, University of Nantes, France
2/26
Motivations
P2P systems
  Decentralized control, large scale
  Low-level, simple services: file sharing, computation sharing, communication sharing
Distributed database systems
  High-level data management services: queries, transactions, consistency, security, etc.
  Centralized control, limited scale
P2P + distributed database
  Why? How?
3/26
Why high-level P2P data sharing?
Professional community example
  Medical doctors in a hospital may want to share (some of) their patient data for an epidemiological study
  They have their own, independent patient descriptions
  They want to ask queries such as “age and weight of male patients diagnosed with disease X …” over their own descriptions
  They don’t want to create a database and buy a server
4/26
Problem definition
P2P system
  No centralized control, very large scale
  Very dynamic: peers can join and leave the network at any time
  Peers can be autonomous and unreliable
Techniques designed for distributed data management no longer apply
  Too static; they need to be decentralized, dynamic and self-adaptive
5/26
Outline
Data management in distributed systems
P2P systems
Data management in P2P systems
Data management in APPA
6/26
Data management basic principle
Data independence
  Hide implementation details
Provision for high-level services
  Schema
  Queries (SQL, XQuery)
  Automatic optimization
  Transactions
  Consistency
  Access control
  …
[Figure: applications access storage through a logical view (schema)]
7/26
Distributed database system (DDBS)
Distribution transparency
  Global schema
    Common data descriptions
    Distributed data placement
  Centralized control through global catalog
Distributed functions
  Schema mapping
  Query processing
  Transaction management
  Access control
  Etc.
[Figure: Distributed Database System receiving queries and transactions; labels: DBMS1, DBMS2, Site 1, Site 2, Site 3]
8/26
Scaling up DDBS
Distributed database systems
  Enterprise information systems
  Scale up to tens of databases
Data integration systems
  Strong heterogeneity and autonomy of data sources (files, databases, XML documents, ...)
  Limited functionality (queries)
  Scale up to hundreds of data sources
Parallel database systems
  Focus on high performance and high availability
  Strong homogeneity
  Scale up to hundreds of data nodes
9/26
A generic P2P system
A user at a peer may access sharable data at remote peers
[Figure: three peers, each with private data, sharable data, and P2P software]
10/26
Potential benefits of P2P systems
Scale up to very large numbers of peers
Dynamic self-organization
Load balancing
Parallel processing
High availability through massive replication
11/26
P2P vs DDBS
                      P2P                        DDBS
Joining the network   Upon peer’s initiative     Controlled by DBA
Queries               No schema, keyword-based   Global schema, static optimization
Query answers         Partial                    Complete
Content location      Using neighbors or DHT     Using directory
12/26
Requirements for P2P data management (1)
Autonomy of peers
  Peers should be able to join/leave at any time and to control their data with respect to other (trusted) peers
Query expressiveness
  Key lookup, keyword search, SQL-like
Efficiency
  Efficient use of bandwidth, computing power, storage
13/26
Requirements for P2P data management (2)
Quality of service (QoS)
  User-perceived efficiency: completeness of results, response time, data consistency, …
Fault-tolerance
  Efficiency and QoS despite failures
Security
  Data access control in the context of very open systems
14/26
P2P network topologies
Unstructured systems e.g. SETI@home
Structured (DHT) systems e.g. CAN, CHORD
Super-peer (hybrid) systems e.g. Napster
15/26
P2P unstructured network
High autonomy (a peer only needs to know a neighbor to log in)
Searching by flooding the network
  General, inefficient
High fault-tolerance with replication
[Figure: peers 1 to 4, each holding data, connected by p2p links]
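
As an illustration of searching by flooding, here is a minimal Python sketch. It is not taken from any particular system: the overlay (`neighbors`), the per-peer keyword sets (`data`) and the hop limit (`ttl`) are all hypothetical names chosen for the example. It counts every forwarded copy of the query as a message, which shows why flooding is general but inefficient.

```python
from collections import deque

def flood_search(neighbors, data, start, keyword, ttl):
    """Breadth-first flooding from `start`, limited to `ttl` hops.

    neighbors: dict peer -> list of neighbor peers (hypothetical overlay)
    data:      dict peer -> set of keywords held by that peer
    Returns (peers holding the keyword, number of messages sent).
    """
    hits = []
    messages = 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        peer, hops = queue.popleft()
        if keyword in data.get(peer, set()):
            hits.append(peer)
        if hops == ttl:
            continue  # TTL exhausted: do not forward any further
        for nxt in neighbors.get(peer, []):
            messages += 1  # every forwarded copy costs a message
            if nxt not in visited:  # duplicates are received but dropped
                visited.add(nxt)
                queue.append((nxt, hops + 1))
    return hits, messages

# Hypothetical 4-peer overlay, loosely following the figure above.
neighbors = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
data = {1: set(), 2: {"x"}, 3: set(), 4: {"x", "y"}}
print(flood_search(neighbors, data, start=1, keyword="y", ttl=2))
```

With a smaller `ttl` the query costs fewer messages but may miss peers that hold matching data, which is one source of partial answers.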
16/26
P2P structured network
Efficient exact-match search
  O(log n) for put(key,value), get(key)
Limited autonomy since a peer is responsible for a range of keys
[Figure: Distributed Hash Table (DHT): peers 1 to 4 connected by p2p links, peer i storing d(ki), with h(k1) = p1, h(k2) = p2, h(k3) = p3, h(k4) = p4]
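
The put(key,value) / get(key) interface and the idea that each peer is responsible for a range of keys can be sketched as follows. This is a toy, single-process model under stated assumptions: the class name `ToyDHT`, the SHA-1 based hash `h` and the 16-bit ring are choices made for the example, and the multi-hop routing that yields the O(log n) cost in real DHTs is not simulated.

```python
import bisect
import hashlib

def h(value, bits=16):
    """Hash a string onto a ring of 2**bits identifiers (SHA-1 based)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

class ToyDHT:
    """Key-to-peer assignment only; multi-hop routing is not simulated."""

    def __init__(self, peer_names):
        # Each peer gets a ring position; it owns the keys that hash
        # between its predecessor (exclusive) and itself (inclusive).
        self.ring = sorted((h(name), name) for name in peer_names)
        self.ids = [pid for pid, _ in self.ring]
        self.store = {name: {} for name in peer_names}

    def responsible_peer(self, key):
        # First peer clockwise from h(key), wrapping around the ring.
        i = bisect.bisect_left(self.ids, h(key)) % len(self.ring)
        return self.ring[i][1]

    def put(self, key, value):
        self.store[self.responsible_peer(key)][key] = value

    def get(self, key):
        return self.store[self.responsible_peer(key)].get(key)

dht = ToyDHT(["p1", "p2", "p3", "p4"])
dht.put("k1", "d(k1)")
print(dht.responsible_peer("k1"), dht.get("k1"))
```

The peer that stores a key is decided by the hash function, not by the peer itself, which is the loss of autonomy noted above.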
17/26
Super-peer network
Super-peers can perform complex functions (meta-data management, indexing, access control, etc.)
  Efficiency and QoS
  Restricted autonomy
  SP = single point of failure => use several
[Figure: peers 1 to 4, each holding data; link labels: p2sp (peer to super-peer), sp2p (super-peer to peer), sp2sp (super-peer to super-peer)]
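
The indexing role of a super-peer can be sketched in a few lines. This is a hypothetical illustration, not the design of any specific system: the class `SuperPeer`, its `register` and `lookup` methods and the keyword index are assumptions made for the example. Peers advertise metadata over p2sp links, and a super-peer that cannot fully answer a query asks the other super-peers once over sp2sp links.

```python
class SuperPeer:
    """Hypothetical super-peer holding a keyword index for its peers."""

    def __init__(self, name):
        self.name = name
        self.index = {}    # keyword -> set of peer names
        self.others = []   # other super-peers (sp2sp links)

    def register(self, peer, keywords):
        """p2sp: a peer advertises metadata about its sharable data."""
        for kw in keywords:
            self.index.setdefault(kw, set()).add(peer)

    def lookup(self, keyword, forward=True):
        """Answer from the local index, then ask other super-peers once."""
        found = set(self.index.get(keyword, set()))
        if forward:
            for sp in self.others:
                found |= sp.lookup(keyword, forward=False)
        return found

sp1, sp2 = SuperPeer("sp1"), SuperPeer("sp2")
sp1.others, sp2.others = [sp2], [sp1]
sp1.register("peer 1", ["flu"])
sp2.register("peer 3", ["flu", "asthma"])
print(sorted(sp1.lookup("flu")))
```

Using two super-peers rather than one reflects the "use several" remark: if `sp1` fails, peers registered at `sp2` remain reachable.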
18/26
P2P systems comparison
Requirements      Unstructured   DHT    Super-peer
Autonomy          high           low    avg
Query exp.        high           low    high
Efficiency        low            high   high
QoS               low            high   high
Fault-tolerance   high           high   low
Security          low            low    high
19/26
Data management in P2P systems
Current research focuses on
  Decentralized schema mappings
    PeerDB: unstructured network, keyword search only
  Extending DHT for complex querying
    PIER: exact-match and join queries
  Query reformulation
    Edutella: super-peer, RDF-based schemas
    Piazza: graph of pair-wise schema mappings
Replication generally limited to static read-only files
  P-Grid addresses updates in structured networks
20/26
Data management in APPA (Atlas P2P Architecture)
Objectives
  Scalability, availability and performance
Main features
  Network-independent architecture
  Layered, service-based architecture
  Replication with semantics-based reconciliation
  Decentralized schema management
  Schema-based query support and optimization
  Peer data caching
Prototype on JXTA
  Network-independent P2P services
21/26
Network independent APPA
[Figure: APPA layered architecture, from top to bottom]
Advanced Services: Query Processing, Replication, Cache Management, Security
Basic Services: Group Membership Management, Consensus Management, P2P Data Management, Peer Management, Peer Communication
P2P Network: Key-based Storage and Retrieval, Peer ID Assignment, Peer Linking
Internet
22/26
Different APPA architectures
[Figure: APPA deployed on a DHT network and on a super-peer network; labels: Peer, Super-peer, Advanced services, Basic services, P2P network, P2P data, local data]
23/26
Schema management in APPA
Takes advantage of the collaborative nature of the applications
  Peers that wish to cooperate agree on a Common Schema Description (CSD)
  Given 2 CSD relation definitions, an example of peer mapping at peer p is:
    p:r(A,B,D) csd:r1(A,B,C), csd:r2(C,D,E)
Peer mappings stored as P2P data
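
One way to read the example mapping is that tuples of the peer relation r(A,B,D) correspond to the join of csd:r1 and csd:r2 on their shared attribute C, projected onto A, B and D. The connective between the two sides is not legible in the source, so this reading is an assumption. Under that assumption, a minimal Python sketch with made-up sample tuples looks like this (the function name `join_and_project` is hypothetical):

```python
def join_and_project(r1, r2, join_attr, out_attrs):
    """Natural join of two relations on one attribute, then projection.

    Relations are lists of dicts; this is a nested-loop join for clarity.
    """
    result = []
    for t1 in r1:
        for t2 in r2:
            if t1[join_attr] == t2[join_attr]:
                merged = {**t1, **t2}
                row = tuple(merged[a] for a in out_attrs)
                if row not in result:  # set semantics: drop duplicates
                    result.append(row)
    return result

# Made-up instances of csd:r1(A,B,C) and csd:r2(C,D,E).
csd_r1 = [{"A": 1, "B": "x", "C": 10}, {"A": 2, "B": "y", "C": 20}]
csd_r2 = [{"C": 10, "D": "d1", "E": 0}, {"C": 30, "D": "d3", "E": 1}]

# Tuples over (A, B, D) that the CSD side of the mapping describes.
print(join_and_project(csd_r1, csd_r2, "C", ("A", "B", "D")))
```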
24/26
Replication in APPA
Small-world assumption: peers work in smaller groups with time locality
Lazy multi-master replication
  n peers can update the same replica
  Improves read performance and availability
Replica divergence solved by distributed log-based reconciliation
  Exploits the P2P data management service
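
The idea of log-based reconciliation can be sketched as follows. This is a deliberately simplified stand-in, not APPA's algorithm: APPA is described as using semantics-based reconciliation, whereas the sketch orders updates by timestamp and lets the last writer win. The function name `reconcile`, the log format and the assumption of comparable timestamps are all hypothetical.

```python
def reconcile(logs):
    """Merge per-peer update logs into one deterministic schedule.

    logs: dict peer -> list of (timestamp, key, value) updates applied
    locally. Ties on timestamp are broken by peer name, so every peer
    that sees the same logs computes the same order and the same state.
    """
    updates = [
        (ts, peer, key, value)
        for peer, log in logs.items()
        for ts, key, value in log
    ]
    schedule = sorted(updates, key=lambda u: (u[0], u[1]))
    state = {}
    for ts, peer, key, value in schedule:
        state[key] = value  # later updates overwrite earlier ones
    return state, schedule

# Two masters updated the same replica independently (lazy propagation).
logs = {
    "p1": [(1, "x", "a"), (3, "y", "b")],
    "p2": [(2, "x", "c")],
}
state, schedule = reconcile(logs)
print(state)
```

Because the schedule depends only on the contents of the logs, replicas that exchange their logs converge to the same state regardless of the order in which the logs arrive.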
25/26
Query processing in APPA
Given a SQL-like query on peer schema, performs
  Query reformulation
    Maps the query on CSD schemas
  Query matching
    Finds relevant peers
  Query optimization
    Selects best peers, taking replication into account
  Query decomposition and execution
    Exploits parallelism
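
The four phases can be pictured as a small pipeline. The sketch below is purely illustrative and makes several assumptions that are not in the source: a query is reduced to a list of relation names, `peer_mapping` maps each peer relation to CSD relations, `catalog` records which peers hold which CSD relations, `load` is a made-up cost per peer, and `fetch` stands for whatever call retrieves a relation from a peer.

```python
from concurrent.futures import ThreadPoolExecutor

def reformulate(query_rels, peer_mapping):
    """Query reformulation: rewrite peer relations as CSD relations."""
    csd = []
    for rel in query_rels:
        csd.extend(peer_mapping[rel])
    return csd

def match(csd_rels, catalog):
    """Query matching: find the peers that hold each CSD relation."""
    return {rel: [p for p, rels in catalog.items() if rel in rels]
            for rel in csd_rels}

def optimize(candidates, load):
    """Query optimization: among replicas, pick the least-loaded peer.

    Raises ValueError if some relation is held by no peer.
    """
    return {rel: min(peers, key=lambda p: load[p])
            for rel, peers in candidates.items()}

def execute(plan, fetch):
    """Decomposition and execution: one subquery per relation, in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {rel: pool.submit(fetch, peer, rel)
                   for rel, peer in plan.items()}
        return {rel: f.result() for rel, f in futures.items()}

# Hypothetical inputs: r maps to r1 and r2, each replicated at two peers.
peer_mapping = {"r": ["r1", "r2"]}
catalog = {"p1": {"r1"}, "p2": {"r1", "r2"}, "p3": {"r2"}}
load = {"p1": 0.2, "p2": 0.9, "p3": 0.5}

plan = optimize(match(reformulate(["r"], peer_mapping), catalog), load)
results = execute(plan, lambda peer, rel: f"{rel}@{peer}")
print(plan, results)
```

A real optimizer would weigh more than a single load figure (for example, replica freshness or network distance); the single `load` value is only there to show where replication enters the choice of peers.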
26/26
Conclusion
Advanced P2P applications will need high-level data management services
Various P2P networks will improve
  Network-independence crucial to exploit and combine them
Many technical issues
Important to characterize applications that can most benefit from P2P with respect to other distributed architectures