1 Data Management in Large-scale P2P Systems Patrick Valduriez, Esther Pacitti Atlas group, INRIA and LINA University of Nantes, France

1

Data Management in Large-scale P2P Systems

Patrick Valduriez, Esther Pacitti

Atlas group, INRIA and LINAUniversity of Nantes, France

2/26

Motivations

P2P systems Decentralized control, large scale Low-level, simple services

File sharing, computation sharing, com. sharing

Distributed database systems High-level data management services

queries, transactions, consistency, security, etc.

Centralized control, limited scale P2P + distributed database

Why? How?

3/26

Why high-level P2P data sharing?

Professional community example Medical doctors in a hospital may

want to share (some of) their patient data for an epidemiological study

They have their own, independent patient descriptions

They want to ask queries such as “age and weight of male patients diagnosed with disease X …” over their own descriptions

They don’t want to create a database and buy a server

4/26

Problem definition

P2P system No centralized control, very large scale Very dynamic: peers can join and leave

the network at any time Peers can be autonomous and unreliable

Techniques designed for distributed data management no longer apply Too static, need to be decentralized,

dynamic and self-adaptive

5/26

Outline

Data management in distributed systems

P2P systems Data management in P2P systems Data management in APPA

6/26

Data management basic principle

Data independence Hide implementation

details Provision for high-

level services Schema Queries (SQL, XQuery) Automatic

optimization Transactions Consistency Access control …

Logical view(schema)

Storage

Application Application

Storage

7/26

Distributed database system (DDBS)

Distribution transparency Global schema

Common data descriptions Distributed data placement

Centralized control through global catalog

Distributed functions Schema mapping Query processing Transaction management Access control Etc.

DistributedDatabaseSystem

DBMS1 DBMS2

Queries, Transactions

Site 1

Site 3Site 2

8/26

Scaling up DDBS

Distributed database systems Enterprise information systems Scale up to tens of databases

Data integration systems strong heterogeneity and autonomy of data

sources (files, databases, XML documents, ..) Limited functionality (queries) Scale up to hundreds of data sources

Parallel database systems Focus on high-performance and high-

availability Strong homogeneity Scale up to hundreds of data nodes

9/26

A generic P2P system

A user at a peer may access sharable data at remote peers

private

sharable

P2P software

private

sharable

P2P software

private

sharable

P2P software

10/26

Potential benefits of P2P systems

Scale up to very large numbers of peers

Dynamic self-organization Load balancing Parallel processing High availability through massive

replication

11/26

P2P vs DDBS

P2P DDBS

Joining the network

Upon peer’s initiative

Controled by DBA

Queries No schema, key-word based

Global schema, static optimization

Query answers Partial Complete

Content location

Using neighborsor DHT

Using directory

12/26

Requirements for P2P data management (1)

Autonomy of peers Peers should be able to join/leave at

any time, control their data wrt other (trusted) peers

Query expressiveness Key-lookup, key-word search, SQL-like

Efficiency Efficient use of bandwidth, computing

power, storage

13/26

Requirements for P2P data management (2)

Quality of service (QoS) User-perceived efficiency:

completeness of results, response time, data consistency, …

Fault-tolerance Efficiency and QoS despite failures

Security Data access control in the context of

very open systems

14/26

P2P network topologies

Unstructured systems e.g. SETI@home

Structured (DHT) systems e.g. CAN, CHORD

Super-peer (hybrid) systems e.g. Napster

15/26

P2P unstructured network

High autonomy (peer needs to know neighbor to login) Searching by flooding the network

general, inefficient High-fault tolerance with replication

p2p

datapeer 1

p2p

datapeer 2

p2p

datapeer 3

p2p

datapeer 4

16/26

P2P structured network

Efficient exact-match search O(log n) for put(key,value), get(key)

Limited autonomy since a peer is responsible for a range of keys

p2p

d(k1)peer 1

p2p

d(k2)peer 2

p2p

d(k3)peer 3

p2p

d(k4)peer 4

h(k1)= p1 h(k2)= p2 h(k3)= p3 h(k4)= p4

Distributed Hash Table (DHT)

17/26

Super-peer network

Super-peers can perform complex functions (meta-data management, indexing, acces control, etc.)

Efficiency and QoS Restricted autonomy SP = single point of failure => use several

p2sp

datapeer 1

p2sp

datapeer 2

p2sp

datapeer 3

p2sp

datapeer 4

sp2sp

sp2p

sp2sp

sp2p

18/26

P2P systems comparison

Requirements

Unstructured

DHT Super-peer

Autonomy high low avg

Query exp. high low high

Efficiency low high high

QoS low high high

Fault-tolerance

high high low

Security low low high

19/26

Data management in P2P systems

Current research focuses on Decentralized schema mappings

PeerDB: unstruct. network, keyword search only Extending DHT for complex querying

PIER : exact-match and join queries Query reformulation

Edutella: super-peer, RDF-based schemas Piazza: graph of pair-wise schema mappings

Replication generally limited to static read-only files P-Grid addresses updates in structured networks

20/26

Data management in APPA (Atlas P2P Architecture)

Objectives Scalability, availability and performance

Main features Network-independent architecture Layered, service-based architecture Replication with semantics-based reconciliation Decentralized schema management Schema-based query support and optimization Peer data caching

Prototype on JXTA Network-independent P2P services

21/26

Network independent APPA

Advanced Services

Query Processing

Replication Cache Management

Security

Basic Services

Group MembershipManagement

ConsensusManagement

P2P DataManagement

PeerManagement

PeerCommunication

P2P Network

Key-based Storage and RetrievalPeer ID Assignment Peer Linking

Internet

...

...

22/26

Different APPA architectures

Peer

Advancedservices

Basicservices

P2Pnetwork DHT

network

P2Pdata

localdata Peer

Basic servicesP2P network

Super-peer

Peer

Super-peerP2Pdata

localdata

Peer

Peer

Peer

Advancedservices

23/26

Schema management in APPA

Takes advantage of the collaborative nature of the applications Peers that wish to cooperate agree on

a Common Schema Description (CSD) Given 2 CSD relation definitions, an

example of peer mapping at peer p is: p:r(A,B,D) csd:r1(A,B,C), csd:r2(C,D,E)

Peer mappings stored as P2P data

24/26

Replication in APPA

Small-world assumption: peers work in smaller groups with time locality

Lazy multi-master replication n peers can update the same replica

Improves read performance and availability

Replica divergence solved by distributed log-based reconciliation

Exploit P2P data management service

25/26

Query processing in APPA

Given a SQL-like query on peer schema, performs query reformulation

Maps the query on CSD schemas query matching

Finds relevant peers query optimization

Selects best peers, taking replication into account

query decomposition and execution Exploits parallelism

26/26

Conclusion

Advanced P2P applications will need high-level data management services

Various P2P networks will improve Network-independence crucial to exploit

and combine them Many technical issues Important to characterize

applications that can most benefit from P2P wrt other distributed architectures

Documents

1 Data Management in Large-scale P2P Systems Patrick Valduriez, Esther Pacitti Atlas group, INRIA and LINA University of Nantes, France