Making Peer Databases Interact – A Vision for an Architecture Supporting Data Coordination


Working Group (in alphabetical order):

Bernstein Phil (4)

Kementsietsidis Tasos (2)

Kuper Gabriel (1)

Mylopoulos John (2)

Serafini Luciano (3)

Shvaiko Pavel (1)

Zaihrayeu Ilya (1)

Sites:

(1) University of Trento
(2) University of Toronto
(3) ITC-Irst, Trento
(4) Microsoft Research

Fausto Giunchiglia (1)

Madrid, 20 September 2002


The Talk

Peer-to-Peer Databases – The intuition

Preliminary Logical Architecture

The Running Example

Conclusion

… and Agents???


PEER-TO-PEER DATABASES – THE INTUITION


The Peer-to-Peer (P2P)

“Peer-to-peer is a class of applications that take advantage of resources – storage, cycles, content, human presence – available at the edges of the Internet. Because accessing these decentralized resources means operating in an environment of unstable connectivity and unpredictable IP addresses, peer-to-peer nodes must … have significant or total autonomy from central servers”

Quote from Clay Shirky (http://www.shirky.com/)


Examples of P2P Computing

Napster – a shared directory of available music, plus client software which allows users, for instance, to import and export files

Gnutella – a decentralized group membership and search protocol, mainly used for file sharing

Groove – a system which implements a secure shared space among peers

JXTA – aims at creating a common platform that makes it simple and easy to build a wide range of distributed services and applications in which every device is addressable as a peer

Is there a place for databases?

Motivating Example: Databases of Medical Patients

One patient may be described in several databases: pharmacist, family doctor and hospital

But the databases can use different patient ID formats, disease descriptions, etc.

Nevertheless, they may still need to interoperate

At this point data integration may suffice, if the patient goes to the same doctor, pharmacist and hospital

When a patient is injured on a ski holiday in another country, yet more databases need to get involved

Complete integration is likely to be infeasible

But dynamic integration of the databases relevant to one patient could have high value


Data (base) Coordination

“... Coordination is managing dependencies between interacting databases”

Why is it different from data (base) Integration?

No statically maintained global schema

Many of the parameters (metadata) influencing the interaction among peer databases are decided at run time, whereas integration is done at design time

Change in content of a node does not affect the overall system performance

… and

For any given query, nodes coordinate in order to define and use the most “appropriate” (virtual) schema – this is crucial for dealing with the strong dynamics of a P2P network


The Three Variances

Data integration mechanisms for randomly acquainted databases become impractical

We have three kinds of unpredictable run time factors, which influence the answer to a given query in a P2P network:

Network (dependent) variance: the network changes over time

Database (dependent) variance: different databases, if asked the same global query, will provide different answers

Query (dependent) variance: different queries, even if posed to the same database, will impose different points of view on the network


Good Enough Answers

In data coordination, it becomes hard to maintain a high quality level in the answers provided by the P2P network

High quality data can flow among the databases preserving (at the best possible level of approximation) soundness and completeness

Good Enough Answer (intuition) – high quality level answer which serves its purposes given the amount of effort made in computing it


Example of a Good Enough Answer

When planning his vacation in Trentino, John goes to a local travel agency (TA)

Unluckily, TA cannot offer John anything from its own database

Instead TA searches for single operators in the Trentino region (hotels, ski resorts, etc.)

TA starts communication sessions with some operators

TA queries for the necessary information (e.g., prices, conditions, availability)

As long as, for instance, TA gets a hotel John likes, this is Good Enough

Compared to the Motivating Example, much lower quality data coordination will probably suffice

Cost: 150 $
Avail: 05/01/03 – 15/01/03
Services: …


Tuning Coordination Over Time

A lot of metadata needs to be produced and maintained

Due to the strong dynamics of a P2P network, this is a crucial and hard task to perform because:

A node will never know the full list of its peers

A node will never know everything about its peers

Its knowledge will be hard to maintain and will easily become obsolete

There is a need for tuning/improving, on each peer, the quality of the interaction (for instance, with the help of learning algorithms, metadata editors, and so on)

There is an obvious trade-off between the quality of the answers and the effort made in maintaining coordination


VERY PRELIMINARY HINTS OF A LOGICAL ARCHITECTURE


A Proposed Architecture

Four basic ingredients:

1. Interest Groups

2. Acquaintances

3. Coordination Rules

4. Correspondence Rules


Interest Groups

Peer nodes know very little about the other nodes of the P2P network, and about the topics (e.g., Tourism, Medical care, …) on which their peers are able to answer queries

An Interest Group is a set of nodes which are able to answer queries about a certain topic

There is a Group Manager (GM) which is in charge of the management of the metadata needed in order to run the group

The main goal of GM is to compute the Query Scope (QS) – the set of nodes a query should be propagated to
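As a rough illustration of the Query Scope idea, the sketch below assumes the GM keeps nothing more than a map from topics to the nodes registered for them. The class `GroupManager`, its method names, the node numbers, and the exact-match comparison of topics are all hypothetical choices made for this example; the slides do not specify how a GM stores its metadata or matches topics.

```python
from collections import defaultdict


class GroupManager:
    """Hypothetical GM: group metadata is a topic -> member-nodes map."""

    def __init__(self):
        self._members = defaultdict(set)

    def register(self, node, topic):
        # A node joins the interest group for a topic.
        self._members[topic].add(node)

    def query_scope(self, topic, asking_node=None):
        # QS = nodes able to answer queries on the topic, minus the asker.
        return sorted(self._members.get(topic, set()) - {asking_node})


gm = GroupManager()
for node in (2, 4, 6):          # illustrative node identifiers
    gm.register(node, "Tourism")
gm.register(1, "Tourism")

scope = gm.query_scope("Tourism", asking_node=1)
print(scope)
```

A real GM would presumably also need approximate topic matching and some way to cope with nodes that join or leave, which this sketch leaves out.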


Acquaintances

Acquaintances are nodes that a node knows about and that have data relevant to answer specific queries

A node is an acquaintance of another node only with respect to (possibly, a schematic representation of) a query

There must be a way to compute how to propagate a query, to propagate results back, and to reconcile them with the results coming from the other acquaintances


Coordination Rules

Each acquaintance may be associated with one or more Coordination Rules

Coordination rules specify under what conditions, when, how and where to propagate queries or updates

A proposed implementation of coordination rules is as Event-Condition-Action (ECA) rules

Event can be an update or a query coming from the user or from another node

Condition refers to properties of the update or query (e.g., the type of query and/or which data are referenced by the query)

Action can be the translation and propagation of a given update or query to a particular acquaintance
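The Event-Condition-Action reading of a coordination rule can be sketched in a few lines. Everything named here (`Event`, `CoordinationRule`, `fire`, the `outbox` list, the payload layout, the node name `node_2`) is an assumption made for illustration; the slides only fix the three roles of event, condition and action.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    kind: str      # "query" or "update"
    source: str    # the user or the node the event came from
    payload: dict  # hypothetical parsed form of the query or update


@dataclass
class CoordinationRule:
    acquaintance: str                  # where to propagate
    condition: Callable[[Event], bool]
    action: Callable[[Event], dict]    # translates the payload

    def fire(self, event, outbox):
        # ECA: on an Event, if the Condition holds, run the Action and
        # queue the translated payload for the acquaintance.
        if self.condition(event):
            outbox.append((self.acquaintance, self.action(event)))
            return True
        return False


rule = CoordinationRule(
    acquaintance="node_2",
    condition=lambda e: e.kind == "query" and "disease" in e.payload["attributes"],
    # Placeholder translation: a real action would call correspondence rules.
    action=lambda e: {**e.payload, "translated": True},
)

outbox = []
rule.fire(Event("query", "user", {"attributes": ["disease"]}), outbox)
rule.fire(Event("update", "user", {"attributes": ["disease"]}), outbox)
print(outbox)
```

Only the first event is propagated, because the condition restricts the rule to queries that reference the `disease` attribute.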


Correspondence Rules

Each acquaintance is associated with one or more Correspondence Rules

Correspondence Rules translate queries and query results (semantic heterogeneity)

They are implemented as rewrite rules and are called by coordination rules, in their action and condition components

They can be used, for instance, to translate attribute or element names (Domain Relations)
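A minimal sketch of name translation as a rewrite step, assuming a correspondence can be stored as a plain dictionary between two schemas. The function names and the attribute names `disease`, `illness_desc`, `name` and `p_name` are hypothetical, and keeping names that have no rule unchanged is a design choice of this example, not something the slides prescribe.

```python
def translate_names(attributes, name_map):
    """Rewrite each attribute name via name_map; keep names with no rule."""
    return [name_map.get(a, a) for a in attributes]


def translate_row(row, name_map):
    """Rewrite the keys of a result row with the same kind of mapping."""
    return {name_map.get(k, k): v for k, v in row.items()}


# Hypothetical correspondence between a local and a remote schema.
outgoing = {"disease": "illness_desc", "name": "p_name"}
# Results travel the other way, so the incoming rules invert the mapping.
incoming = {remote: local for local, remote in outgoing.items()}

remote_query_attrs = translate_names(["name", "disease", "date"], outgoing)
local_row = translate_row({"p_name": "John", "illness_desc": "Leg fracture"}, incoming)
print(remote_query_attrs)
print(local_row)
```

Inverting the dictionary only works when the correspondence is one-to-one; many-to-one correspondences (for example, two name fields mapped to one) would need explicit rules in each direction.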


Level One Architecture

P2P Layer: P2P functionality's add-on

Local Data Source (LDS): database, file system, Web site, …

User Interface: user queries, results, …

Query Manager (QM) and Update Manager (UM): responsible for query and update propagation; manage coordination and correspondence rules, acquaintances, and interest groups

Wrapper: provides a translation layer between QM and UM, and LDS


A Proposed Strategy for Query Propagation

1. User submits query Q ()

2. Node defines query topic

3. Node sends to Group Manager (GM) request to define Query Scope (QS)

4. GM computes and sends back QS

5. Node 1 sends query to acquaintances in QS, and reports this fact to GM

6. Nodes 2 and 4 send answer to node 1

7. Nodes propagate the query to their acquaintances from QS and report this fact to GM

8. And so on…

9. Nodes which do not propagate any further report this fact to GM

10. Propagation stops when “no more propagation” is received from all boundary nodes

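[Figure: acquaintance graph of nodes 1 to 11 and the Group Manager (GM), annotated with the messages exchanged: 1. Q (); 2. Q (, topic); 3. QS (, topic) = ? sent to GM; 4. QS (, topic) = (2, 4, 6, 8, 9, 11); 5. “nodes 2 and 4 are reached”; results (Res) returned from nodes 2 and 4; “node 6 is reached”; “node 8 is reached”; “no more propagation from 8”; “no more propagation from 9”]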
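The propagation strategy can be simulated as a breadth-first walk over the acquaintance graph, restricted to the Query Scope. This is a sketch under stated assumptions: the function `propagate`, the shape of the GM log, and above all the acquaintance links in `acquaintances` are invented for illustration; only the scope (2, 4, 6, 8, 9, 11) is taken from step 4 of the example.

```python
from collections import deque


def propagate(start, acquaintances, query_scope):
    """Breadth-first propagation restricted to the Query Scope.

    Returns the nodes reached and the log of reports sent to the GM.
    """
    reached = {start}
    gm_log = []
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        # Forward only to acquaintances that are in QS and not yet reached.
        targets = [n for n in acquaintances.get(node, [])
                   if n in query_scope and n not in reached]
        if targets:
            gm_log.append((node, "reached", targets))
            reached.update(targets)
            frontier.extend(targets)
        else:
            # Boundary node: it tells the GM it will not propagate further.
            gm_log.append((node, "no more propagation", []))
    return reached - {start}, gm_log


# Hypothetical acquaintance links; node 3 is outside the scope.
acquaintances = {1: [2, 3, 4], 2: [6], 4: [8, 9], 6: [11]}
query_scope = {2, 4, 6, 8, 9, 11}

reached, gm_log = propagate(1, acquaintances, query_scope)
print(sorted(reached))
for entry in gm_log:
    print(entry)
```

In this simulation the loop ends exactly when every boundary node has logged “no more propagation”, which mirrors the stopping condition of step 10. A distributed implementation would have the GM detect that condition from the reports rather than from an empty queue.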


THE RUNNING EXAMPLE


“Toy” Databases

Recall Motivating Example:

Family Doctor DB
F: Prescription (PatID, P_Name, Illness_Desc, StartDate, RecoveryDate, Treatment, Type, Prescriptions);

Hospital DB
H: Patients (PID, Name, Disease, Treatment_Desc, In, Out);

Medical Office DB
M: Accidents (P_id, FN, LN, Address_Reason, Treatment_Taken, Prescription_Given, Date)

John, who suffers the accident, is described in H with ID “P12”, in F as “8”, and, when he turns to M, he is assigned ID “A13”


Query Example

Let's suppose the following query QM is posed to M:

Select FN, LN, Address_Reason, Treatment_Taken, Prescription_Given, Date
From “M:Accidents”
Where Address_Reason Like (‘%Fracture%’ Or ‘%Dislocation%’) And P_id = ‘A13’

QM is marked as a global query with topic T = “Medical Care in Canada”

After some search, T is matched with the topic “Medical Care in Toronto” of the interest group G


Group G

H is acquainted with F and P is acquainted with F; dashed lines are group metadata channels; H is GM of G

GM computes query scope QS = G = {F, H, P} for query QM

M gets acquainted with H

M: Accidents and H: Patients are matched

As a result, a set of Coordination Rules is generated


Examples of Coordination Rules

Coor # 1
Event: M:Q
Condition: Q:(Address_Reason Select OR Treatment_Taken Select) AND (P_id = ‘A13’ Where)
Action: Q = Apply (Q, Corr_Rules_Query)
Send (Q, H)

Coor # 2
Event: M:RH
Condition: None
Action: RM = Apply (RH, Corr_Rules_Results)

Where Corr_Rules_Query and Corr_Rules_Results are correspondence rules which translate the outgoing query and the incoming results


Query Propagation

P is not reachable because there is no acquaintance path from M to P

In the graph the following queries are circulating:

QH = Select Name, Disease, Treatment_Desc
From “H:Patients”
Where Disease Like (‘%Fracture%’ Or ‘%Dislocation%’) And PID = ‘P12’

QF = Select P_Name, Illness_Desc, Treatment
From “F:Prescription”
Where Illness_Desc Like (‘%Fracture%’ Or ‘%Dislocation%’) And PatID = ‘8’
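Why P never receives the query can be checked with a small reachability computation. The sketch assumes acquaintance links are directed (M knows H, H knows F, P knows F), which is one reading of the group description; the function `reachable` and the dictionary layout are hypothetical.

```python
def reachable(start, acquainted_with):
    """Nodes reachable from start by following acquaintance links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for peer in acquainted_with.get(node, ()):
            if peer not in seen and peer != start:
                seen.add(peer)
                stack.append(peer)
    return seen


# Directed acquaintance links, as assumed above for group G plus M.
acquainted_with = {"M": ["H"], "H": ["F"], "P": ["F"]}
query_scope = {"F", "H", "P"}

targets = reachable("M", acquainted_with) & query_scope
print(sorted(targets))                # nodes that receive a translated query
print(sorted(query_scope - targets))  # nodes in QS that are never reached
```

Under this reading, P is in the Query Scope but no chain of acquaintances leads to it from M, so only QH and QF circulate.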


Results Propagation and Reconciliation

H and F generate the following results:

ResH = <‘John’, ‘Forearm dislocation’, ‘Bandage’>

ResF = <‘John’, ‘Leg fracture’, ‘Leg put in plaster’>

On reaching M, the results are reconciled as follows:

ResM =
<‘John’, ‘Forearm dislocation’, ‘Bandage’>
<‘John’, ‘Leg fracture’, ‘Leg put in plaster’>
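The two local queries and the reconciliation step can be tried out with an in-memory SQLite database. This is only a sketch: the table names `H_Patients` and `F_Prescription` stand in for “H:Patients” and “F:Prescription”, both tables live in one database purely for convenience, the `Like (… Or …)` shorthand of the slides is expanded into one `LIKE` per pattern, F's patient ID column is taken to be PatID as in its schema, and reconciliation is assumed to be a duplicate-free concatenation. Only the values appearing in ResH and ResF are inserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE H_Patients (PID TEXT, Name TEXT, Disease TEXT,
                         Treatment_Desc TEXT, "In" TEXT, "Out" TEXT);
CREATE TABLE F_Prescription (PatID TEXT, P_Name TEXT, Illness_Desc TEXT,
                             StartDate TEXT, RecoveryDate TEXT, Treatment TEXT,
                             Type TEXT, Prescriptions TEXT);
""")
# Only the values that appear in ResH and ResF are filled in; the rest stay NULL.
conn.execute("INSERT INTO H_Patients (PID, Name, Disease, Treatment_Desc) "
             "VALUES ('P12', 'John', 'Forearm dislocation', 'Bandage')")
conn.execute("INSERT INTO F_Prescription (PatID, P_Name, Illness_Desc, Treatment) "
             "VALUES ('8', 'John', 'Leg fracture', 'Leg put in plaster')")

# Standard SQL needs one LIKE per pattern; SQLite's LIKE ignores ASCII case.
res_h = conn.execute(
    "SELECT Name, Disease, Treatment_Desc FROM H_Patients "
    "WHERE (Disease LIKE '%Fracture%' OR Disease LIKE '%Dislocation%') "
    "AND PID = 'P12'").fetchall()
res_f = conn.execute(
    "SELECT P_Name, Illness_Desc, Treatment FROM F_Prescription "
    "WHERE (Illness_Desc LIKE '%Fracture%' OR Illness_Desc LIKE '%Dislocation%') "
    "AND PatID = '8'").fetchall()

# Reconciliation at M, sketched as a duplicate-free concatenation.
res_m = list(dict.fromkeys(res_h + res_f))
print(res_m)
```

A fuller reconciliation would also translate the rows back into M's schema through the incoming correspondence rules; here the tuples are simply collected in arrival order.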


Variance and Good Enough Answers

Good Enough answers

ResM is incomplete: some fields from H: Patients and F: Prescription are missing

Nevertheless the results are good enough because they still serve the needs of M

Network Variance

If F is down, the results are even more incomplete

Database Variance

If M gets acquainted with F instead of H, only ResF is retrievable. F has a different “vision” of the world, as it is not acquainted with H

Query Variance

If in QM the ID of John is replaced by the ID of another patient who is not shared, then no Coordination Rules apply and therefore there is no propagation


Conclusion

First investigation of how to make databases interact in a P2P network. There are four main dimensions:

We must integrate data coming from autonomous, most often semantically heterogeneous, databases;

We must deal with network, database, and query variance. This is why we talk of data coordination, as distinct from data integration;

We will almost never get correct and complete answers. We must be content with answers which are good enough;

There is a need to tune metadata. This is required in order to cope with the dynamics of a P2P network.


References

Project website: http://www.dit.unitn.it/~p2p/

“Data Management for Peer-to-Peer Computing: A Vision”, WebDB 2002, P. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini, and I. Zaihrayeu

L. Serafini, F. Giunchiglia, J. Mylopoulos and P. Bernstein “The Local Relational Model: Model and Proof Theory”, tech. rep. IRST, Trento
