Making Peer Databases Interact – A Vision for an Architecture Supporting Data Coordination
Working Group (in alph. order):
Bernstein Phil (4)
Kementsietsidis Tasos (2)
Kuper Gabriel (1)
Mylopoulos John (2)
Serafini Luciano (3)
Shvaiko Pavel (1)
Zaihrayeu Ilya (1)
Sites:
(1) University of Trento
(2) University of Toronto
(3) ITC-Irst, Trento
(4) Microsoft Research
Fausto Giunchiglia (1)
Madrid, 20 September 2002
The Talk
Peer-to-Peer Databases – The intuition
Preliminary Logical Architecture
The Running Example
Conclusion
… and Agents???
PEER-TO-PEER DATABASES – THE INTUITION
The Peer-to-Peer (P2P)
“Peer-to-peer is a class of applications that take advantage of resources – storage, cycles, content, human presence – available at the edges of the Internet. Because accessing these decentralized resources means operating in an environment of unstable connectivity and unpredictable IP addresses, peer-to-peer nodes must … have significant or total autonomy from central servers”
Quote from Clay Shirky (www.shirky.com)
Examples of P2P Computing
Napster – a shared directory of available music, with client software that lets users, for instance, import and export files
Gnutella – a decentralized group membership and search protocol, mainly used for file sharing
Groove – a system which implements a secure shared space among peers
JXTA – a project which aims at creating a common platform that makes it simple and easy to build a wide range of distributed services and applications in which every device is addressable as a peer
Is there a place for databases?
Motivating Example: Databases of Medical Patients
One patient may be described in several databases: pharmacist, family doctor and hospital
But the databases can use different patient ID formats, disease descriptions, etc.
Nevertheless they still may need to interoperate
At this point data integration may suffice, if the patient goes to the same doctor, pharmacist and hospital
When a patient is injured on a ski holiday in another country, yet more databases need to get involved
Complete integration is likely to be infeasible
But dynamic integration of databases relevant to one patient could have high value
Data (base) Coordination
“... Coordination is managing dependencies between interacting databases”
Why is it different from data (base) Integration?
No statically maintained global schema
Many of the parameters (metadata) influencing the interaction among peer databases are decided at run time, whereas integration is done at design time
A change in the content of a node does not affect the overall system performance
… and
For any given query, nodes coordinate in order to define and use the most “appropriate” (virtual) schema – this is crucial for dealing with the strong dynamics of a P2P network
The Three Variances
Data integration mechanisms for randomly acquainted databases become impractical
We have three kinds of unpredictable run time factors, which influence the answer to a given query in a P2P network:
Network (dependent) variance: the network changes over time
Database (dependent) variance: different databases, if asked the same global query, will provide different answers
Query (dependent) variance: different queries, even if posed to the same database, will impose different points of view on the network
Good Enough Answers
In data coordination, it becomes hard to maintain a high quality level in the answers provided by the P2P network
High quality data can flow among the databases preserving (at the best possible level of approximation) soundness and completeness
Good Enough Answer (intuition) – high quality level answer which serves its purposes given the amount of effort made in computing it
Example of a Good Enough Answer
When planning his vacation in Trentino, John goes to a local travel agency (TA)
Unluckily, TA cannot offer John anything from its own database
Instead TA searches for single operators in the Trentino region (hotels, ski resorts, etc.)
TA starts communication sessions with some operators
TA queries for the necessary information (e.g., prices, conditions, availability)
As long as, for instance, TA gets a hotel John likes, this is Good Enough
Compared to the Motivating Example, much lower quality data coordination will probably suffice
Cost: 150 $
Avail: 05/01/03 – 15/01/03
Services: …
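The travel-agency scenario can be read as an effort-bounded search: ask operators one at a time and stop at the first offer that serves the purpose. The sketch below illustrates only that intuition. The function `good_enough_search`, the operator list, the acceptance test, and the effort budget `max_asked` are hypothetical names introduced here, not part of the proposed architecture.

```python
from typing import Callable, Iterable, Optional, Tuple

Offer = dict  # hypothetical offer shape, e.g. {"operator": ..., "cost": ...}

def good_enough_search(
    operators: Iterable[Callable[[], Optional[Offer]]],
    is_good_enough: Callable[[Offer], bool],
    max_asked: int,
) -> Tuple[Optional[Offer], int]:
    """Ask operators in turn; stop at the first acceptable offer or when the budget is spent."""
    asked = 0
    for ask in operators:
        if asked >= max_asked:
            break
        asked += 1
        offer = ask()
        if offer is not None and is_good_enough(offer):
            return offer, asked
    return None, asked

# Hypothetical operators in the Trentino region; each returns an offer or None.
operators = [
    lambda: None,                                  # operator has nothing available
    lambda: {"operator": "hotel-A", "cost": 220},  # too expensive for John
    lambda: {"operator": "hotel-B", "cost": 150},  # acceptable
    lambda: {"operator": "hotel-C", "cost": 90},   # cheaper, but never asked
]

offer, asked = good_enough_search(operators, lambda o: o["cost"] <= 150, max_asked=4)
```

The search returns the offer from `hotel-B` after three requests and never contacts `hotel-C`: the answer is not optimal, but it serves its purpose for the effort spent.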
Tuning Coordination Over Time
A lot of metadata needs to be produced and maintained
Due to the strong dynamics of a P2P network, this is a crucial and hard task to perform because:
A node will never know the full list of its peers
A node will never know everything about its peers
Its knowledge will be hard to maintain and will easily become obsolete
There is a need to tune/improve, on each peer, the quality of the interaction (for instance, with the help of learning algorithms, metadata editors, and so on)
There is an obvious trade-off between the quality of the answers and the effort made in maintaining coordination
VERY PRELIMINARY HINTS OF A LOGICAL ARCHITECTURE
A Proposed Architecture
Four basic ingredients:
1. Interest Groups
2. Acquaintances
3. Coordination Rules
4. Correspondence Rules
Interest Groups
Peer nodes know very little about the other nodes of the P2P network, and about the topics (e.g., Tourism, Medical care, …) on which their peers are able to answer queries
An Interest Group is a set of nodes which are able to answer queries about a certain topic
There is a Group Manager (GM) which is in charge of the management of the metadata needed in order to run the group
The main goal of GM is to compute the Query Scope (QS) – the set of nodes a query should be propagated to
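As a rough illustration of how a Group Manager might compute a Query Scope, the sketch below keeps, for each member node, a set of keywords it can answer, and returns the members that share at least one keyword with the query. The class `GroupManager`, its methods, the keyword-matching criterion, and the node names are assumptions made for this example; the talk does not prescribe how QS is computed.

```python
from typing import Dict, Set

class GroupManager:
    """Hypothetical Group Manager (GM) holding the metadata of one interest group."""

    def __init__(self, topic: str) -> None:
        self.topic = topic
        self.members: Dict[str, Set[str]] = {}  # node id -> keywords the node can answer

    def register(self, node_id: str, keywords: Set[str]) -> None:
        """Record a member node and the keywords it claims to cover."""
        self.members[node_id] = set(keywords)

    def query_scope(self, query_keywords: Set[str]) -> Set[str]:
        """Query Scope (QS): members sharing at least one keyword with the query."""
        return {n for n, kw in self.members.items() if kw & query_keywords}

gm = GroupManager("Tourism")
gm.register("n1", {"hotels"})
gm.register("n2", {"hotels", "ski"})
gm.register("n3", {"restaurants"})

qs_ski = gm.query_scope({"ski"})
qs_hotels = gm.query_scope({"hotels"})
```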
Acquaintances
Acquaintances are nodes that a node knows about and that have data relevant to answer specific queries
A node is an acquaintance of another node only with respect to (possibly, a schematic representation of) a query
There must be a way to compute how to propagate a query, to propagate results back, and to reconcile them with the results coming from the other acquaintances
Coordination Rules
Each acquaintance may be associated with one or more Coordination Rules
Coordination rules specify under what conditions, when, how and where to propagate queries or updates
A proposed implementation of coordination rules is as Event-Condition-Action (ECA) rules
Event can be an update or a query coming from the user or from another node
Condition refers to properties of the update or query (e.g., the type of query and/or which data are referenced by the query)
Action can be the translation and propagation of a given update or query to a particular acquaintance
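The Event-Condition-Action reading of a coordination rule can be sketched as a small data structure plus a dispatcher. Everything below is illustrative: the `CoordinationRule` class, the `fire` function, and the dictionary shape of a message are assumptions of this example, not a specification from the talk.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Message = Dict[str, Any]  # hypothetical shape: {"kind": "query" | "update", "attrs": [...]}

@dataclass
class CoordinationRule:
    event: str                            # kind of message that triggers the rule
    condition: Callable[[Message], bool]  # property of the query or update
    action: Callable[[Message], Message]  # translation applied before sending
    target: str                           # acquaintance the message is propagated to

def fire(rules: List[CoordinationRule], message: Message) -> List[Tuple[str, Message]]:
    """Return (acquaintance, outgoing message) for every rule that matches."""
    outgoing = []
    for rule in rules:
        if rule.event == message["kind"] and rule.condition(message):
            outgoing.append((rule.target, rule.action(message)))
    return outgoing

rules = [
    CoordinationRule(
        event="query",
        condition=lambda m: "Disease" in m["attrs"],
        action=lambda m: {**m, "translated_for": "H"},  # stand-in for a real translation
        target="H",
    )
]

sent = fire(rules, {"kind": "query", "attrs": ["Name", "Disease"]})
ignored = fire(rules, {"kind": "update", "attrs": ["Disease"]})
```

Here the query is propagated to acquaintance `H`, while the update matches no event and is not propagated.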
Correspondence Rules
Each acquaintance is associated with one or more Correspondence Rules
Correspondence Rules translate queries and query results (semantic heterogeneity)
They are implemented as rewrite rules and are called by coordination rules, in their action and condition components
They can be used, for instance, to translate attribute or element names (Domain Relations)
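A correspondence rule of the simplest kind can be pictured as a renaming table for attributes plus a mapping between constants. The sketch below borrows table and attribute names from the running example of this talk, but the specific pairings (for instance `Address_Reason` to `Disease`), the dictionary representation of a query, and the function `apply_correspondence` are assumptions made for illustration.

```python
from typing import Any, Dict

Query = Dict[str, Any]  # hypothetical shape: {"select": [...], "from": ..., "where": {...}}

def apply_correspondence(query: Query, attr_map: Dict[str, str],
                         value_map: Dict[str, str], table: str) -> Query:
    """Rewrite attribute names and constants; attributes with no counterpart are dropped."""
    return {
        "select": [attr_map[a] for a in query["select"] if a in attr_map],
        "from": table,
        "where": {attr_map[a]: value_map.get(v, v)
                  for a, v in query["where"].items() if a in attr_map},
    }

# Assumed correspondences between M:Accidents and H:Patients.
ATTRS_M_TO_H = {
    "P_id": "PID",
    "Address_Reason": "Disease",
    "Treatment_Taken": "Treatment_Desc",
}
IDS_M_TO_H = {"A13": "P12"}  # domain relation between patient identifiers

q_m = {
    "select": ["Address_Reason", "Treatment_Taken", "Date"],
    "from": "M:Accidents",
    "where": {"P_id": "A13"},
}
q_h = apply_correspondence(q_m, ATTRS_M_TO_H, IDS_M_TO_H, "H:Patients")
```

`Date` has no counterpart in the assumed mapping, so it disappears from the translated query, which is one way a translated answer can end up incomplete.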
Level One Architecture
P2P Layer: P2P functionality add-on
Local Data Source (LDS): database, file system, Web site, …
User Interface (UI): user queries, results, …
Query Manager (QM) and Update Manager (UM): responsible for query and update propagation; manage coordination and correspondence rules, acquaintances, and interest groups
Wrapper: provides a translation layer between QM and UM, and LDS
A Proposed Strategy for Query Propagation
1. User submits query Q ()
2. Node defines query topic
3. Node sends the Group Manager (GM) a request to define the Query Scope (QS)
4. GM computes and sends back QS
5. Node 1 sends query to acquaintances in QS, and reports this fact to GM
6. Nodes 2 and 4 send answer to node 1
7. Nodes propagate the query to their acquaintances from QS and report this fact to GM
8. And so on…
9. Nodes which do not propagate any further report this fact to GM
10. Propagation stops when “no more propagation” is received from all boundary nodes
[Figure: network of nodes 1 to 11 and the GM, annotated with the messages exchanged: 1. Q (); 2. Q (, topic); 3. QS (, topic) = ? sent to GM; 4. QS (, topic) = (2, 4, 6, 8, 9, 11); 5. “nodes 2 and 4 are reached”; results (Res) returned by nodes 2 and 4; “node 6 is reached”; “node 8 is reached”; “no more propagation from 8”; “no more propagation from 9”]
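The propagation strategy above can be approximated by a breadth-first walk over the acquaintance graph, restricted to the Query Scope, with every step reported to the GM. This is a sequential simplification of what would be a concurrent protocol, and result propagation is left out. The function `propagate`, the report format, and the acquaintance edges are hypothetical; only the QS (2, 4, 6, 8, 9, 11) and the first hop from node 1 to nodes 2 and 4 come from the slide.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Report = Tuple[int, str, Tuple[int, ...]]  # (reporting node, message, nodes reached)

def propagate(origin: int, acquaintances: Dict[int, List[int]],
              query_scope: Set[int]) -> Tuple[Set[int], List[Report]]:
    """Propagate a query from origin; return the reached nodes and the reports sent to the GM."""
    reached = {origin}
    gm_log: List[Report] = []
    frontier = deque([origin])
    while frontier:
        node = frontier.popleft()
        # Forward only to acquaintances inside the Query Scope that were not reached yet.
        targets = [n for n in acquaintances.get(node, [])
                   if n in query_scope and n not in reached]
        if targets:
            reached.update(targets)
            frontier.extend(targets)
            gm_log.append((node, "reached", tuple(targets)))
        else:
            # Boundary node: nothing left to forward to.
            gm_log.append((node, "no more propagation", ()))
    return reached - {origin}, gm_log

# Hypothetical acquaintance edges; node 3 is outside the Query Scope.
acquaintances = {1: [2, 3, 4], 2: [6], 4: [8], 6: [9]}
query_scope = {2, 4, 6, 8, 9, 11}

reached, gm_log = propagate(1, acquaintances, query_scope)
```

With these edges node 11 belongs to the Query Scope but is never reached, because no acquaintance chain leads to it. The loop ends once every boundary node has reported “no more propagation”.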
THE RUNNING EXAMPLE
“Toy” Databases
Recall the Motivating Example:
Family Doctor DB F: Prescription (PatID, P_Name, Illness_Desc, StartDate, RecoveryDate, Treatment, Type, Prescriptions);
Hospital DB H: Patients (PID, Name, Disease, Treatment_Desc, In, Out);
Medical Office DB M: Accidents (P_id, FN, LN, Address_Reason, Treatment_Taken, Prescription_Given, Date)
John, who suffers the accident, is described in H with ID “P12”, in F as “8”, and, when he turns to M, he is assigned ID “A13”
Query Example
Let's suppose query QM is posed to M:
Select FN, LN, Address_Reason, Treatment_Taken, Prescription_Given, Date
From “M:Accidents”
Where (Address_Reason Like ‘%Fracture%’ Or Address_Reason Like ‘%Dislocation%’) And P_id = ‘A13’
with the indication that QM is a global query with topic T = “Medical Care in Canada”
After some search T is matched with the topic “Medical Care in Toronto” of the interest group G
Group G
H is acquainted with F and P is acquainted with F; dashed lines are group metadata channels; H is GM of G
GM computes query scope QS = G = {F, H, P} for query QM
M gets acquainted with H
M: Accidents and H: Patients are matched
As a result, a set of Coordination Rules is generated
Examples of Coordination Rules
Coor # 1
Event: M:Q
Condition: Q:(Address_Reason Select OR Treatment_Taken Select) AND (P_id = ‘A13’ Where)
Action: Q = Apply (Q, Corr_Rules_Query)
Send (Q, H)
Coor # 2
Event: M:RH
Condition: None
Action: RM = Apply (RH, Corr_Rules_Results)
Where Corr_Rules_Query and Corr_Rules_Results are correspondence rules which translate the outgoing query and the incoming results
Query Propagation
P is not reachable because there is no acquaintance path from M to P
In the graph the following queries are circulating:
QH = Select Name, Disease, Treatment_Desc
From “H:Patients”
Where (Disease Like ‘%Fracture%’ Or Disease Like ‘%Dislocation%’) And PID = ‘P12’
QF = Select P_Name, Illness_Desc, Treatment
From “F:Prescription”
Where (Illness_Desc Like ‘%Fracture%’ Or Illness_Desc Like ‘%Dislocation%’) And PatID = ‘8’
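To make the two circulating queries concrete, the sketch below runs them as standard SQL against two in-memory SQLite databases standing in for peers H and F. Modelling each peer as a separate SQLite connection is an assumption of this example. The single row in each table is taken from the results reported for H and F in this running example, and unknown fields are left empty.

```python
import sqlite3

# Peer H: hospital database ("In" and "Out" are quoted because IN is an SQL keyword).
db_h = sqlite3.connect(":memory:")
db_h.execute('CREATE TABLE Patients (PID TEXT, Name TEXT, Disease TEXT, '
             'Treatment_Desc TEXT, "In" TEXT, "Out" TEXT)')
db_h.execute("INSERT INTO Patients (PID, Name, Disease, Treatment_Desc) "
             "VALUES ('P12', 'John', 'Forearm dislocation', 'Bandage')")

# Peer F: family doctor database.
db_f = sqlite3.connect(":memory:")
db_f.execute("CREATE TABLE Prescription (PatID TEXT, P_Name TEXT, Illness_Desc TEXT, "
             "StartDate TEXT, RecoveryDate TEXT, Treatment TEXT, Type TEXT, Prescriptions TEXT)")
db_f.execute("INSERT INTO Prescription (PatID, P_Name, Illness_Desc, Treatment) "
             "VALUES ('8', 'John', 'Leg fracture', 'Leg put in plaster')")

# The LIKE condition is spelled out once per pattern, as standard SQL requires.
Q_H = """SELECT Name, Disease, Treatment_Desc FROM Patients
         WHERE (Disease LIKE '%Fracture%' OR Disease LIKE '%Dislocation%')
           AND PID = 'P12'"""
Q_F = """SELECT P_Name, Illness_Desc, Treatment FROM Prescription
         WHERE (Illness_Desc LIKE '%Fracture%' OR Illness_Desc LIKE '%Dislocation%')
           AND PatID = '8'"""

res_h = db_h.execute(Q_H).fetchall()
res_f = db_f.execute(Q_F).fetchall()
```

SQLite's `LIKE` is case-insensitive for ASCII text by default, which is why `'%Fracture%'` matches `'Leg fracture'`; other database systems may need an explicit case-insensitive comparison.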
Results Propagation and Reconciliation
H and F generate the following results:
ResH = <‘John’, ‘Forearm dislocation’, ‘Bandage’>
ResF = <‘John’, ‘Leg fracture’, ‘Leg put in plaster’>
When they reach M, the results are reconciled as follows:
ResM =
<‘John’, ‘Forearm dislocation’, ‘Bandage’>
<‘John’, ‘Leg fracture’, ‘Leg put in plaster’>
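In this example reconciliation amounts to collecting the tuples that arrive from the acquaintances. A minimal sketch of that step follows. The function `reconcile` is a hypothetical name, and dropping exact duplicates is an added assumption that the slide does not state.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, ...]

def reconcile(*result_sets: Iterable[Row]) -> List[Row]:
    """Merge result tuples in arrival order, skipping exact duplicates (an assumption)."""
    seen = set()
    merged: List[Row] = []
    for result_set in result_sets:
        for row in result_set:
            if row not in seen:
                seen.add(row)
                merged.append(row)
    return merged

res_h = [("John", "Forearm dislocation", "Bandage")]
res_f = [("John", "Leg fracture", "Leg put in plaster")]

res_m = reconcile(res_h, res_f)
res_m_without_f = reconcile(res_h)  # what M would see if F did not answer
```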
Variance and Good Enough Answers
Good Enough answers
ResM is incomplete: some fields from H: Patients and F: Prescription are missing
Nevertheless the results are good enough because they still serve the needs of M
Network Variance
If F is down, the results are even more incomplete
Database Variance
If M gets acquainted with F instead of H, only ResF is retrievable. F has a different “vision” of the world, as it is not acquainted with H
Query Variance
If, in QM, John's ID is replaced by the ID of another patient who is not shared, then no Coordination Rules apply and therefore there is no propagation
Conclusion
First investigation of how to make databases interact in a P2P network. There are four main dimensions:
We must integrate data coming from autonomous, most often semantically heterogeneous, databases;
We must deal with network, database, and query variance. This is why we talk of data coordination, as distinct from data integration;
We will almost never get correct and complete answers. We must be content with answers which are good enough;
There is a need to tune metadata. This is required in order to cope with the dynamics of a P2P network.
References
Project website: http://www.dit.unitn.it/~p2p/
P. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini, and I. Zaihrayeu, “Data Management for Peer-to-Peer Computing: A Vision”, WebDB 2002
L. Serafini, F. Giunchiglia, J. Mylopoulos, and P. Bernstein, “The Local Relational Model: Model and Proof Theory”, tech. rep., IRST, Trento