Upload
elvin-jackson
View
213
Download
0
Embed Size (px)
Citation preview
Piazza: Data Management Infrastructurefor the Semantic Web
Zachary G. IvesUniversity of Pennsylvania
CIS 700 – Internet-Scale Distributed Computing
February 3, 2004
Joint work with Alon Halevy, Peter Mork, Dan Suciu,Igor Tatarinov, University of Washington
2
The Big Question in P2P
Why use a P2P system vs. a centralized one? PRO: P2P offers greater flexibility and resource
utilization CON: P2P often sacrifices reliability guarantees,
accountability, and sometimes even performance
There are a few simple cases where P2P wins: Avoiding the law/RIAA/MPAA: copying music, videos,
etc. Anonymity (FreeNet, etc.) Exploiting idle cycles
But are there applications that are inherently P2P?
3
Most P2P Work is “Bottom-up”
The basis of P2P: Algorithms/data structures papers Chord, CAN, Pastry Focus on providing a robust DHT – not what to do with it
Several systems build functionality over the DHT: Tang et al. information retrieval paper
Maps LSI space into CAN multidimensional space Interesting but uncertain benefits
Berkeley PIER DB query engine: uses distributed hash table to do distributed
joins Sophia
Prolog rules in a distributed environment for network monitoring
None of these apps are inherently (or perhaps even best) based on P2P architectures
4
Thinking Top-Down
Find an application that has needs matching the properties of P2P: No central authority (and no logical owner of
central server) Loose, relatively ad hoc membership Capabilities of a system grow as new members
join Participants are generally cooperative
5
One Possible Answer: Data Integration/Interchange Applications
Multiple parties have proprietary data + sources Not willing to relinquish control or change their
data representation, but are willing to share Examples:
UPenn hospital system is looking to modernize information sharing among departments (trauma, neurology, etc.)
Many bioinformatics warehouses (e.g., Penn’s GUS, NCBI’s GeneBank) have related info they would like to share
The W3C’s vision of the “Semantic Web”: a web where all pages are annotated with meaning, meanings are well-defined, and complex questions can be answered
6
The “Old” Model: Centralization
Get all parties to hash out a standard, global schema or ontology Different classes of objects to be represented Constraints + relationships between them
Relate all of the data sources to that schema Relationships are specified as named queries –
views Efficient techniques exist for using these views
to answer future queries posed over the mediated schema
7
Data Integration System / Mediator
Centralized Data Integration Architecture
Mediated Schema
Wrapper Wrapper Wrapper
SourceData
Query-basedSchema Mappingsin Catalog
SourceCatalog
Query Results
8
Centralization Doesn’t Scale
Difficult to arrive at one standard schema … When we do, it’s slow to evolve to new needs This is a human factor, but it is also a scalability issue
Hard to leverage mappings well: If we map source A mediated schema, does this
help us map source B, even if source B is “almost” like source A?
Can we prevent mappings from “breaking” when we update the central schema?
Users often prefer familiar schema, not central one More schemas more users forced to change
schemas
9
The Piazza System: Infrastructure for Relating & Querying Structured Data
Recasts data integration as a decentralized confederation of peers and mappings
Our initial focus is on the logical aspects:1. Mediating between different types of XML-
encoded data Based on extensions of formalisms & techniques from
data integration Schemas are related via directional pairwise
mappings
2. Making maximal use of a limited number of mappings Translates queries over transitive closure of mappings Uses mappings “in reverse”
10
Mediated Query Answering in the Piazza System
UW Stanford
DBLP
Oxford
Leipzig
CiteSeer
Penn
Q
Q’
Q’Q’’
Q’’
Q’’
Q’’
Mappings typicallydirectional, pairwise
11
Data in Piazza
Each participant may have its own schema + data Unordered XML, with pre-
specified schemas
In general, we’ll identify it with XPath expressions: Similar syntax to Unix
paths, but over threes e.g.,
/rootelement/subelement/*
Root
?xml db
book
mdate key
author title year
pub
Brown92
Kurt Brown
PRPL…
1992
MKP
2002…
12
Mappings in Data Integration
Express value or class equivalence:DollarCost = EuroToDollar(EuroCost)“ID#0123456” “Catalog#98324”S1/book/author = S2/author
Also containment:S2/book S3/publication
Ability to use value(s) as IDsCollect all entries related to the ID into one object
Convert between edge labels, values<authors><book>1</book></authors>
<action type=“authorship”><object>book</object></action>
Concatenation: S2/author/fullname = S3/author/first + S3/author/last
13
Piazza’s Mapping Language
Goals: Build on XQuery and XML Remain computationally inexpensive Capture the common mapping types
Directional XML mapping language based on templates
<output> {: $var IN document(“doc”)/path WHERE condition :}
<tag>$var</tag></output>
Translates between parts of data instances Restricted subset of XQuery that’s decidable to reason
about Supports special annotations and object fusion
14
Mapping Example between XML Schemas
Target:pubs
book* title
author*
name
Source:authors
author* full-
name publication*
title pub-type
pub-type name
publication authorwrittenBy
title
15
Example Piazza Mapping
<pubs><book piazza:id={$t}>{: $a IN document(“…”)/authors/author, $an IN $a/full-name, $t IN $a/publication/title, $typ IN $a/publication/pub-type WHERE $typ = “book” PROPERTY $t >= ‘A’ AND $t < ‘B’ :}
<title>{$t}</title>
<author><name>{$an}</name></author></book>
</pubs>
16
Query Answering in Piazza
Given an XQuery over a schema, iteratively expand and translate it to capture neighbors at distance i Requires sophisticated reasoning to avoid cycles,
redundant expansions See paper for details
How does this work? Mapping defines constraints on pairs of source & target
instancesConstrains possible pairs of matched interpretations
Easy to use mapping in “forward direction”: query composition with a view (or chain of views)
Also have algorithms to rewrite query over source in terms of target
Need to invert mapping and compose that with query Answer set is defined by “certain” answers May lose some information in inversion
17
Piazza Is One of Several Similar Efforts
Peer-to-peer databases: PIER, PeerDB, Hyperion, [Bernstein et al. WebDB02], [Aberer et al. WWW03]
RDF engines and mediators for the Semantic Web: EDUTELLA, Sesame
Makes use of semi-automated mapping construction techniques from the database/machine learning communities: Clio, LSD, GLUE, Cupid, many others
18
Summary: Infrastructure for Decentralized Mediation
Powerful XML mappings and transformations Extensible, scalable architecture, thanks to
sophisticated reasoning techniques for mappings
The model itself is peer-to-peer at a logical level – functionality that is best suited to a P2P architecture
19
Where from Here? Ongoing Work
Piazza effort at U. Wash. continues to focus on problems relating to mappings
Orchestra at Penn follows up with a focus on two questions: What does a true DHT-based P2P integration system
look like? Covers a variety of query processing stages, including
mapping reformulation and query optimization, not just execution (as in PIER)
Where should we materialize or replicate data The “data placement” problem
What issues arise when we want to consider updates and synchronization at web-scale?
20
Data Management and P2P
We’ve now seen a number of approaches Information retrieval Network monitoring Query execution Decentralized data integration
Common themes: Declarative query languages separate logical + physical
levels Large amount of data with semantic info, distributed in many
sites
Which ideas hold the most promise? Is data management well-suited to P2P and DHTs?
Does data management need P2P?
22
Challenges with Mappings
Information may be lost in one direction of a mapping: Name := concat(FirstName, LastName) Faculty := Professors Lecturers
Correspondences may be hard to specify precisely: Bug ≈ Insect
Data may be dirty or incomplete Exact mappings may be computationally
expensive
23
RDF vs. XML
RDF explicitly names relationships:(book, title, “ABC”)(book, writtenBy, author)(author, name, “John Smith”)
XML does not always:1. <book>
<title>ABC</title> <writtenBy> <author><name>John Smith</name></author> </writtenBy></book>
2. <book> <title>ABC</title> <author>John Smith</author></book>
title name
book authorwrittenBy
24
RDF vs. XML 2
RDF is subject-neutral (a graph) XML centers around a subject (a tree):
1. <book> <title>ABC</title> <author>John Smith</author></book>
2. <author> <name>John Smith</name> <book>ABC</book></book>
This may result in duplication of contained objects
25
Mapping XML to OWL
We can map from XML to XML; thus we can go from XML to an XML serialization of RDF
Caveat: this doesn’t give us the full power of the KR-based Semantic Web! We can only create OWL individuals that can be
expressed in an XQuery-style view definition To go any further, we may need to supplement
these with additional OWL class definitions But it gets us 80% there and makes the rest much
easier – and it supplies mapping capabilities missing from OWL itself
26
Implementing the Semantic Web
Early emphasis on languages, tools for one (or a few) ontologies Very powerful solutions in OWL and tools! Initial assumption: data will have to be created in RDF
Important problems remain: sharing at scale and legacy data1. Global representations/ontologies hard to agree on!
Not just due to preference: different representations better suited to certain usage models – differences are inevitable
Need infrastructure that allows users to choose & query in their ontology, get results from all related (mapped) data
2. Must be able to import relevant structured data Most data is in existing, non-RDF formats
(XML, relations, legacy sources, etc.)
27
Impossible to Capture & NormalizeAll Semantics (1/2)
Even RDF/OWL regularity can’t enforce a single conceptual model:
May use different names for same items May use different levels of granularity:
book vs. publication Metadata + data may be interchanged:
(Car4, hasWheel, Wheel1) vs. (Car5, contains, Obj2), (Obj2, hasPurpose, wheel)
28
Impossible to Capture & NormalizeAll Semantics (2/2)
Even collections may be described differently:1.(Person, eatsForBreakfast, Meal1)
(Person, eatsForLunch, Meal2)(Person, eatsForDinner, Meal3)
2.(Person, eatsMeals, TodaysMeals)(TodaysMeals, breakfast, Meal1)(TodaysMeals, lunch, Meal2)(TodaysMeals, dinner, Meal3)
3.(Person, eatsMeals, list of Meal)(list of Meal := {Meal1, Meal2, Meal3})
29
Observations
Even formalisms like RDF, OWL capture only a part of the semantics Still need some interpretation (This shouldn’t be surprising, but it’s important!)
Very hard to get many contributors to agree on the same representation or ontology
Simple equivalences (owl:equivalentProperty, owl:equivalentClass) aren’t enough to map between different ontologies
Need infrastructure for relating data in many different representations, at different levels of granularity! This is the core strength of database techniques
30
Benefits of Piazza’s DB Heritage
Terabytes of existing data that’s in XML (or easily translatable to XML) Hierarchical and relational data, spreadsheets, Java
objects, … XML files, RDF itself!
Sophisticated reasoning about mappings is possible by extending existing data integration work Achieves schema/concept mapping at different
granularities Chaining of mappings, using mappings in reverse
direction, …Can map between data in different structures
(including RDF serializations, XML)
31
Key Problem: Coordinating Efforts between Collaborators
Today, to collaboratively edit structured data, we centralize
For many applications, this isn’t a good model, e.g.: Bioinformatics groups have multiple standard schemas
and warehouses for genomic information – each group wants to incorporate the info of the others, but have it in their format, with their own unique information preserved, and the ability to override info from elsewhere
Different neuroscientists have may data from measuring electrical activity in the same part of the brain – they may want to share common information but maintain their specific local information; each scientist wants the ability to control when their updates are propagated
Work-in-progress with Nitin Khandelwal; other contributors: Murat Cakir, Charuta Joshi, Ivan Terziev
32
The Orchestra System: Infrastructure for Collaborative Data Sharing
Each participant is a logical peer, with some XML schema that is mapped to at least one other peer’s schema
Schemas’ contents are logically synchronized initially and then on demand
Part1
Part2
Part3
mappings between XML schemas
mappings
Translatedupdates from 3:+ XML tree A’- XML tree B’
Updates:+ XML tree A- XML tree B
Translatedupdates from 3:+ XML tree A’’- XML tree B’’
Schema 2
Schema 3Schema 1
33
Some Challenges in Orchestra
Mappings How to express them Using them to translate updates, queries
Inconsistency How to represent conflicts How to resolve them
Update propagation Consistency with intermittent connectivity
Scaling To many updates To many queries
Logical &semantics-level
Implementation-level (P2P-based)
34
Mappings
Some peers may be replicas Others need mappings, expressed as “views”
Views: functions from one schema to another Can be inverted (may lose some information) Can be “chained” when there is no direct connection (Much research in generating these automatically [DDH00]
[MB01], …) Prior work on propagating updates through relational views
[BD82][K85][C+96]… Ensuring the mapping specifies a deterministic, side-effect-
free translation Algorithmically applying the translation
Ongoing work with Nitin Khandelwal: Extending the model to handle (unordered) XML Challenge: dealing with XML’s nesting and its repercussions
35
A Globally Consistent Model that Encodes Conflicts
Even in the presence of conflicts, want a “global state” (from perspective of some schema) when we synchronize Allows us to determine what’s agreed-upon, what’s conflicting Can define conflict resolution strategies
Goal: “union of all states” with a way of specifying conflicts Define conditional XML tree based on a subset of c-tables [IM84] Each peer pi has a boolean flag Pi representing “perspective i”
root
auth auth
Smith Lee
If P1If P2
36
Propagating Updates with Intermittent Connectivity
How to synchronize among n peers (even assuming the same schema)? Not all are connected simultaneously Usual approaches:
Locking (doesn’t scale) Epidemic algorithms (only eventually consistent)
Approach: “Shadow instance” of the schema,
replicated within the other peers of the network Everyone syncs with the shadow instance Benefits: state is deterministic after each sync
37
Scaling, Using P2P Techniques
Update synchronization Key problem: find values conflicting with “shadow instance” Partition the “shadow instance” across the network
Query execution Partition computation across multiple peers (PIER does this)
Query optimization Optimization breaks the query into sub-problems, uses
dynamic programming to build up estimates of the costs of applying operators
Can recast as recursion + memoization Use P2P overlay to distribute each recursive step Memoize results at every node
Why is this useful? Suppose 2 peers ask the same query!