View
218
Download
0
Tags:
Embed Size (px)
Citation preview
Extracting Schema From Data
• The difference between schemas for semistructured data and traditional schemas is that a given semistructured data can have more than one schema .
• Given a semistructured data, compute automatically some schema for it, given several possible answers, we want the schema that best describes the structure of that particular data.This is called Schema Extraction.
•Schema Extraction for schema graphs
•Schema Extraction for Datalog Typings
Data Guides
Our goal is to construct a new OEM graph that is a finite description of the list of paths. This is called Data Guide. The two properties to be fulfilled:
•Accurate : Every path in the data occurs in the data guide, and every path in the data guide occurs in the data.
•Concise : Every path occurs exactly once.
&r
&p1 &p2 &p3 &p6 &p7 &p7&p4 &p5
&c
employee
employee
employee
company
worksfor worksfor
worksfor
name
namename name
namename name
name
name
“Widget Trenton”
managesmanages
manages manages manages
Managedby
Managedby
Managedby Managedby
Managedbyposition
positionpositionposition
phone
“Jones” “6666”“Smith”
“Joe”“Marketing” “Dupont”
“Gaston”“Salse” “Gonnet” “Jack” “IT”
“IT”
“Fred”
Figure7.13 An Example of OEM data
We proceed as follows: The Data Guide will have a root node, call it Root. Next we examine one by one each path in the list and add new nodes to the data guide, as needed:
•employee
•employee.name
•employee.manages
•employee.manages.managedby
•employee.manages.managedby.manages
•employee.manages.managedby.manages.managedby
•company
Root&r
Employees&p1,&p2,&p3,&p4&p5,&p6,&p7,&p8
Boss&p1,&p4,&p6
Regular&p2,&p3,&p5
&p7,&p8
Company&c
employee
company
worksfor
managedby
manages
name
name
name
nameworksfor
worksfor
manages
managedby
phone
phone
position
position
A Data Guide
Root
Emp
Comp
employee
company
name
name
worksfor
phoneposition
managedby
manages
Schema graph
Simulation between a data graph and a data guide
Node in data graph Node in data guide&r Root&p1, &p2, &p3, &p4, &p5, Employee&p6, &p7, &p8 &p1, &p4, &p6 Boss&p2, &p3, &p5, &p7, &p8 Regular&c Company
Simulation from the data guide to the schema graph
Node in data guide Node in schema graphRoot RootEmployee EmpBoss EmpRegular EmpCompany Comp
This construction of the data guide resembles the technique to transform a nondeterministic finite state automaton into a deterministic one .
The data guide is the most specific schema graph for that data with the following features:
•The data guide is a deterministic schema graph.
•Any other deterministic schema graph to which our data conforms subsumes the data guide.
Root&r
Regular &p2,&p3,&p5 &p7,&p8
Boss &p1,
&p4,&p6
manages
employee
managedby
employee
name
Comp &c
company
worksfor
namephone
worksfor
A nondeterministic schema
Extracting Datalog rules from data
We have a semistructured data instance and want to extract automatically the most specific typing given by a set of Datalog rules.
We create one predicate for each complex value object in the data. We create the following predicates:
pred_r, pred_c, pred_p1, pred_p2, pred_p3, pred_p4, pred_p5, pred_p6, pred_p7, pred_p8
corresponding to the objects &r, &c, &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8 .
Next we write a set of Datalog rules defining each predicate based exactly on the outgoing edges of its corresponding object:
pred_r(X) :- ref(X, company, Y), pred_c(Y),
ref(X, employee, Z1), pred_p1(Z1), ……
ref(X, employee, Z8), pred_p8(Z8)
pred_c(X) :- ref(X, name, N), string(N)
pred_p1(X) :- ref(X, worksfor, Y), pred_c(Y), ref(X,
name, N) , string(N), ref(X, phone, P),
string(P), ref(X, manages, Z), pred_p2(Z),
ref(X, manages, U) , pred_p3(U)
pred_p2(X) :- ref(X , worksfor, Y) , pred_c(Y),
ref(X, name, N) , string(N),
ref(X, manageby, Z), pred_p1(Z)
pred_p3(X) :- …..
……
We have to compute the largest fixpoint of the Datalog program on the given data.
Object Predicate
&r&c, &p1, &p2, &p3, &p4,&p5, &p6, &p7, &p8 &p1 &p2, &p3, &p5, &p7, &p8 &p3, &p5, &p7, &p8 &p1, &p4, &p6 &p3, &p5, &p7, &p8 &p1, &p4, &p6 &p3, &p5, &p7, &p8 &p3, &p5, &p7, &p8
pred_rpred_c
pred_p1pred_p2pred_p3pred_p4pred_p5pred_p6pred_p7pred_p8
Extents of predicates after one iteration
Object Predicate
&r&c, &p1, &p2, &p3, &p4,&p5, &p6, &p7, &p8 &p1 &p2, &p3 &p3 &p1, &p4, &p6 &p3, &p5, &p7, &p8 &p1, &p4, &p6 &p3, &p5, &p7, &p8 &p3, &p5, &p7, &p8
pred_rpred_c
pred_p1pred_p2pred_p3pred_p4pred_p5pred_p6pred_p7pred_p8
Extents of predicates after two iterations
We obtain the following Datalog rules:Root(X) :- ref(X, company, Y), Company(Y), ref(X, employee, Z1), Boss1(Z1), ref(X, employee, Z2), Boss2(Z2), ref(X, employee, U1), Regular1(U1),…….., ref(X, employee, U3), Regular3(U3)Company :- ref(X, name, N), string(N)Boss1(X) :- ref(X, worksfor, Y), Company(Y), ref(X, name, N), string(N), ref(X, phone, P), string(P), ref(X, manages,Z), Regular1(Z), ref(X, manages, U), Regular2(U) Boss2(X) :- ref(X, worksfor, Y), Company(Y), ref(X, name, N), string(N), ref(X, phone, P), string(P), ref(X, manages,Z), Regular3(Z)Regular1(X) :- ref(X, worksfor, Y), Company(Y), ref(X, name, N), string(N), ref(X, managedby, Z), Boss1(Z)Regular2(X) :- ref(X, worksfor, Y), Company(Y), ref(X, name, N), string(N), ref(X, position, P), string(P), ref(X, managedby, Z), Boss1(Z)Regular3(X) :- ref(X, worksfor, Y), Company(Y), ref(X, name, N), string(N), ref(X, position, P), string(P), ref(X, managedby, Z), Boss2(Z)
Inferring Schemas From Queries
Some semistructured data instances are the result of queries.
Query Result Schema Inferring
where bib -> L -> X, X - > “author” -> A, X -> “title” -> T, X -> “year” -> Y
create Root( ), HomePage(A), YearEntry(A,Y), PageEntry(X)
link Root() -> “person” -> HomePage(A), Homepage(A) -> “year” -> YearEntry(A,Y) YearEntry(A,Y) -> “paper” -> PaperEntry(X) PaperEntry(X) -> “title” -> T, PaperEntry(X) -> “author” -> HomePage(A), PaperEntry(X) -> “year” -> Y
The following query takes a bibliography file and constructs a homepage for every author:
Root
Homepage (“smith”)
Homepage (“Jones”)
YearEntry (“smith”,1995) YearEntry
(“smith”,1997)
YearEntry (“smith”,1997)
PaperEntry (o423)
PaperEntry (o552)
PaperEntry (o153)
personperson
authorauthor
author
author
title year title year title year
yearyear
year
paperpaper
paper
paper
Root
HomePage
YearEntry
PaperEntry
person
year
paper
title year
author
Schema graph inferred from the query
The schema will have one class for each function , and one edge for each line in the link clause.
where create Root( ), F(X), F(Y), G(X), H(Y)link Root( ) -> “A” -> F(X), F(X) -> “C” -> G(X),
Root( ) -> “B” -> F(Y), F(X) -> “D” -> H(Y)
For the following example:
We reach the following schema:
Root: {A : F, B : F}
F : {C : G, D : H}
Path Constraints
In Relational Databases
•in RDB, the relational declaration tell us more than the types•imposes a key constraint so that no two tuples have the same key
Example
Create table Employees ( Emp Id: integer, EmpName: char(30), DeptId: integer, … primary key(EmpId), foreign key(DeptId) references Departments )
Create table Departments( DeptID: integer, Dname: char(10), …… primary key(DeptId) )
In Object-Oriented Databases
Interface Publicationextent publication{ attribute String title; attribute Date date;
relationship set<Author> auth --->inclusion constraintsinverse Author::pub; --->inverse relationship
}Interface Author
extent author{ attribute String title; attribute String address;
relationship set<Publication> pub --->inclusion constraintsinverse Publication::auth; --->inverse relationship
}
Inclusion constrainsts:
•For any publication p, the set p.auth is a subset of the set author. Similarly, for any author a, the set a.pub is a subset of publication.
Inverse relationships:
•For any publication p, and for any author a in p.auth, p is a member of a.pub .
•For any author a, and for any publication p in a.pub, a is a member of p.auth .
publication publication authorauthor
auth auth auth
pubpubpub
title title datedate name name addressaddress
... ... ... ... ... ... ... ...
r
Illustration of path constraints on semistructured data
In semistructured data
•inclusion constraint is expressed as follows p (a (author(r,a) pub(a,p)) -> publication(r,p))
The general form of an inclusion constraint is x ((r,x)) -> (r,x))
• inverse relationship is p ( publication(r,p) -> a(auth(p,a) -> pub(a,p)))
The general form of this constraint is x ((r,x)) -> y((x,y)-> (y,x)))
Constraints are also important in Query Optimization. Here is an example:
Select row: P2from r.publication P1, r.publication P2, P1.auth Awhere “Database Systems” in P1.title and A in P2.auth
Select row: P’from r.publication P, P.auth A, A.pub P’where “Database Systems” in P.title
The query plan implicit in the first one requires two iterations over publication - with P1,P2 - whereas the second requests only one iteration - with P .