Unity Demonstration
Dr. Ramon Lawrence
University of Iowa
ramon-lawrence@uiowa.edu
Page 2
Outline
- Motivation and Background
- Two basic integration approaches:
  - global as view (GAV)
  - local as view (LAV)
- What is the open problem?
- How Unity is different
- Using Unity example
- Benefits and Contributions
- Future Work
Page 3
Motivation
- There are many integration environments:
  - Operational systems within an organization
  - System integration during a company merger
  - Data warehouses, intranets, and the WWW
- Users require information from many data sources, which often do not work together.
Page 4
What is Integration?
- Two levels of integration:
  - Schema integration - the description of the data
  - Data integration - the individual data instances
- Integration handles the different mechanisms for storing data (structural conflicts), for referencing data (naming conflicts), and for attributing meaning to the data (semantic conflicts).
Page 5
Two Current Approaches
- The current state-of-the-art integration systems can all be reduced to a logical basis.
- For this demo, assume the data is physically stored in the relational model and queried using Datalog.
- There are two basic "database" approaches to integration:
  - global as view approach - the extraction and integration of data is defined simultaneously with the global view definition
    - Example: TSIMMIS, using the Object Exchange Model (OEM)
  - local as view approach - pre-defines the global view and then defines what portion of the global view each local source provides
    - Example: Information Manifold, using description logic
BodyWorks Systems
[Figure: BodyWorks Systems. Labels: Customer, Web Server, Custom Accounting Package, Shipment Tracking Software, Order Database, Invoice Database, Shipment Database]
Question: Who has a complete picture of a customer's order, or of the entire customer relationship?
Answer: No one, but management wants to know...
Data Warehouse Approach
[Figure: the Order, Invoice, and Shipment databases each pass through Gather, Refine, Aggregate, and Store steps into a single Warehouse]
Features:
- static, materialized view
- performs data cleansing and aggregation
- historical more than operational
Query-Driven Dynamic Approach
[Figure: a mediator accesses the three databases through one wrapper per database]
Order Database:
  Cust(id,name,addr,city,state,cty)
  Order(oid,cid,odate)
  OrdProd(oid,pid,amt,pr)
  Prod(id,name,pr,desc)
Invoice Database:
  Cust(id,name,addr,city,state,cty)
  Invoice(invId,custId,shipId,iDate)
  InvProd(invId,prodId,amt,pr)
  Prod(id,name,pr,desc)
Shipment Database:
  Cust(id,name,addr,city,state,cty)
  Shipment(shipid,oid,cid,shipdate)
  ShipProd(shipid,prodid,amt)
  Prod(id,name,pr,desc,inv)
Features:
- view dynamically built
- data is extracted at query-time
- still typically read-only
Page 11
Global as View Approach
- Define global objects by specifying how to extract their information from the local sources.
- Requires that the administrator defining the global view understand the semantics of every local data source.
- Further, if the local views or global views must be changed for any reason (such as adding a new data source), the global view must be re-compiled.
Page 12
Global as View Example
TSIMMIS MSL example extracting customer info:
  <f(I) customer {<id I> <name N> <addr A>}>@med :- <customer {<id I> <name N> <addr A>}>@invoiceDB
  <f(I) customer {<id I> <name N> <addr A>}>@med :- <customer {<id I> <name N> <addr A>}>@orderDB
  <f(I) customer {<id I> <name N> <addr A>}>@med :- <customer {<id I> <name N> <addr A>}>@shipmentDB
Equivalent SQL: union the results of the following 3 queries (matching ids if possible):
  orderDB:    SELECT * FROM customer
  invoiceDB:  SELECT * FROM customer
  shipmentDB: SELECT * FROM customer
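As a rough illustration of what the three mediator rules compute, the Python sketch below unions customer records from three sources and fuses records that share an id, which is the role of f(I) in the rule heads. The sample rows and the `fuse_customers` helper are hypothetical; TSIMMIS evaluates MSL rules, not Python.

```python
from typing import Dict, Iterable, Optional, Tuple

Customer = Tuple[int, str, Optional[str]]  # (id, name, addr)


def fuse_customers(*sources: Iterable[Customer]) -> Dict[int, dict]:
    """Union customer records from all sources, merging records with the same id."""
    fused: Dict[int, dict] = {}
    for source in sources:
        for cid, name, addr in source:
            rec = fused.setdefault(cid, {"id": cid, "name": None, "addr": None})
            # Keep the first non-missing value seen for each attribute.
            if rec["name"] is None:
                rec["name"] = name
            if rec["addr"] is None:
                rec["addr"] = addr
    return fused


# Hypothetical contents of the three local customer relations.
order_db = [(1, "Ann", None), (2, "Bob", None)]
invoice_db = [(2, "Bob", None), (3, "Cy", None)]
shipment_db = [(1, "Ann", "12 Elm St")]

customers = fuse_customers(order_db, invoice_db, shipment_db)
```

Keeping the first non-missing value is only one possible conflict policy; a real mediator would also have to decide what to do when two sources disagree.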
Page 13
Global as View Example (2)
Extract all orders with invoices and shipments:
  <shipInvOrd {<shipment S> <invoice I> <order O>}>@med :-
      <shipment {<shipid S> <oid O>}>@shipmentDB AND
      <order {<oid O>}>@orderDB AND
      <invoice {<invId I> <shipId S>}>@invoiceDB
Equivalent SQL (if it is possible to query multiple databases):
  SELECT shipment.shipid, invoice.invId, order.oid
  FROM shipment, invoice, order
  WHERE shipment.shipid = invoice.shipId AND
        shipment.oid = order.oid
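The cross-database join can be tried out with a minimal sketch that assumes SQLite as a stand-in for the three sources. Each source is attached as a separate in-memory database; the sample rows are hypothetical, and `order` must be quoted because it is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach one separate in-memory database per (hypothetical) local source.
for name in ("shipmentDB", "orderDB", "invoiceDB"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {name}")

conn.execute("CREATE TABLE shipmentDB.shipment (shipid INTEGER, oid INTEGER)")
conn.execute('CREATE TABLE orderDB."order" (oid INTEGER)')
conn.execute("CREATE TABLE invoiceDB.invoice (invId INTEGER, shipId INTEGER)")

conn.executemany("INSERT INTO shipmentDB.shipment VALUES (?, ?)", [(10, 1), (11, 2)])
conn.executemany('INSERT INTO orderDB."order" VALUES (?)', [(1,), (2,), (3,)])
conn.executemany("INSERT INTO invoiceDB.invoice VALUES (?, ?)", [(100, 10)])

# Same join as the slide's SQL, with schema-qualified table names.
rows = conn.execute(
    """
    SELECT s.shipid, i.invId, o.oid
    FROM shipmentDB.shipment AS s, invoiceDB.invoice AS i, orderDB."order" AS o
    WHERE s.shipid = i.shipId AND s.oid = o.oid
    """
).fetchall()
```

Only shipment 10 has both an invoice and an order, so it is the only shipment that survives the join.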
Page 14
Local as View Approach
- Pre-define an integrated global view (GV) that encompasses the information present in all sources.
- For each local source, specify the local view as a subset of the information available in the GV.
- Building the GV is typically not discussed. However, the LAV approach makes it easier to add or remove sources, as the GV does not have to be updated.
- Query processing using the LAV approach is more difficult than with the GAV approach, as the system has to determine what information can be extracted from the views.
Page 15
Local as View Example
Consider this global customer relation in the GV:
  customer(id, name, addr)
Assume that the order, invoice, and shipment databases each contain a customer record only if the customer has an order, invoice, or shipment, respectively. Further, assume that only shipmentDB contains a customer address.
Local views of each source:
  orderView(C,N) :- customer(C,N,A)
  invoiceView(C,N) :- customer(C,N,A)
  shipView(C,N,A) :- customer(C,N,A)
Page 16
Local as View Example (2)
Let the user pose the following query:
  q(N) :- customer(I, N, A)
- Query asks for all customer names.
- Query processor must determine which views are relevant (in this case all of them).
Local queries on each source:
  q(N) :- orderView(C,N)
  q(N) :- invoiceView(C,N)
  q(N) :- shipView(C,N,A)
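The view-selection step can be sketched in Python under a strong simplifying assumption: each view is described only by the global customer attributes it exposes, and a view is relevant when it exposes every requested attribute. The `VIEWS` table, `relevant_views`, and `rewrite` are hypothetical names, and this is far simpler than the rewriting algorithms used in real LAV systems.

```python
from typing import Dict, List, Tuple

# Each local view, described by the global customer attributes it exposes.
VIEWS: Dict[str, Tuple[str, ...]] = {
    "orderView": ("id", "name"),
    "invoiceView": ("id", "name"),
    "shipView": ("id", "name", "addr"),
}

# Datalog variable used for each global attribute.
VAR = {"id": "C", "name": "N", "addr": "A"}


def relevant_views(requested: List[str]) -> List[str]:
    """Return the views that expose every requested attribute."""
    return [v for v, attrs in VIEWS.items() if set(requested) <= set(attrs)]


def rewrite(requested: List[str]) -> List[str]:
    """Build one Datalog-style local query per relevant view."""
    head = ",".join(VAR[a] for a in requested)
    return [
        f"q({head}) :- {view}({','.join(VAR[a] for a in VIEWS[view])})"
        for view in relevant_views(requested)
    ]


name_queries = rewrite(["name"])
addr_queries = rewrite(["addr"])
```

Asking for names selects all three views, while asking for addresses selects only `shipView`, matching the assumption that only shipmentDB stores addresses.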
Page 17
What is the open problem?
- The two approaches are both viable methods for solving data integration.
- However, the open problem is that neither approach performs schema integration - the construction of the global view itself.
  - GAV - GV constructed (schema integration performed) by global designer when specifying extraction rules
  - LAV - GV is pre-defined using some previous integration process (most likely manual in nature)
- Both methods rely on the concept of a global user to create the global schema.
Page 18
How Unity is Different
- Our integration architecture, called Unity, is different because it approaches the integration problem from a different perspective:
  How can we automate, or semi-automate, the construction of the global view by extracting information from the local data sources?
- Thus, the integration problem is tackled from a different set of starting assumptions:
  - Do not assume a pre-existing or manually created GV.
  - However, assume we have a dictionary and a language for describing schema and data element semantics.
  - Attempt to automatically build a GV from the source descriptions of each data source.
Page 19
The Unity Approach
Given a set of data sources and a dictionary and language to describe data semantics:
1) Semi-automatically extract and represent data source semantics in the language using the dictionary.
2) Automatically match concepts across data sources by using the dictionary to determine related concepts.
   - This process effectively builds the global level relations or objects initially assumed or created in other approaches.
   - However, since there is no manual intervention, the precision of global view construction is affected by inconsistencies in the descriptions of the data sources and matching concepts.
3) Automatically generate queries specified by the user using dictionary terms (not structures) and map the user's query to appropriate data elements in the local sources.
Page 20
Unity Overview
- Unity is a software package that implements the integration architecture with a GUI.
- Developed using Microsoft Visual C++ 6 and Microsoft Foundation Classes (MFC).
- Unity allows the user to:
  - Construct and modify standard dictionaries
  - Build X-Specs to describe data sources
  - Integrate X-Specs into an integrated view
  - Transparently query integrated systems using ODBC and automatically generate SQL transactions
Page 21
Unity Example Step #1 - Standard Dictionary
- A standard dictionary (SD) provides standardized terms to capture data semantics.
  - Hierarchy of terms related by IS-A or HAS-A links
  - Contains a base set of common database concepts, but new concepts can be added
- An SD term is a single, unambiguous semantic definition.
  - Several SD entries for a single English word are required if the word has multiple definitions.
- The top-level dictionary terms are those proposed by Sowa.
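One possible in-memory shape for such a dictionary is sketched below in Python. The `StandardDictionary` class, its methods, and every sample term (including the `Entity` root) are hypothetical; they are not Unity's data structures and not the top-level terms proposed by Sowa.

```python
from typing import Dict, List, Optional, Tuple


class StandardDictionary:
    """A tree of terms; each non-root term links to its parent by IS-A or HAS-A."""

    def __init__(self) -> None:
        # term -> (definition, parent term or None, link type or None)
        self._terms: Dict[str, Tuple[str, Optional[str], Optional[str]]] = {}

    def add(self, term: str, definition: str,
            parent: Optional[str] = None, link: Optional[str] = None) -> None:
        # One entry per meaning: a second meaning needs its own distinct term.
        if term in self._terms:
            raise ValueError(f"term {term!r} is already defined")
        if parent is not None:
            if parent not in self._terms:
                raise KeyError(f"unknown parent term {parent!r}")
            if link not in ("IS-A", "HAS-A"):
                raise ValueError("link must be 'IS-A' or 'HAS-A'")
        self._terms[term] = (definition, parent, link if parent is not None else None)

    def path_to_root(self, term: str) -> List[str]:
        """Return the term followed by its ancestors up to the root."""
        path = [term]
        parent = self._terms[term][1]
        while parent is not None:
            path.append(parent)
            parent = self._terms[parent][1]
        return path


sd = StandardDictionary()
sd.add("Entity", "hypothetical root term")
sd.add("Person", "a human being", parent="Entity", link="IS-A")
sd.add("Customer", "a person who buys goods", parent="Person", link="IS-A")
sd.add("Address", "a postal location", parent="Customer", link="HAS-A")
# The English word "order" has several meanings, so it gets several entries.
sd.add("Order (purchase)", "a request to buy goods", parent="Entity", link="IS-A")
sd.add("Order (sequence)", "an arrangement of items", parent="Entity", link="IS-A")
```

Rejecting a duplicate term is how this sketch enforces the rule that each SD term carries exactly one definition.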
Page 23
Unity Example Step #2 - Data Extraction
- For each data source, an X-Spec document is constructed that consists of:
  - field, table, key, and join information extracted from the ODBC source
  - assignment of semantic names for each field and table
- Semantic names combine dictionary terms to describe the semantics of schema elements.
    semantic name := [CT_Type] | [CT_Type] PN
    CT_Type := CT | CT {; CT} | CT {, CT}
    CT := context term, PN := property name
    each CT and PN is a single term from the dictionary
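The grammar above can be parsed with a small Python sketch. It reads the grammar as allowing either `;` or `,` between context terms but not a mix of both, which is an interpretation rather than something the slides state; the example name `[Customer;Address] City` and the `parse_semantic_name` function are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class SemanticName(NamedTuple):
    contexts: List[str]        # context terms (CT), in order
    separator: Optional[str]   # ";" or "," when there are several context terms
    prop: Optional[str]        # property name (PN), if present


_PATTERN = re.compile(r"^\[([^\[\]]+)\]\s*(\S.*)?$")


def parse_semantic_name(text: str) -> SemanticName:
    """Parse '[CT_Type]' or '[CT_Type] PN' into its parts."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a semantic name: {text!r}")
    body, prop = match.group(1), match.group(2)
    separators = {ch for ch in body if ch in ";,"}
    if len(separators) > 1:
        raise ValueError("context terms must use either ';' or ',', not both")
    separator = separators.pop() if separators else None
    contexts = [c.strip() for c in (body.split(separator) if separator else [body])]
    if any(not c for c in contexts):
        raise ValueError(f"empty context term in {text!r}")
    return SemanticName(contexts, separator, prop.strip() if prop else None)


with_property = parse_semantic_name("[Customer;Address] City")
context_only = parse_semantic_name("[Customer]")
```

The sketch does not check that each term exists in the standard dictionary; a fuller version would validate every CT and PN against it.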
Page 24
Unity Example Step #2 - Data Extraction (2)
- Semantic names are initially assigned using an automatic algorithm which attempts to find the best matches.
- The integrator can then refine the initial semantic name assignments.
- Semantic names have two major purposes:
  - used as a means for describing, documenting, and comparing concepts across systems
  - allow information in the database (and later in the integrated view) to be organized by semantic concept instead of using structures or relations
    - This simplifies querying the database and integrated view because the information is not divided into normalized relations.
Page 26
Unity Example Step #3 - Schema Building
- Unlike previous approaches, the global view (or schema) is constructed automatically by combining source specifications (X-Specs).
- This is possible because semantic naming of concepts allows matching across systems:
  - The same semantic name in two databases is assumed to represent the same concept.
  - The hierarchical nature of semantic names (consisting of multiple terms) allows a schema to be built up from pieces of relations or objects from each data source.
- Effectively, the global view is synthesized by the union of concepts in the underlying systems.
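The matching rule (same semantic name, same concept) can be sketched in Python as a union of X-Specs keyed by semantic name. Here an X-Spec is reduced to a hypothetical dict from semantic name to (table, field); real X-Specs are XML documents that also carry key and join information, and the `integrate` function and sample names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Mapping = Tuple[str, str, str]  # (database, table, field)


def integrate(xspecs: Dict[str, Dict[str, Tuple[str, str]]]) -> Dict[str, List[Mapping]]:
    """Union all concepts; identical semantic names are treated as one concept."""
    view: Dict[str, List[Mapping]] = defaultdict(list)
    for database, spec in xspecs.items():
        for semantic_name, (table, column) in spec.items():
            view[semantic_name].append((database, table, column))
    return dict(view)


# Hypothetical, heavily simplified X-Specs for two of the sources.
xspecs = {
    "orderDB": {
        "[Customer] Id": ("Cust", "id"),
        "[Customer] Name": ("Cust", "name"),
        "[Order] Id": ("Order", "oid"),
    },
    "shipmentDB": {
        "[Customer] Id": ("Cust", "id"),
        "[Customer;Address] City": ("Cust", "city"),
        "[Shipment] Id": ("Shipment", "shipid"),
    },
}

view = integrate(xspecs)
```

Each entry of the resulting view records where a concept physically lives, which is the information a query processor needs later to choose tables and fields.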
Page 28
Unity Example Step #4 - Query Processing
- The query processor:
  - Allows the user to formulate queries on the view.
  - Translates from semantic names in the context view to structural queries (SQL) on databases.
    - Involves determining correct field and table mappings and discovery of join conditions and join paths
  - Retrieves query results and formats them for display to the user.
- Client-side query processing:
  - Perform joins between databases using common keys.
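The translation and client-side join steps can be illustrated with a minimal Python sketch, assuming two sources simulated by in-memory SQLite databases. The `MAPPINGS` table, the choice of `[Customer] Id` as the common key, and the `query` function are all hypothetical; Unity itself issues SQL through ODBC and discovers join paths from the X-Specs.

```python
import sqlite3

# Two hypothetical local sources, simulated with in-memory SQLite databases.
order_db = sqlite3.connect(":memory:")
order_db.execute("CREATE TABLE Cust (id INTEGER, name TEXT)")
order_db.executemany("INSERT INTO Cust VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

shipment_db = sqlite3.connect(":memory:")
shipment_db.execute("CREATE TABLE Cust (id INTEGER, addr TEXT)")
shipment_db.executemany("INSERT INTO Cust VALUES (?, ?)", [(1, "12 Elm St")])

# Hypothetical mappings from semantic names to (connection, table, field).
MAPPINGS = {
    "[Customer] Id": [(order_db, "Cust", "id"), (shipment_db, "Cust", "id")],
    "[Customer] Name": [(order_db, "Cust", "name")],
    "[Customer] Address": [(shipment_db, "Cust", "addr")],
}
KEY = "[Customer] Id"


def query(semantic_names):
    """Fetch each requested concept from a source and join on the common key."""
    result = {}
    for sem in semantic_names:
        conn, table, column = MAPPINGS[sem][0]
        # Find the key field that lives in the same source table.
        key_field = next(f for c, t, f in MAPPINGS[KEY] if c is conn and t == table)
        sql = f"SELECT {key_field}, {column} FROM {table}"
        for key, value in conn.execute(sql):
            result.setdefault(key, {})[sem] = value
    # Client-side inner join: keep keys that have every requested concept.
    return {k: v for k, v in result.items() if len(v) == len(semantic_names)}


answer = query(["[Customer] Name", "[Customer] Address"])
```

The user names concepts only; the table names, field names, and the join on customer id are all filled in by the processor.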
Page 30
Benefits and Contributions
- The architecture automatically integrates relational schemas into a global view for querying.
- Unique contributions:
  - Synthesizing a global view from the bottom up instead of top down. This should improve integration scalability.
  - Organizing the global view as a hierarchy of concepts instead of relations or predicates. This simplifies querying in a way similar to the Universal Relation, as the user does not have to specify particular predicates/relations or join conditions.
  - Query processing is achieved by dynamically discovering extraction rules. The discovered rules are similar to the extraction rules of GAV systems.
Page 31
Future Work
- Unity performs schema integration by extracting data source information and performing global joins.
- However, the global query processor needs to be extended to handle more diverse queries involving:
  - aggregation and grouping, recursive queries, and queries with selection conditions that span data sources
  - support for typical data integration problems of scaling, data type conversions, and translation of units
- Synthesizing the global view by combining concepts can be improved by exploiting dictionary knowledge:
  - Use IS-A relationships in the dictionary to improve matching.
  - Determine when to create new global level attributes and contexts that are discovered based on interschema relationships.
Page 32
References
Publications:
- Unity - A Database Integration Tool, R. Lawrence and K. Barker, TRLabs Emerging Technology Bulletin, Jan. 2000.
- Multidatabase Querying by Context, R. Lawrence and K. Barker, DataSem2000, pages 127-136, Oct. 2000.
- Integrating Relational Database Schemas using a Standardized Dictionary, SAC’2001 - ACM Symposium on Applied Computing, pages 225-230, March 2001.
- Querying Relational Databases without Explicit Joins, DASWIS 2001 - International Workshop on Data Semantics in Web Information Systems (with ER'2001), Nov. 2001.
Further Information: http://www.cs.uiowa.edu/~rlawrenc/
Page 33
Extra Slides
Data Warehouse Approach
[Figure: the Order, Invoice, and Shipment databases each pass through Gather, Refine, Aggregate, and Store steps into a single Warehouse]
Order Database:
  Cust(id,name,addr,city,state,cty)
  Order(oid,cid,odate)
  OrdProd(oid,pid,amt,pr)
  Prod(id,name,pr,desc)
Invoice Database:
  Cust(id,name,addr,city,state,cty)
  Invoice(invId,custId,invDate)
  InvProd(invId,prodId,amt,pr)
  Prod(id,name,pr,desc)
Shipment Database:
  Cust(id,name,addr,city,state,cty)
  Shipment(shipid,oid,cid,shipdate)
  ShipProd(shipid,prodid,amt)
  Prod(id,name,pr,desc,inv)
Integration Architecture
Architecture Components:
1) Integrated Context View
   - user’s view of integration
2) X-Spec Editor
   - stores schema & metadata
   - uses XML
3) Standard Dictionary
   - terms to express semantics
4) Integration Algorithm
   - combines X-Specs into integrated context view
5) Query Processor
   - accepts query on view
   - determines data source mappings and joins
   - executes queries and formats results
[Figure: architecture diagram. Labels: Client, Multidatabase Layer, Integrated Context View, Query Processor and ODBC Manager, Integration Algorithm, Standard Dictionary, X-Spec Editor, X-Spec, Subtransactions, Local Transactions, Database]
Page 36
Architecture Components
The architecture consists of four components:
- A standard dictionary (SD) to capture data semantics
  - SD terms are used to build semantic names describing the semantics of schema elements.
- X-Specs for storing data semantics
  - Database metadata and semantic names are stored using XML.
- Integration Algorithm
  - Matches concepts in different databases by semantic names.
  - Produces an integrated view of all database concepts.
- Query Processor
  - Allows the user to formulate queries on the view.
  - Translates from semantic names in the integrated view to SQL queries, and integrates and formats results.
  - Involves determining correct field and table mappings and discovery of join conditions and join paths.
Page 37
Integration Processes
The integration architecture consists of three separate processes:
- Capture process: independently extracts database schema information and metadata into an XML document called an X-Spec.
- Integration process: combines X-Specs into a structurally-neutral hierarchy of database concepts called an integrated context view.
- Query process: allows the user to formulate queries on the integrated view; the query processor maps them to structural queries (SQL), and the results are integrated and formatted.
Page 38
Architecture Components: Dictionary vs. Knowledge Base
The standard dictionary differs from a knowledge base such as Cyc because:
- It is not intended to be a general English dictionary or to contain knowledge facts about the world
  - Dictionary is evolved as new terms are required
  - Not all English words are used
- Dictionary provides the system with no “knowledge”
  - Since no facts are stored, the system cannot deduce new facts
  - Dictionary terms are just semantic placeholders; integrators, not the system, determine the semantics of the database
- Simplified organization
  - Dictionary is organized as a tree for efficiency and simplicity in determining related concepts
- Re-use of terms
  - Terms are re-used in semantic names