AutoJoin: Providing Freedom from Specifying Joins
Terrence Mason
Lixin Wang
Dr. Ramon Lawrence
Iowa Database and Emerging Application Laboratory
University of Iowa
7th International Conference on Enterprise Information Systems ICEIS 2005 Miami, Florida
Presentation Outline
  Define Query Inference
  Query Languages that require Inference
  AutoJoin Architecture
  Join Graph represents a schema
  Queries and Query Interpretations on a Join Graph
  Pre-compute maximal join trees
    Algorithm EMO
  Query time processing – Example
  Performance Evaluation
Query Inference Problem: New Languages
  The query inference problem requires enumerating and ranking query interpretations of a query such that the query interpretation desired by the user is among the highest ranked interpretations.
  State of the art query languages require it
    Keyword Search – automatically relate keywords across relations of a schema
    Conceptual Queries – concepts mapped to database must be related
    Natural Language Queries
      Natural language query mapped to concepts
      Relate concepts as in Conceptual Queries
  Current approaches not scalable
    Tied to specific language
    Or conceptual model
Motivation for Query Inference
  Reduces to graph problem
    Connect relations (nodes) with joins (edges)
    Exponential solutions for highly connected graphs (database graphs less connected)
  Approaches to join determination
    Grow all ways
      Universal Relation (Maier and Ullman, 1983)
      Discover (Keyword) (Hristidis and Papakonstantinou, 2002, 2003, 2004)
    Shortest Paths
      CQL Conceptual Query Language (Owei and Navathe, 2001)
    Limited Interpretations
      Steiner Tree (2-Trees) (Wald and Sorenson, 1984)
      Limit number of joins and interpretations (Zhang et al., 1999)
    Query time find spanning trees of keywords
      DBXplorer Keyword Search (Agrawal et al. 2002)
Motivation for Query Inference: Goal of AutoJoin
  Consistent, Scalable Inference Engine
    Abstract database schema from users
    Automatically determine joins to relate relations and attributes
    Consistent approach to handle ambiguity in queries
    Efficient algorithm to pre-compute potential joins
    Minimal overhead at query time
    Demonstrate efficiency and scalability
    Structured on relational model without any required conceptual models
Example Query on TPC-H Schema
  English Query: List all parts ordered by Customers in the United States.
  Attribute-only SQL
  Determine Joins with AutoJoin
  New formulation for Query Inference problem.
TPC-H Schema
  Table: Attributes
  Part: partkey, name, mfgr, brand, type, size, container, retailprice, comment
  Supplier: suppkey, name, address, nationkey, phone, acctbal, comment
  PartSupp: partkey, suppkey, availqty, supplycost, comment
  Customer: custkey, name, address, nationkey, phone, acctbal, mktsegment, comment
  Order: orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, clerk, shippriority, comment
  LineItem: orderkey, partkey, suppkey, linenumber, quantity, extendedprice, discount, returnflag, tax, linestatus, shipdate, commitdate, receiptdate, shipinstruct, shipmode, comment
  Nation: nationkey, name, regionkey, comment
  Region: regionkey, name, comment
TPC-H BENCHMARK™ (http://www.tpc.org/)
List all parts ordered by Customers in the United States.
  Attribute-only Query: Select Part.Name where Nation.Name = 'United States';
    Part.Name – name attribute in Part table
    Nation.Name – name attribute in Nation table
    Select and where similar to SQL
    No From clause or joins specified
  Keyword Query: Part 'United States'
    Maps Part to Part relation
    Maps 'United States' to tuple in Nation relation
    No joins specified
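The slides do not describe how AutoJoin parses an attribute-only query, so the following is only a minimal sketch of the first step: collecting the relations that a query mentions so they can be handed to the inference engine as query nodes. The function name `query_relations` and the regular expression are illustrative assumptions, not part of AutoJoin.

```python
import re

# Hypothetical helper: matches Relation.attribute references such as Part.Name.
ATTRIBUTE_REF = re.compile(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b")

def query_relations(query: str) -> list[str]:
    """Return the distinct relation names referenced in the query, in order."""
    # Blank out quoted literals first so dots inside string values are ignored.
    without_literals = re.sub(r"'[^']*'", "''", query)
    relations = []
    for relation, _attribute in ATTRIBUTE_REF.findall(without_literals):
        if relation not in relations:
            relations.append(relation)
    return relations

query = "Select Part.Name where Nation.Name = 'United States';"
print(query_relations(query))  # ['Part', 'Nation']
```

The two relations returned here, Part and Nation, are the nodes that still need to be connected by joins.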
SQL Query
  Select Part.Name where Nation.Name = 'United States';
  Specified joins and tables:
    SELECT P.name
    FROM part P, nation N, partsupp PS, lineitem LI, orders O, customer C
    WHERE N.name = 'United States'
      AND P.partkey = PS.partkey
      AND PS.partkey = LI.partkey
      AND PS.suppkey = LI.suppkey
      AND O.custkey = C.custkey
      AND C.nationkey = N.nationkey
      AND LI.orderkey = O.orderkey;
AutoJoin Architecture
(Figure: architecture diagram with the components User, Query Interface, Inference Request, Query Builder, AutoJoin Inference Engine (Generator, Ranker, Iterator, Loader), XML Document, Relational Database, Execute Queries, and Interpretations.)
Representing Joins of a Schema: Join Graph
  Graph representation of relational schema
  Nodes
    Relations in schema
  Directed Edges
    Foreign key constraint between relations
    Edges directed from N to 1 cardinality of relationships
    Maintain Lossless property (no spurious tuples on joins)
Create Join Graph TPC-H
  Nodes Joined: Foreign key/Join
  Line Item to Part: partkey = partkey
  Line Item to PartSupp: (partkey, suppkey) = (partkey, suppkey)
  Line Item to Supplier: suppkey = suppkey
  Line Item to Order: l_orderkey = o_orderkey
  PartSupp to Part: ps_partkey = p_partkey
  PartSupp to Supplier: ps_suppkey = s_suppkey
  Supplier to Nation: s_nationkey = n_nationkey
  Order to Customer: o_custkey = c_custkey
  Customer to Nation: c_nationkey = n_nationkey
  Nation to Region: n_regionkey = r_regionkey
(Figure: TPC-H join graph with tables as nodes: PartSupp, Nation, Supplier, Part, Line Item, Order, Customer, Region.)
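As a rough illustration, the join graph listed on the Create Join Graph TPC-H slide can be written as an adjacency list with one directed edge per foreign key, pointing from the N side to the 1 side. The dictionary layout and the helper name `incoming_edges` are assumptions made for this sketch; the slides do not say how AutoJoin stores the graph.

```python
# TPC-H join graph: each edge points from the relation holding the foreign key
# (N side) to the referenced relation (1 side).
JOIN_GRAPH = {
    "LineItem": ["Part", "PartSupp", "Supplier", "Order"],
    "PartSupp": ["Part", "Supplier"],
    "Supplier": ["Nation"],
    "Order": ["Customer"],
    "Customer": ["Nation"],
    "Nation": ["Region"],
    "Part": [],
    "Region": [],
}

def incoming_edges(graph):
    """Map each node to the list of nodes that have an edge into it."""
    incoming = {node: [] for node in graph}
    for source, targets in graph.items():
        for target in targets:
            incoming[target].append(source)
    return incoming

incoming = incoming_edges(JOIN_GRAPH)
roots = [node for node, preds in incoming.items() if not preds]
multi_role = sorted(node for node, preds in incoming.items() if len(preds) > 1)
print(roots)       # ['LineItem']
print(multi_role)  # ['Nation', 'Part', 'Supplier']
```

Nodes with more than one incoming edge are the places where a query can be related in more than one way, which is where ambiguity enters.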
Pre-compute Maximal Join Trees
  EMO Algorithm on Join Graph
    Efficiently computes all trees
    Executes where previous strategy failed
    Direction of edges results in lossless join trees
  Pre-computed
    Executed once prior to query time
    Structures built for query time performance
Compute Lossless Joins
  Maximal sets of lossless joins
  Ambiguity inherent in the schema
  Two types of ambiguity:
    Single relation that plays multiple roles
      Node with more than one incoming edge in join graph
    Multiple semantic relationships between entities
      Strongly connected components greater than one node
Creation of Maximal Join Trees: Lossless Joins
  Efficient Algorithm EMO
    Determine all reachable graphs from nodes that may be a root for a Maximal Set of Lossless Joins
      Identify all Strongly Connected Components (SCC)
      For each SCC
        If the SCC is a single node with no incoming edges, create the reachable graph from this node
        If the SCC has multiple nodes, create a reachable graph from each node in the SCC that has no incoming edges from outside the SCC
    For each reachable graph find all spanning trees
      Spanning trees represent Maximal Join Trees
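The steps above can be sketched in Python as follows. This is a naive illustration under stated assumptions, not the authors' EMO implementation: strongly connected components are found with Kosaraju's algorithm, and spanning trees are enumerated by brute force (one parent per non-root node, keeping only choices that lead back to the root), which would not scale the way the slides claim EMO does. All function names are hypothetical.

```python
from itertools import product

# TPC-H join graph from the slides: edges point from the N side to the 1 side.
JOIN_GRAPH = {
    "LineItem": ["Part", "PartSupp", "Supplier", "Order"],
    "PartSupp": ["Part", "Supplier"],
    "Supplier": ["Nation"],
    "Order": ["Customer"],
    "Customer": ["Nation"],
    "Nation": ["Region"],
    "Part": [],
    "Region": [],
}

def predecessors(graph):
    """Map each node to the set of nodes with an edge into it."""
    incoming = {node: set() for node in graph}
    for source, targets in graph.items():
        for target in targets:
            incoming[target].add(source)
    return incoming

def strongly_connected_components(graph):
    """Kosaraju's algorithm: DFS finish order, then sweep the reversed graph."""
    order, seen = [], set()
    def visit(node):
        seen.add(node)
        for nxt in graph[node]:
            if nxt not in seen:
                visit(nxt)
        order.append(node)
    for node in graph:
        if node not in seen:
            visit(node)
    incoming, assigned, components = predecessors(graph), set(), []
    for node in reversed(order):
        if node in assigned:
            continue
        component, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current not in assigned:
                assigned.add(current)
                component.add(current)
                stack.extend(incoming[current] - assigned)
        components.append(component)
    return components

def candidate_roots(graph):
    """Nodes of each SCC that have no incoming edge from outside that SCC."""
    incoming = predecessors(graph)
    return sorted(node for component in strongly_connected_components(graph)
                  for node in component if not incoming[node] - component)

def reachable_graph(graph, root):
    """Sub-graph induced by the nodes reachable from root."""
    seen, stack = {root}, [root]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return {node: list(graph[node]) for node in seen}

def leads_to_root(node, parent_of, root):
    """Follow parent links; fail if the root is not reached (a cycle)."""
    for _ in range(len(parent_of) + 1):
        if node == root:
            return True
        node = parent_of[node]
    return False

def spanning_trees(graph, root):
    """Naive enumeration: pick one parent per non-root node, keep valid picks."""
    nodes = [node for node in graph if node != root]
    incoming = predecessors(graph)
    trees = []
    for choice in product(*(sorted(incoming[n] - {n}) for n in nodes)):
        parent_of = dict(zip(nodes, choice))
        if all(leads_to_root(n, parent_of, root) for n in nodes):
            trees.append(parent_of)
    return trees

roots = candidate_roots(JOIN_GRAPH)
trees = [tree for root in roots
         for tree in spanning_trees(reachable_graph(JOIN_GRAPH, root), root)]
print(roots, len(trees))  # ['LineItem'] 8
```

On the TPC-H join graph this yields a single root and eight spanning trees, which agrees with the eight maximal objects reported for TPC-H in the performance chart.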
Maximal Join Trees of TPC-H
  LineItem is the only root for a reachable graph.
  No strongly connected components
    Join graph is reachable graph
  Enumerate spanning trees on original graph
  Remove shortcut joins and re-compute
TPC-H Maximal Join Trees
(Figure: the eight maximal join trees of the TPC-H join graph, numbered 1 to 8, each drawn over the nodes PartSupp, Nation, Supplier, Part, Line Item, Order, Customer, Region.)
Shortcut Joins
  Semantically equivalent join paths
    A shortcut join is a join that is semantically equivalent to a longer join path
    Core join path (longer) preserved in join graph
  Shortcut join removed for join determination
    Appears to be a semantically different interpretation of the query
  Substituted back into query
    No nodes on core path in query (faster execution)
  TPC-H has two shortcut joins
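The slides do not name the two TPC-H shortcut joins or say how they are identified. As a hedged sketch, one structural test is to flag every edge whose endpoints are also connected by a longer directed path; such edges are only candidates, since semantic equivalence is a property of the data model that graph structure cannot confirm. On the TPC-H join graph this test flags exactly two edges, Line Item to Part and Line Item to Supplier, which is assumed here to match the two shortcut joins. The tree count below uses a simplification that holds only for an acyclic join graph with a single root.

```python
from math import prod

JOIN_GRAPH = {
    "LineItem": ["Part", "PartSupp", "Supplier", "Order"],
    "PartSupp": ["Part", "Supplier"],
    "Supplier": ["Nation"],
    "Order": ["Customer"],
    "Customer": ["Nation"],
    "Nation": ["Region"],
    "Part": [],
    "Region": [],
}

def has_longer_path(graph, source, target):
    """True if target is reachable from source without the direct edge."""
    seen, stack = {source}, [n for n in graph[source] if n != target]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False

def shortcut_candidates(graph):
    """Edges that duplicate a longer path (candidates only, not a proof)."""
    return [(s, t) for s in graph for t in graph[s] if has_longer_path(graph, s, t)]

def without_edges(graph, edges):
    """Copy of the graph with the given edges removed."""
    return {s: [t for t in targets if (s, t) not in edges]
            for s, targets in graph.items()}

def count_spanning_trees(graph):
    """Assumes an acyclic graph with one root: every other node picks one
    parent independently, so the count is the product of the in-degrees."""
    in_degree = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    return prod(d for d in in_degree.values() if d > 0)

candidates = shortcut_candidates(JOIN_GRAPH)
pruned = without_edges(JOIN_GRAPH, set(candidates))
print(candidates)
print(count_spanning_trees(JOIN_GRAPH), count_spanning_trees(pruned))  # 8 2
```

Dropping the two candidate edges reduces the eight maximal join trees to two, consistent with the two semantically unique maximal join trees shown for TPC-H.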
TPC-H Join Graph: Remove Shortcut Joins
(Figure: TPC-H join graph over PartSupp, Nation, Supplier, Part, Line Item, Order, Customer, Region. Red – Shortcut Joins.)
Original TPC-H Maximal Join Trees
(Figure: the eight original maximal join trees, numbered 1 to 8.)
TPC-H Semantically Unique Maximal Join Trees
(Figure: the two maximal join trees that remain once the shortcut joins are removed, numbered 1 and 2.)
Query and Query Interpretation: AutoJoin Join Graphs
  Query:
    Sub-graph of the join graph
    Nodes and (optionally) edges
    Not connected requires inference
  Query Interpretation:
    Connected sub-graph of the join graph
    Includes all specified nodes and edges
Example Query
  SELECT Part.Name WHERE Nation.Name = 'United States';
  Relate Part.Name to Nation.Name
    Part and Nation nodes.
  Query of Part and Nation nodes to AutoJoin.
  The query is ambiguous
    More than one query interpretation
    Nation relates to Supplier and Customer
  Return the query with fewest joins first
Efficient Query Time Execution
  Find maximal join trees (maximal sets of lossless joins) with query nodes
    Reverse index – relation to its set of join trees
    Intersect lists
  Build Interpretations
    Least common ancestor (vs. recursive prune)
    Pre-compute ancestor lists
  No lossless interpretations (no trees)
    Find lossy interpretation
  Rank interpretations by cost function
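A minimal sketch of this query-time procedure follows. It assumes the two semantically unique TPC-H trees are the ones obtained by dropping the Line Item to Part and Line Item to Supplier edges, stores each tree as a child-to-parent map, and ranks by number of joins as a stand-in for the cost function, which the slides do not define. The names `TREES`, `build_reverse_index`, `interpretation`, and `infer` are hypothetical.

```python
# Shared part of both trees, stored as child -> parent.
COMMON = {"PartSupp": "LineItem", "Order": "LineItem", "Part": "PartSupp",
          "Supplier": "PartSupp", "Customer": "Order", "Region": "Nation"}
TREES = {
    "via_supplier": {**COMMON, "Nation": "Supplier"},
    "via_customer": {**COMMON, "Nation": "Customer"},
}

def build_reverse_index(trees):
    """Reverse index: relation -> set of ids of the trees that contain it."""
    index = {}
    for tree_id, parent_of in trees.items():
        for node in set(parent_of) | set(parent_of.values()):
            index.setdefault(node, set()).add(tree_id)
    return index

def ancestors(parent_of, node):
    """Path from node up to the root, inclusive (the ancestor list)."""
    path = [node]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path

def interpretation(parent_of, query_nodes):
    """Edges on the paths from each query node up to their least common ancestor."""
    paths = [ancestors(parent_of, node) for node in query_nodes]
    common = set(paths[0]).intersection(*paths[1:])
    lca = next(node for node in paths[0] if node in common)
    edges = set()
    for path in paths:
        for child in path[:path.index(lca)]:
            edges.add((parent_of[child], child))
    return frozenset(edges)

def infer(trees, index, query_nodes):
    """Intersect the index lists, build one interpretation per tree, drop
    duplicates, and rank by join count (assumed cost function)."""
    tree_ids = set.intersection(*(index.get(n, set()) for n in query_nodes))
    found = {interpretation(trees[t], query_nodes) for t in tree_ids}
    return sorted(found, key=lambda edges: (len(edges), sorted(edges)))

INDEX = build_reverse_index(TREES)
for edges in infer(TREES, INDEX, ["Part", "Nation"]):
    print(len(edges), sorted(edges))
print(len(infer(TREES, INDEX, ["Supplier", "Order"])))  # 1
```

For Part and Nation the sketch returns two interpretations, the three-join path through Supplier ranked ahead of the five-join path through Customer. For Supplier and Order both trees prune to the same edges, so only one interpretation remains.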
Both Trees Contain Query Nodes
  Select Part.Name where Nation.Name = 'United States';
  (Figure: maximal join trees 1 and 2 over PartSupp, Nation, Supplier, Part, Line Item, Order, Customer, Region. Red – Target Nodes.)
Query Processing
  (Figure: the same two trees. Red – Target Nodes; Blue – Tree Nodes; Gray – Nodes to Prune.)
Query Interpretations
  Interpretation 1 (nodes: Part, PartSupp, Line Item, Order, Customer, Nation)
    Select Part.Name where Customer.Nation.Name = 'United States';
  Interpretation 2 (nodes: Part, PartSupp, Supplier, Nation)
    Select Part.Name where Supplier.Nation.Name = 'United States';
Unambiguous Query
  Select Supplier.Name where Order.Id = 73;
  (Figure: maximal join trees 1 and 2 over PartSupp, Nation, Supplier, Part, Line Item, Order, Customer, Region. Red – Target Nodes.)
Query Processing
  Select Supplier.Name where Order.Id = 73;
  (Figure: maximal join trees 1 and 2. Red – Target Nodes; Blue – Tree Nodes; Gray – Nodes to Prune.)
Query Interpretations
  Select Supplier.Name where Order.Id = 73;
  (Figure: trees 1 and 2 each prune to the same nodes: PartSupp, Supplier, Line Item, Order.)
The Unambiguous Query Interpretation
  Select Supplier.Name where Order.Id = 73;
  (Figure: the single interpretation over PartSupp, Supplier, Line Item, Order.)
Additional Interpretations: Lossy Joins
  Related through a node involved in two distinct roles
    Two maximal join trees contain all query nodes and have at least one node in common
    Union maximal join trees
    Common nodes provide relation for trees.
    Interpretation where node will have two incoming edges
    No longer lossless
  Example
    Customer and Supplier related through Nation in TPC-H.
    Cross products of Customers and Suppliers with the same nation
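The Customer and Supplier example can be sketched as below. This is an illustrative reading of "union maximal join trees", not the authors' procedure: for each query node it looks for a tree in which that node lies above the shared node, takes the path between them, and unions the paths. The trees are the same assumed child-to-parent maps for the two semantically unique TPC-H trees, and the function names are hypothetical.

```python
COMMON = {"PartSupp": "LineItem", "Order": "LineItem", "Part": "PartSupp",
          "Supplier": "PartSupp", "Customer": "Order", "Region": "Nation"}
TREES = [
    {**COMMON, "Nation": "Supplier"},  # Nation in its supplier role
    {**COMMON, "Nation": "Customer"},  # Nation in its customer role
]

def path_down_to(parent_of, ancestor, node):
    """Edges from ancestor down to node in one tree, or None if ancestor
    is not above node in that tree."""
    edges = []
    while node != ancestor:
        if node not in parent_of:
            return None
        edges.append((parent_of[node], node))
        node = parent_of[node]
    return edges

def lossy_interpretation(trees, query_nodes, shared):
    """Union, over the query nodes, of a path to the shared node taken from
    whichever tree provides one. Returns None if some node has no such path."""
    union = set()
    for query_node in query_nodes:
        for parent_of in trees:
            edges = path_down_to(parent_of, query_node, shared)
            if edges is not None:
                union.update(edges)
                break
        else:
            return None
    return union

edges = lossy_interpretation(TREES, ["Customer", "Supplier"], "Nation")
incoming_to_nation = sum(1 for _parent, child in edges if child == "Nation")
print(sorted(edges), incoming_to_nation)
```

The result joins Customer to Nation and Supplier to Nation, so Nation has two incoming edges. Pairing every customer with every supplier of the same nation is what makes the interpretation lossy.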
Beyond Natural Joins
  Theta joins
    Merge the two nodes related by theta join into single node and re-compute maximal objects.
    Expand this node for final query interpretation with theta join
  Tuple Variables
    A query interface may specify tuple variables
    Additional nodes and edges will be added to join graph to complete the query interpretations
Performance Experiments
  Broad Range of Schemas
    caBIO (NCI) – 149 relations, 213 joins, and 1253 maximal join trees
    TPC-H Standard Database
      Inferred standard queries (21 specified queries)
      Ambiguity reduced by removing shortcut joins
    Tenant – 9 nodes, 50 joins, and 1286 maximal join trees
Performance Results
  Time to generate all Maximal Join Trees
    Handles schemas where previous method failed
    Worst test 2.7 seconds
    Average < 1 second
  Reduce Ambiguity
    Removing shortcut joins reduces ambiguity
    Increased number of unambiguous queries
      From 45% to 68% for TPC-H Benchmark Queries
  Minimal overhead of inference at query time
    Average < 1 millisecond
    Worst test 7.4 milliseconds
Compute Maximal Join Trees: EMO vs. All Ways
(Figure: bar chart of time in seconds, All Ways vs. EMO, per schema with its number of maximal objects: TPC-H (8), Claims (31), Tenant (1286), caBIO (1253), ACID (20), MONDIAL (117). Bar labels in extraction order: 4.906, 18.187, 0.093, 0.187, 0.031, 0.11, 0.047, 2.652, 0.797, 0.079, 0.016; one bar is marked ∞.)
Reducing Ambiguity: Remove Shortcut Joins
(Figure: bar chart of percent unambiguous joins, Original vs. Shortcuts Removed, for TPC-H (All), TPC-H (Benchmark), and EDS. Bar labels in extraction order: 8%, 45%, 33%, 26%, 68%, 100%.)
Query Inference Time (Milliseconds)
(Figure: bar chart of average query time in milliseconds per query for TPC-H Queries, TPC-H Shortcut Queries, TPC-H All, TPC-H Shortcut All, ACID, Claims, Tenant, MONDIAL, and caBIO. Bar labels in extraction order: 0.355, 0.282, 0.392, 0.057, 0.055, 0.147, 7.420, 0.173, 2.764.)
AutoJoin Conclusions
  Scalable inference engine
  Efficiently pre-compute maximal join trees
  Reduced ambiguity by removing shortcut joins
  Overhead is minimal
  Complex queries can be inferred
  Built directly on relational model
Future Work
  Develop a query language
    Remove requirement of understanding the underlying schema
    Automatically determines joins
  End user interface based on AutoJoin
  Query inference for integration systems.
Query Inference (Previous)
  The translation of a query in a query language into an unambiguous representation of the query [Wald and Sorenson, 1984]
  Universal Relation
    First model to require query inference
    Maximal Objects (Maier and Ullman, 1983)
      Lossless Join property to identify potential joins
      Grows all ways on hyper-graph
      Returns a union of all query interpretations
    Minimum Directed Cost Steiner Tree (Wald and Sorenson, 1984)
      Limited to Partial 2-Trees
      Returns only lowest cost query interpretation
  Generate a single interpretation
    Do not meet need of new query languages
    Limited query interpretations possible
State of the Art Query Languages
  Keyword Searches
    Keywords map to either specific data, attribute names, or relation names in a database.
    Must identify joins to relate keywords spread across multiple relations.
    Multiple approaches to identifying the top-k relationships between keywords.
Keyword Search: Top-K Relationships
  Discover (Hristidis and Papakonstantinou, 2002, 2003, 2004)
    Grow all ways from a keyword
    Limit on number of joins
    Creates extra graphs
  DBXplorer (Agrawal et al. 2002)
    Generates spanning trees at query time
  BANKS ( )
    Graph of all tuples related by joins
    Must fit in memory (limited to smaller databases)
State of the Art Query Languages
  Conceptual Query Languages or Models
    Queries built with concepts that map to a database.
    Remove the burden of knowledge of the schema.
    Must determine joins to relate concepts in query.
    Use conceptual model to determine joins
Conceptual Query Languages
  CQL (Owei and Navathe, 2001)
    Queries may include roles or joins required for a query
    Pathfinder algorithm for completing the query
      Based on shortest path between source and target concepts in query
    Semantically Constrained ER Diagram as a graph used to determine joins.
  Conceptual Model (Zhang et al., 1999)
    Semantic graph of database
    Search algorithm constrained by number of joins or number of interpretations
State of the Art Query Languages
  Natural Language Queries
    Natural language queries map the language to concepts in a database
    Joins must be determined to relate concepts in database similar to Conceptual Query Languages
Functional Dependencies due to Primary Keys: TPC-H
  Table: Functional Dependencies
  Part: p_partkey → p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment
  Supplier: s_suppkey → s_name, s_address, s_nationkey, s_phone, s_acctbal, s_comment
  PartSupp: ps_partkey, ps_suppkey → ps_availqty, ps_supplycost, ps_comment
  Customer: c_custkey → c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment
  Order: o_orderkey → o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment
  LineItem: l_orderkey, l_linenumber → l_partkey, l_suppkey, l_orderkey, l_quantity, l_extendedprice, l_discount, l_returnflag, l_tax, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment
  Nation: n_nationkey → n_name, n_regionkey, n_comment
  Region: r_regionkey → r_name, r_comment
Functional Dependencies TPC-H implied by Foreign Keys
  Table with Foreign Key, Table Referenced: Functional Dependencies
  LineItem, Part: l_partkey → p_partkey
  LineItem, Supplier: l_suppkey → s_suppkey
  LineItem, PartSupp: l_partkey, l_suppkey → ps_partkey, ps_suppkey
  LineItem, Order: l_orderkey → o_orderkey
  PartSupp, Part: ps_partkey → p_partkey
  PartSupp, Supplier: ps_suppkey → s_suppkey
  Supplier, Nation: s_nationkey → n_nationkey
  Customer, Nation: c_nationkey → n_nationkey
  Order, Customer: o_custkey → c_custkey
  Nation, Region: n_regionkey → r_regionkey