Efficient Keyword Search across Heterogeneous Relational Databases

Mayssam Sayyadian, AnHai Doan

University of Wisconsin - Madison

Hieu LeKhac

University of Illinois - Urbana

Luis Gravano

Columbia University

Efficient Keyword Search acrossEfficient Keyword Search across Heterogeneous Relational DatabasesHeterogeneous Relational Databases

2

Key Message of PaperKey Message of Paper

Precise data integration is expensive But we can do IR-style data integration

very cheaply, with no manual cost!– just apply automatic schema/data matching – then do keyword search across the databases– no need to verify anything manually

Already very useful

Build upon keyword search over a single database ...

3

Keyword Search over Keyword Search over a Single Relational Databasea Single Relational Database

A growing field, numerous current works– DBXplorer [ICDE02], BANKS [ICDE02]– DISCOVER [VLDB02]– Efficient IR-style keyword search in databases [VLDB03], – VLDB-05, SIGMOD-06, etc.

Many related works over XML / other types of data– XKeyword [ICDE03], XRank [Sigmod03]– TeXQuery [WWW04]– ObjectRank [Sigmod06]– TopX [VLDB05], etc.

More are coming at SIGMOD-07 ...

4

A Typical ScenarioA Typical Scenario

tid custid name contact addr

t1 c124 Cisco Michael Jones …

t2 c533 IBM David Long …

t3 c333 MSR David Ross …

Customers

tid id emp-name comments

u1 c124 Michael Smith Repair didn’t work

u2 c124 John Deferred work to

John Smith

Complaints

Foreign-Key Join

Q = [Michael Smith Cisco]

Ranked list of answers

t1 c124 Cisco Michael Jones … u1 c124 Michael Smith Repair didn’t work

t1 c124 Cisco Michael Jones … u2 c124 John Deferred work to John Smith

score=.8

score=.7

5

Our Proposal:Our Proposal:Keyword Search across Multiple DatabasesKeyword Search across Multiple Databases

IR-style data integration

tid eid reports-to

x1 e23 e37

x2 e14 e37

Groups

tid empid name

Employees

v1 e23 Mike D. Smith

v2 e14 John Brown

v3 e37 Jack Lucas




t3 c333 MSR Joan Brown …

Customers




John Smith

Complaints


v1 e23 Mike D. Smith x1 e23 e37

v3 e37 Jack Lucasacross databases

Query: [Cisco Jack Lucas]

6

A NaA Naïïve Solutionve Solution

1. Manually identify FK joins across DBs

2. Manually identify matching data instances across DBs

3. Now treat the combination of DBs as a single DB

apply current keyword search techniques

Just like in traditional data integration,

this is too much manual work

7

Kite SolutionKite Solution

tid eid reports-to

x1 e23 e37

x2 e14 e37

Groups

tid empid name

Employees


v2 e14 John Brown

v3 e37 Jack Lucas




t3 c333 MSR Joan Brown …

Customers




John Smith

Complaints

Automatically find FK joins / matching data instances across databases

no manual work is required from user

8

Automatically Find FK JoinsAutomatically Find FK Joinsacross Databases across Databases

Current solutions analyze data values (e.g., Bellman) Limited accuracy

– e.g., “waterfront” with values yes/no “electricity” with values yes/no

Our solution: data analysis + schema matching– improve accuracy drastically (by as much as 50% F-1)

tid empid name

Employees


v2 e14 John Brown

v3 e37 Jack Lucas




John Smith

Complaints

Automatic join/data matching can be wrong

incorporate confidence scores into answer scores

9

Incorporate Confidence Scores Incorporate Confidence Scores into Answer Scoresinto Answer Scores

α.score_kw (A, Q) + β.score_join (A, Q) + γ.score_data (A, Q)

size (A)score (A, Q) =

t1 c124 Cisco Michael Jones … u1 c124 Michael Smith Repair didn’t work score=.8

Recall: answer example in single-DB settings

Recall: answer example in multiple-DB settings


v1 e23 Mike D. Smith x1 e23 e37

v3 e37 Jack Lucasscore 0.9 for FK join

score 0.7 for data matching

10

Summary of Trade-OffsSummary of Trade-Offs

Precise data integration– the holy grail

SQL queries

IR-style data integration, naïve way– manually identify FK joins, matching data– still too expensive

IR-style data integration, using Kite– automatic FK join finding / data matching– cheap– only approximates the “ideal” ranked list found by naïve

11

Kite ArchitectureKite Architecture

Q = [ Smith Cisco ]

…

Distributed SQL queries

D1 Dn

Index Builder

Foreign key joins

IR index1 IR indexn…

… D1 Dn

Refinementrules

Offline preprocessing Online querying

Condensed CN Generator

Top-k Searcher

Foreign-Key Join Finder

Data-based Schema

Join Finder Matcher

Data instancematcher

– Partial

– Full

– Deep

12

Online QueryingOnline Querying

Database 2

Relation 1 Relation 2

Database 1

Relation 1 Relation 2

What current solutions do:

1. Create answer templates

2. Materialize answer templates to obtain answers

13

Create Answer Templates Create Answer Templates

Find tuples that contain query keywords– Use DB’s IR index

– example: Q = [Smith Cisco]

Tuple sets:

Create tuple-set graphSchema graph:

Tuple set graph:

Service-DB: ComplaintsQ={u1, u2} CustomersQ={v1}

HR-DB: EmployeesQ={t1} GroupsQ={}

Customers Complaints Emps GroupsJ1 J4

J2

J3

Customers{} Complaints{} Emps{}

Groups{}

J1

CustomersQ ComplaintsQ EmpsQ

J1

J1

J1

J4

J4

J4

J4

J2J3

J3J2

Complaints

u1

u2

Service-DB

Groups

x1

x2

Employees

t1

t2

t3

HR-DB

Customers

v1

v2

v3

14

Create Answer Templates (cont.)Create Answer Templates (cont.)


Groups{}

J1


J1

J1

J1

J4

J4

J4

J4

J2J3

J3J2

sample tuple set graph

EmpsQ Groups{} Emps{} Complaints{Q}J2 J3 J4

CN4:

sample CNs


CN3:

CustomersQ Complaints{Q}CN2: J1

CustomersQCN1:

Search tuple-set graph to generate answer templates– also called Candidate Networks (CNs)

Each answer template = one way to join tuples to form an answer

15

Materialize Answer TemplatesMaterialize Answer Templatesto Generate Answersto Generate Answers

By generating and executing SQL queries

CN: CustomersQ ComplaintsQ (CustomersQ = {v1} , ComplaintsQ = {u1, u2})

SQL: SELECT * FROM Customers C, Complaints P

WHERE C.cust-id = P.id AND

(C.tuple-id = v1) AND

(P.tuple-id = u1 OR tuple-id = u2)

J1

Naïve solution– materialize all answer templates, score, rank, then return answers

Current solutions– find only top-k answers– materialize only certain answer templates– make decisions using refinement rules + statistics

16

Challenges for Kite SettingChallenges for Kite Setting

More databases way too many answer templates to generate– can take hours on just 3-4 databases

Materializing an answer template takes way too long– requires SQL query execution across multiple databases– invoking each database incurs large overhead

Difficult to obtain reliable statistics across databases

See paper for our solutions (or backup slides)

17

Empirical Evaluation Empirical Evaluation

Domain # DBsAvg # tables

per DB

Avg # attributes per

schema

Avg # approximate FK joins tuples per table

Avg # tuples per table

Total size

total across DBs per pair

DBLP 2 3 3 11 6 11 500K 400M Inventory 8 5.8 5.4 890 804 33.6 2K 50M

Domains

The DBLP Schema

CNF (id, name)

CITE (id1, id2)

AR (id, title)

AU (id, name)

AR (aid, biblo)

PU (aid, uid)

DBLP 1 DBLP 2

Sample Inventory Schema

WAREHOUSE

AUTHOR

BOOK

WH2BOOK

CD

ARTIST

WH2CD

Inventory 1

18

Runtime Performance (1)Runtime Performance (1)

Hybrid algorithm adapted to run over multiple databases

Kite without condensed CNs

Kite without adaptive rule selection and without rule Deep

Full-fledged Kite algorithm

Kite without rule Deep

0

60

120

180

1 2 3 4 5 6 7

tim

e (s

ec)

Inventory

0

60

120

180

1 2 3 4 5 6 7 8 9

tim

e (s

ec)

DBLP

maxCCNsize

maxCCNsize

2-keyword queries, k=10, 2 databases 2-keyword queries, k=10, 5 databases

0

15

30

45

1 2 3 4 5 6 7 8

tim

e (s

ec)

Inventory

runtime vs. # of databases

#of DBs

maximum CCN size = 4, 2-keyword queries, k=10

runtime vs. maximum CCN size

19

Runtime Performance (2)Runtime Performance (2)runtime vs. # of keywords in the query

|q|0

5

10

15

20

1 2 3 4 5

tim

e (s

ec) DBLP

max CCN=6, k=10, 2 databases

0

10

20

30

40

1 2 3 4 5

tim

e (s

ec) Inventory

|q|

max CCN=4, k=10, 5 databases

runtime vs. # of answers requested

0

15

30

45

1 4 7 10 13 16 19 22 25 27 30

0

15

30

45

1 4 7 10 13 16 19 22 25 27 30

tim

e (s

ec) Inventory

k

2-keyword queries, max CCN=4, 5 databases

tim

e (s

ec)

k

2-keyword queries, max CCN=4, |q|=2, 5 databases

20

Query Result QualityQuery Result Quality

0

0.2

0.4

0.6

0.8

1

1 5 10 15 200

0.2

0.4

0.6

0.8

1

1 5 10 15 20

Pr@k

k k

Pr@k

OR-semantic queries AND-semantic queries

Pr@k = the fraction of answers that appear in the “ideal” list

21

SummarySummary

Kite executes IR-style data integration– performs some automatic preprocessing– then immediately allows keyword querying

Relatively painless– no manual work!– no need to create global schema, nor to understand SQL

Can be very useful in many settings: e.g., on-the-fly, best-effort, for non-technical people– enterprises, on the Web, need only a few answers– emergency (e.g., hospital + police), need answers quickly

22

Future DirectionsFuture Directions

Incorporate user feedback interactive IR-style data integration

More efficient query processing– large # of databases, network latency

Extends to other types of data– XML, ontologies, extracted data, Web data

IR-style data integration is feasible and useful

extends current works on keyword search over DB

raises many opportunities for future work

23

BACKUPBACKUP

24

Condensing Candidate NetworksCondensing Candidate Networks

In multi-database settings unmanageable number of CNs– Many CNs share the same tuple sets and differ only in the associated joins– Group CNs into condensed candidate networks (CCNs)


Groups{}

J1


J1

J1

J1

J4

J4

J4

J4

J2J3

J3J2

sample tuple set graph

sample CNs


Groups{}`

J1

CustomersQ ComplaintsQ EmpsQJ1

J1

J1

J4

J4

J4

J4

{J2, J3}

{J2, J3}

condense tuple set graph

sample CCNs


CN4:


CN3: EmpsQ Groups{} Emps{} Complaints{Q}

J2{J2, J3} J4

Condense

Condense

25

Top-Top-kk Search Search Main ideas for top-k keyword search:

– No need to materialize all CNs– Sometimes, even partially materializing a CN is enough– Estimate score intervals for CNs, then branch and bound search

.

... Q [0.5, 0.7]

. ... P [0.6, 1]

R [0.4, 0.9]..

iteration 1

K = {P2, P3}, min score = 0.7

.

.

.. P1 [0.6, 0.8]

P2 0.9

. P3 0.7

R [0.4, 0.9]..

iteration 2

Res = {P2, R2}min score = 0.85.. R1 [0.4, 0.6]

. R2 0.85

iteration 3

Kite approach: materialize CNs using refinement rules

26

Top-Top-kk Search Using Refinement Rules Search Using Refinement Rules

• In single-database setting

selecting rules based on database statistics

• In multi-database setting Inaccurate statistics

• Inaccurate statistics Inappropriate rule selection

27

Refinement RulesRefinement Rules Full:

– Exhaustively extract all answers from a CN (fully materialize S) too much data to move around the network (data transfer cost)

Partial:– Try to extract the most promising answer from a CN invoke remote databases for only one answer (high cost of database

invocation)

Deep:– A middle-ground approach– Once a table in a remote

database is invoked, extract all answers involving that table

– Takes into account database invocation cost

TQ UQ

t1 u1 TQ UQ

t1 0.9t2 0.7t3 0.4t4 0.3

0.8 u10.6 u20.5 u30.1 u4

t1 0.9t2 0.7t3 0.4t4 0.3

0.8 u10.6 u20.5 u30.1 u4

TQ UQ TQ UQ

t1 0.9t2 0.7t3 0.4t4 0.3

0.8 u10.6 u20.5 u30.1 u4

t1 0.9t2 0.7t3 0.4t4 0.3

0.8 u10.6 u20.5 u30.1 u4

t1 u3t1 u1 ,

28

Adaptive SearchAdaptive Search Question: which refinement rule to apply next?

– In single-database setting based on database statistics– Multi-database setting inaccurate statistics

Kite approach: adaptively select rules

goodness-score (rule, cn) = benefit (rule, cn) – cost (rule, cn)

– cost (rule, cn): optimizer’s estimated cost for SQL statements– benefit (rule, cn): reduce the benefit if a rule is applied for a while

without making any progress

29

Other ExperimentsOther ExperimentsJoin Discovery Accuracy

0

0.2

0.4

0.6

0.8

1

Inventory 1 Inventory 2 Inventory 3 Inventory 4 Inventory 5

Join Discovery Join Discovery + Schema Matching

accu

racy

(F

1)

0

2

4

6

1 2 3 4 5 6 7 8

Kite over single database

tim

e (s

ec)

max CCN size

Schema matching helps improve join discovery algorithm drastically

Kite also improves single-database keyword search algorithm mHybrid

Documents

Efficient Keyword Search across Heterogeneous Relational Databases