Boolean + Ranking: Querying a Database by K-Constrained Optimization


The Database and Info. Systems Lab, University of Illinois at Urbana-Champaign

Boolean + Ranking: Querying a Database by K-Constrained Optimization

Zhen Zhang

Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang

AIM 2

Many queries naturally combine Boolean and ranking

Information retrieval: ranking query
  Top 5 ranked by gpa

Traditional databases: Boolean query
  dept = CS and year = 2

Database applications on the Web: Boolean + ranking
  Qualifying constraint B: dept = CS and year = 2
  Quantifying function R: gpa
  Find top answers

AIM 3

Motivating scenarios

Data retrieval: Find houses in a certain price range with a good price/sqrft ratio

Data analysis: Find products with the highest sale increase in consecutive years

Select h.address from House h
Where h.price ≤ 200k or h.price ≥ 400k
Order by h.size/|h.price-300k| Limit 1

Select h.address from House h, CrimeRate c
Where (h.price ≤ 200k or h.price ≥ 400k) and h.zipcode = c.zipcode
Order by h.size/|h.price-300k| * c.crimerate^-1 Limit 10

Select s1.itemid from Sales s1, Sales s2
Where s1.itemid = s2.itemid and s2.year - s1.year = 1
Order by s2.sale - s1.sale Limit 10

AIM 4

Boolean + Ranking form a coherent goal function

Boolean B + Ranking R = Goal function G

For a tuple t

\[
G(t) = B(t) \cdot R(t) =
\begin{cases}
R(t) & \text{if } B(t) \text{ is true} \\
0 & \text{if } B(t) \text{ is false (i.e., the lowest score)}
\end{cases}
\]

AIM 5

The nature of Boolean + Ranking is a K-constrained optimization query

Optimize goal function G over database D

Goal function G: h.size/|h.price-300k| [h.price ≤ 200k ∨ h.price ≥ 400k]

Database D:

    Addr                Zip    Price  Size
1.  Oak park, Chicago   60644  600K   4500
2.  Mattis, Champaign   61821  350K   2000
3.  …                          150K   1000
4.  …                          250K   2000
5.  …                          300K   3500
6.  …                          80K    500
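
The running example can be checked by brute force. The sketch below is only a baseline scan, not the evaluation method the talk proposes: it scores every tuple with G = B * R and keeps the best k. The function names (`boolean_b`, `ranking_r`, `goal_g`, `top_k_by_scan`) are made up for illustration, and prices are assumed to be in thousands.

```python
# Example tuples copied from the slide's table: (id, price in thousands, size).
HOUSES = [
    (1, 600, 4500),
    (2, 350, 2000),
    (3, 150, 1000),
    (4, 250, 2000),
    (5, 300, 3500),
    (6, 80, 500),
]


def boolean_b(price):
    """Qualifying constraint B: price <= 200k or price >= 400k."""
    return price <= 200 or price >= 400


def ranking_r(price, size):
    """Quantifying function R: size / |price - 300k|."""
    return size / abs(price - 300)


def goal_g(price, size):
    """G = B * R: R(t) when B(t) holds, otherwise 0 (the lowest score)."""
    # B is tested first, so R is never evaluated at price == 300.
    return ranking_r(price, size) if boolean_b(price) else 0.0


def top_k_by_scan(houses, k):
    """Naive baseline: score every tuple, then keep the k best."""
    scored = [(goal_g(price, size), tid) for tid, price, size in houses]
    scored.sort(reverse=True)
    return scored[:k]


print(top_k_by_scan(HOUSES, 2))
```

Under these assumptions tuple 1 ranks first with a score of 15, followed by tuple 3. Tuples 2, 4, and 5 fail B and therefore score 0.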

AIM 6

What is the query evaluation mechanism?

Ranking query + Boolean query

How to answer?

AIM 7

Current techniques lack a global search mechanism

If evaluated as separate operators (D, then Boolean query B, then ranking query R)
  Current techniques optimize only condition-by-condition

If searched by an overall goal function G as a ranking function (D, then goal function G combining B and R)
  Current techniques restrict G to be monotonic

AIM 8

Our thesis: Evaluate query as its nature suggests!

Optimize G over D

OPT*: function optimization of G combined with discrete state search over D

AIM 9

We view compound index as discrete space

    Addr                Zip    Price  Size
1.  Oak park, Chicago   60644  600K   4500
2.  Mattis, Champaign   61821  350K   2000
3.  …                          150K   1000
4.  …                          250K   2000
5.  …                          300K   3500
6.  …                          80K    500

AIM 10

We view compound index as discrete space

[Figure: tuples 1 to 6 plotted by price (k) and size, with a hierarchical index on price (nodes b1, b2, b3, …, b6, b7) and a hierarchical index on size (nodes a1, a2, a3, …, a6, a7), each node covering a range of values.]

AIM 11

We view compound index as discrete space

[Figure: the price index (nodes b1, b2, b3, …, b6, b7) and the size index (nodes a1, a2, a3, …, a6, a7) over the plotted tuples, together with a graph of states M11; M22, M32, M23, M33; M55, M56, M66, M67, M75, M76, M77, …; and tuples 1, 2, 4, 5.]

Mij = (ai, bj)

AIM 12

We view compound index as discrete space

[Figure: the price and size indices with the graph of states Mij built over them.]

Mij = (ai, bj): conceptually, the combined space
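
One way to read Mij = (ai, bj) is that a combined state pairs one node from each index, and that refining a state pairs up the children of its two nodes. The sketch below illustrates that reading only. It assumes each index is a binary tree numbered heap-style (node i has children 2i and 2i+1, nodes 4 to 7 are leaves), which is inferred from the node labels in the figure rather than stated on the slides. `FIRST_LEAF`, `index_children`, and `state_children` are hypothetical names.

```python
from itertools import product

FIRST_LEAF = 4  # assumption: nodes 1..3 are internal, nodes 4..7 are leaves


def index_children(i):
    """Children of node i in one index; a leaf stands for itself."""
    return [i] if i >= FIRST_LEAF else [2 * i, 2 * i + 1]


def state_children(state):
    """Children of combined state M_ij = (a_i, b_j)."""
    i, j = state
    if i >= FIRST_LEAF and j >= FIRST_LEAF:
        return []  # both indices are at leaf level: nothing left to refine
    # Pair every child of a_i with every child of b_j.
    return list(product(index_children(i), index_children(j)))


print(state_children((1, 1)))
print(state_children((3, 3)))
```

With this numbering, M11 expands to the four states (2, 2), (2, 3), (3, 2), (3, 3), and M33 expands to (6, 6), (6, 7), (7, 6), (7, 7), which matches the state labels that survive in the figure.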

AIM 13

How to perform the search in the space?

What is the search mechanism?
  How to conceptually view the index space of D for search?

How to guide the search?
  How to use function G to focus the search?

AIM 14

Challenge 1: What is the search mechanism?

AIM 15

We encode as A* because it’s optimal

What A* is: finding the shortest path

Why we choose it: completeness and optimality with proper heuristics
  Complete: guaranteed to find the shortest path
  Optimal: visits the least number of nodes

[Figure: example weighted graph with an origin node and a destination node.]

AIM 16

Encoding our problem into shortest path is challenging

How to encode:
  a tuple → a path?
  score of tuple → distance of path?

K-constrained optimization: find a tuple with maximal score

Shortest path: find a path with minimal distance

AIM 17

Therefore, we encode K-constrained opt. as:

How to encode a tuple to a path?
  Adding a virtual target t* only reachable through tuples

How to encode maximal tuple with minimal path?
  Quality of path depends solely on the tuple it passes by
  For tuple state t: D(t, t*) = -G(t)
  For two states r, u: D(r, u) = 0

[Figure: state graph from M11 through M22, M32, M23, M33 and M55, M56, M66, M67, M75, M76, M77 down to tuples 1, 2, 4, 5 and the virtual target t*; links between states are labeled 0, and links from tuples to t* are labeled with negated scores such as -G(1) and -G(4).]
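
A small sketch can show why this encoding turns "maximal score" into "minimal distance". The graph below is hypothetical (the state names `root`, `left`, `right` are not from the slides), and the scores are the G values of the qualifying example houses. The search here is a plain exhaustive recursion over an acyclic graph, used only to check the encoding. It is not A*.

```python
from functools import lru_cache

# Hypothetical state graph: index states lead down to tuple states.
EDGES = {
    "root": ["left", "right"],
    "left": ["t3", "t6"],
    "right": ["t1"],
}
# G values for tuples 1, 3, 6 of the house example (prices in thousands).
G = {"t1": 4500 / 300, "t3": 1000 / 150, "t6": 500 / 220}
TARGET = "t*"


def successors(state):
    """Yield (next_state, distance) pairs under the slide's encoding."""
    if state in G:
        # A tuple state reaches the virtual target with D(t, t*) = -G(t).
        yield TARGET, -G[state]
    for nxt in EDGES.get(state, []):
        # Every other link costs D(r, u) = 0.
        yield nxt, 0.0


@lru_cache(maxsize=None)
def shortest(state):
    """Return (distance, path) of the shortest path from state to t*."""
    if state == TARGET:
        return 0.0, (TARGET,)
    best = (float("inf"), ())
    for nxt, dist in successors(state):
        sub_dist, sub_path = shortest(nxt)
        best = min(best, (dist + sub_dist, (state,) + sub_path))
    return best


distance, path = shortest("root")
print(distance, path)
```

Because every path to t* passes through exactly one tuple and only that last link has a nonzero weight, the shortest distance equals the negated maximum of G, and the path identifies the top tuple.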

AIM 18

Challenge 2: How to guide the search?

AIM 19

We use function opt. to sketch the landscape of G

Function optimization measures quality of states

Function optimization enables:
  1. How to define heuristics?
  2. How to configure space?
  3. Where to start the search?

AIM 20

1. Define admissible heuristics: Measure tightest upper bound

H(region) = OPTMAX(G, region), i.e., the maximal value of G in the region

To guarantee completeness
  A* requires admissible heuristics, i.e., estimates that are optimistic

To ensure admissible heuristics
  Function optimization gives the tightest upper bound
    Analytical approaches
    Numeric analysis package
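
For the running house query, the upper bound can be worked out analytically. The sketch below is one such derivation under stated assumptions: a region is an axis-aligned box of prices (in thousands) and sizes, size is non-negative, and G is size/|price-300| restricted to price ≤ 200 or price ≥ 400. G grows with size and shrinks as the price moves away from 300, so the maximum sits at the largest size and the qualifying price closest to 300. `upper_bound` is a hypothetical helper name.

```python
def upper_bound(price_lo, price_hi, size_lo, size_hi):
    """Maximal value of G over the box [price_lo, price_hi] x [size_lo, size_hi]."""
    gaps = []  # candidate values of |price - 300| at qualifying prices
    if price_lo <= 200:
        # Closest qualifying price below 300 is min(price_hi, 200).
        gaps.append(300 - min(price_hi, 200))
    if price_hi >= 400:
        # Closest qualifying price above 300 is max(price_lo, 400).
        gaps.append(max(price_lo, 400) - 300)
    if not gaps:
        return 0.0  # no price in the region satisfies B
    return size_hi / min(gaps)


print(upper_bound(0, 250, 0, 3000))
print(upper_bound(250, 600, 3000, 4500))
print(upper_bound(250, 350, 0, 6000))
```

The first region is bounded by 30, the second by 45, and the third by 0 because none of its prices qualifies. No tuple inside a region can score above its bound, which is the optimistic-estimate property the slide asks for.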

AIM 21

2. Configure descending space: disconnect uphills

To guarantee optimality
  A* requires descending heuristics

To ensure descending heuristics
  Remove uphill links

[Figure: the state graph (M11; M22, M32, M23, M33; M55, M56, M66, M67, M75, M76, M77; tuples 1, 2, 4, 5) with uphill links removed.]

AIM 22

3. Find right start point: Start from local optima

To guarantee correctness
  Every tuple state must be reachable from start states
  Taking only downhills requires starting from high points

To ensure reachability
  Initial states should contain all local optima

[Figure: the state graph (M11; M22, M32, M23, M33; M55, M56, M66, M67, M75, M76, M77; tuples 1, 2, 4, 5) with the start states marked.]
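
The two configuration steps (remove uphill links, then start from every local optimum) can be illustrated on a toy graph. The states, links, and heuristic values below are invented for illustration and are not taken from the slides. The sketch keeps each link only in the direction where H does not increase, collects the states with no strictly higher neighbor, and checks reachability.

```python
from collections import deque

# Hypothetical heuristic values H(state) and undirected neighbor links.
H = {"A": 30.0, "B": 45.0, "C": 10.0, "D": 45.0, "E": 5.0}
LINKS = [("A", "B"), ("A", "C"), ("C", "D"), ("C", "E")]


def downhill_graph(h, links):
    """Keep each link only in the direction where H does not increase."""
    graph = {s: [] for s in h}
    for u, v in links:
        if h[v] <= h[u]:
            graph[u].append(v)
        if h[u] <= h[v]:
            graph[v].append(u)
    return graph


def local_optima(h, links):
    """States with no strictly higher neighbor: the required start states."""
    dominated = set()
    for u, v in links:
        if h[u] < h[v]:
            dominated.add(u)
        elif h[v] < h[u]:
            dominated.add(v)
    return sorted(s for s in h if s not in dominated)


def reachable(graph, starts):
    """Breadth-first set of states reachable from the start states."""
    seen, queue = set(starts), deque(starts)
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


graph = downhill_graph(H, LINKS)
starts = local_optima(H, LINKS)
print(starts, reachable(graph, starts) == set(H))
```

In this toy graph the local optima are B and D. Starting from B alone never reaches D, since the only link into D is uphill, which is why the start set needs every local optimum.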

AIM 23

Putting together: Executing A* on the configured space

Search is implemented as priority queue driven traversal

[Figure: top-down traversal of the state graph (M11; M22, M32, M23, M33; M55, M56, M57, M66, M67, M75, M76, M77, …) toward tuples 1, 2, 4, 5.]
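
A priority-queue driven traversal can be sketched on the house example. This is a simplified top-down best-first search, not the authors' OPT*. As a stand-in for the index hierarchy it halves the price-sorted tuple list recursively, and it scores each region with an upper bound on G computed from the region's bounding box. The names `goal`, `bound`, and `top_k` are hypothetical, and prices are assumed to be in thousands.

```python
import heapq

# Example tuples from the slides: (id, price in thousands, size).
HOUSES = [(1, 600, 4500), (2, 350, 2000), (3, 150, 1000),
          (4, 250, 2000), (5, 300, 3500), (6, 80, 500)]


def goal(price, size):
    """G = R if B holds, else 0, for the running house query."""
    return size / abs(price - 300) if price <= 200 or price >= 400 else 0.0


def bound(region):
    """Upper bound of G over the region's bounding box."""
    p_lo = min(p for _, p, _ in region)
    p_hi = max(p for _, p, _ in region)
    s_hi = max(s for _, _, s in region)
    gaps = []
    if p_lo <= 200:
        gaps.append(300 - min(p_hi, 200))
    if p_hi >= 400:
        gaps.append(max(p_lo, 400) - 300)
    return s_hi / min(gaps) if gaps else 0.0


def top_k(houses, k):
    """Best-first traversal: always expand the entry with the highest bound."""
    root = sorted(houses, key=lambda t: t[1])  # order by price, like an index
    heap = [(-bound(root), 0, root)]           # max-heap via negated keys
    counter, answers = 1, []
    while heap and len(answers) < k:
        neg_bound, _, region = heapq.heappop(heap)
        if len(region) == 1:
            # For a single tuple the bound equals its exact score, and no
            # queued entry has a higher bound, so it is the next answer.
            answers.append((-neg_bound, region[0][0]))
            continue
        mid = len(region) // 2
        for child in (region[:mid], region[mid:]):
            heapq.heappush(heap, (-bound(child), counter, child))
            counter += 1
    return answers


print(top_k(HOUSES, 1))
```

The search stops as soon as k tuples have been popped, so regions whose bounds fall below the k-th answer are never expanded. On six tuples the saving is negligible. The point is only the control flow.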

AIM 24

Putting together: Executing A* on the configured space

Bottom-up approach is always better than top-down

[Figure: two copies of the state graph (M11; M22, M32, M23, M33; M55, M56, M57, M66, M67, M75, M76, M77, …; tuples 1, 2, 4, 5), one traversed top-down and one traversed bottom-up.]

AIM 25

Experiments

Comparison vs.
  Boolean then ranking
  Ranking then Boolean

Metrics: nodes accessed = Nl + Nt

Settings:
  Benchmark queries over real dataset
  Controlled queries over synthetic dataset

AIM 26

Benchmark queries

Dataset: 19,706 real estate listings crawled online

Queries
  Q1: size * bedrms / |price-450k| : [40k ≤ price ≤ 50k]
  Q2: size * e^bedrms / |price-350k| : [price < 400k ∧ size > 4000]
  Q3: size / price : [bedrms = 3 ∨ bedrms = 4]

[Figure: chart comparing BR_unclustered, BR_clustered, and OPT* on Q1, Q2, and Q3.]

AIM 27

Controlled queries

Datasets
  Three randomly generated datasets of 100k points
  Uniform, gaussian, logvariatenormal

Queries
  Linear average queries: (e.g., 0.4*a + 0.6*b)
  Nearest neighbor queries: (e.g., (x-3)^2 + (y-4)^2)
  Join queries: (0.4*R.a + 0.6*S.b: R.c=R.d)

AIM 28

Conclusion

Problem
  Study K-constrained optimization queries as Boolean + ranking

Abstraction
  Encode K-constrained optimization into shortest path problem

Framework
  Develop OPT* to process K-constrained optimization

AIM 29

Thank you!

Questions?

AIM 30

How to implement function optimization?
How do we compare with RankSQL?
If bottom-up is always better, why consider top-down?
Computing upper bound for each region is costly
Random vs. sequential I/O
Assuming indices on every attribute?
Materialize state space for every query?
Exponential number of states when the number of attributes grows
  Not every attribute has an index on it
  Selectively choose the right index (attribute) to use
  We do perform experiments to study how the system scales with #attr
Your algorithm is not optimal because you change the space
