Query Sampling Based High Dimensional Hybrid Index

Junqi Zhang, Xiangdong Zhou
Fudan University


Page 2

Nearest Neighbors Query

[Figure: a query Q and its query cover area against clusters centered at O1, O2 and O3 (radii r1, r2; point P1). Labels: Dims, Overlap, Accessed.]

Page 3

Cluster Partitioning Based B+-tree

[Figure: cluster splitting. Cluster i with center Oi is split (radius labels r, rc) into core sub-cluster i and marginal sub-cluster i.]

Page 4

Index Structure

[Figure: leaf nodes of the B+-tree. The index keys C×1, C×2, ..., C×i, C×(i+1), where C is a hash factor, delimit core sub-cluster 1 and marginal sub-cluster 1 through core sub-cluster i and marginal sub-cluster i (centers Oi). A query Q is shown with its query cover area.]

What is the optimal extent of partitioning?

iDistance: determined by experiments

Ours: predicted by a cost model
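The index keys C×1, C×2, ... can be illustrated with a minimal sketch, assuming the iDistance-style mapping key = C×i + dist(p, Oi) for the nearest center Oi. The function name `index_key`, the sample centers and the value of C are hypothetical; the slides only state that C is a hash factor.

```python
import math

def index_key(point, centers, C):
    """Map a point to (cluster number, 1-D key) with key = C*i + dist(point, O_i).

    Clusters are numbered from 1 to match the keys C x 1, C x 2, ... on the slide.
    The mapping itself is an assumption borrowed from iDistance.
    """
    dists = [math.dist(point, o) for o in centers]
    nearest = min(range(len(centers)), key=dists.__getitem__)
    if dists[nearest] >= C:
        # Keys of neighbouring clusters would overlap if a distance reached C.
        raise ValueError("C must exceed every distance to a cluster center")
    i = nearest + 1
    return i, C * i + dists[nearest]

centers = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical centers O1, O2
C = 100.0                             # hypothetical hash factor

print(index_key((3.0, 4.0), centers, C))    # (1, 105.0)
print(index_key((10.0, 13.0), centers, C))  # (2, 203.0)
```

Because every key of cluster i falls in [C×i, C×(i+1)), points that are close to the same center end up in neighbouring leaf entries of the B+-tree.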

Page 5

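Objective of cluster partitioning: lowest query cost

Appropriate M:

[Equation: M = arg min over M of an expression in Nodes, KNN, Q, M, N, c, u, H and a factor 2; its structure is not recoverable from the extracted text.]

Distribute M to each cluster

Overall number of clusters:

[Equation: N_opt, given by an expression in N, u, H and a factor 2; its structure is not recoverable from the extracted text.]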

Page 6

Dimension Curse

dim > 10: tree < scan < VA-file

dim < 10: tree > scan > VA-file

Non-uniform: tree VA-file

VA-file defect

How to improve tree performance?

Page 7

Tree and scan: which is better?

Tree

advantage: filters data instead of linearly scanning the whole file

disadvantage: the positioning cost for each data item is the height of the intermediate nodes, which is higher than for a scan

Scan

advantage: the positioning cost for each data item is 0

disadvantage: linearly scans the whole file

Page 8

Cost viewed from each point

$C = \frac{\text{average cost on tree}}{\text{average cost by linear scan}}$

C < 1: the tree is useful compared with a scan

C >= 1: the tree is useless compared with a scan

Page 9

Data distribution and index performance

Known work: index data in a single index (DIMS, tree)

Real image data set: non-uniform

Non-uniform data aggregate: tree FAST

Page 10

Data type

Sparse data: tree < scan

Dense data: tree > scan

Page 11

Hybrid data type: hybrid index

Hybrid index

Sequential file: sparse data, tree < scan

B+-tree: dense data, tree > scan

Page 12

How to differentiate data types?

Each data item as a unit: difficult

Each cluster ring as a unit: easier

Page 13

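Cluster partitioning

[Figure: cluster split. A cluster is partitioned into an inner circle, a middle circle and an outer circle (radius labels r, r2, r3); cluster centers O1, O2, O3; queries Q.]

To what extent?

[Equation: M = arg min over M of an expression in Nodes, KNN, Q, M, N, c, u, H and a factor 2; its structure is not recoverable from the extracted text.]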
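Splitting a cluster into rings can be sketched as follows. This is a minimal illustration that assumes M concentric rings of equal width around the cluster center; the slides do not say how the ring boundaries are chosen, and the name `assign_rings` is hypothetical.

```python
import math
from collections import defaultdict

def assign_rings(points, center, M):
    """Group the points of one cluster into M concentric rings of equal width.

    Ring 0 is the innermost ring and ring M-1 the outermost.
    Equal-width rings are an assumption made for this sketch.
    """
    dists = [math.dist(p, center) for p in points]
    radius = max(dists)
    rings = defaultdict(list)
    for p, d in zip(points, dists):
        if radius == 0:
            j = 0  # every point sits on the center
        else:
            # The farthest point would land in ring M, so clamp it to M-1.
            j = min(int(d * M / radius), M - 1)
        rings[j].append(p)
    return dict(rings)

cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
print(assign_rings(cluster, (0.0, 0.0), 2))
# {0: [(0.0, 0.0), (1.0, 0.0)], 1: [(2.0, 0.0), (4.0, 0.0)]}
```

Each ring then becomes the unit that is either kept in the tree or moved to the sequence file.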

Page 14

Cluster partitioning based B+-tree

[Figure: leaf nodes of the B+-tree with index keys c×1, c×2, c×3, ..., c×i, c×(i+1), where c is a hash factor; labels O11, O12, O13, ..., Oi1, Oi2; marginal data file; a query Q with its query cover area.]

Page 15

Cluster partitioning based image retrieval system

Outer rings of clusters are often accessed

Page 16

Some rings of clusters are often accessed

Treat outer rings as sparse rings?

Page 17

Frequency of access for each ring

Page 18

Page 19

Hybrid index: cut branches (according to the contribution of each ring to the query cost)

Expected cost: $\left(\frac{N_{c_i}}{u} + H\right) P(c_i)$

Cost by linear scan: $\frac{N_{c_i}}{b}$

[Figure: leaf nodes of the B+-tree with index keys C×1, C×2, C×3, ..., C×i, C×(i+1), where C is a hash factor; labels O11, O12, O13, ..., Oi1, Oi2; marginal data file; a query Q with its query cover area.]

Page 20

Standard for cutting rings: index capability

IC (index capability):

$IC_i = \frac{N_{c_i}}{b} - \left(\frac{N_{c_i}}{u} + H\right) P(c_i)$

Question: how to determine $P(c_i)$?

Page 21

Estimate $P(c_i)$: query sampling

$P(c_i) = \frac{\text{times of ring } c_i \text{ being accessed}}{\text{number of queries } Q}$

Question: for a large database, a large number of queries is expensive

Objective: given confidence a%, make $Q$ minimum

Page 22

Threshold for cutting rings

When IC equals 0:

$IC_i = \frac{N_{c_i}}{b} - \left(\frac{N_{c_i}}{u} + H\right) P(c_i) = 0$

$P(c_i) = \frac{N_{c_i}/b}{N_{c_i}/u + H} = \frac{u N_{c_i}}{b N_{c_i} + u b H}$

Rule: when the probability of a ring being accessed by queries is lower than this threshold, the ring should remain in the tree; otherwise, it should be cut into the sequence file for linear scan.
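The index capability and the cutting rule can be checked numerically with a small sketch. It assumes the reading IC_i = N_ci/b - (N_ci/u + H)·P(ci), where N_ci is the number of points in ring ci, H the tree height, and u and b per-page capacities of the tree and the sequence file. These interpretations of the symbols, the function names and the sample numbers are assumptions; the slides do not define them.

```python
def index_capability(n_ci, p_ci, u, b, H):
    """IC_i: cost of scanning the ring minus its expected cost in the tree."""
    scan_cost = n_ci / b
    expected_tree_cost = (n_ci / u + H) * p_ci
    return scan_cost - expected_tree_cost

def access_threshold(n_ci, u, b, H):
    """Access probability P(c_i) at which IC_i equals 0."""
    return (u * n_ci) / (b * n_ci + u * b * H)

def keep_in_tree(n_ci, p_ci, u, b, H):
    """Rule from the slide: a ring stays in the tree while P(c_i) is below the threshold."""
    return p_ci < access_threshold(n_ci, u, b, H)

# Hypothetical ring: 1000 points, u = 50, b = 100, tree height H = 3.
t = access_threshold(1000, u=50, b=100, H=3)
print(round(t, 4))                            # 0.4348
print(keep_in_tree(1000, 0.2, 50, 100, 3))    # True: rarely accessed, stays in the tree
print(keep_in_tree(1000, 0.6, 50, 100, 3))    # False: cut into the sequence file
```

With these numbers the scan cost is 10 and the tree cost per access is 23, so the break-even access probability is 10/23.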

Page 23

Query sampling algorithm: stop sampling when

$\frac{u N_{c_i}}{b N_{c_i} + u b H} < P(c_i) - t_{a/2}(n-1)\,\frac{S_i}{\sqrt{n}}$

or

$\frac{u N_{c_i}}{b N_{c_i} + u b H} > P(c_i) + t_{a/2}(n-1)\,\frac{S_i}{\sqrt{n}}$

or

$error_i = 0$

The user can balance the accuracy and efficiency of sampling by tuning the confidence a%, and the complexity of this algorithm is less than $N$.
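The stopping rule can be sketched as a sequential test: keep sampling queries until the cutting threshold falls outside the confidence interval of the estimated P(ci). The sketch below makes several assumptions that are not in the slides: it replaces the Student t quantile with a normal quantile (the Python standard library has no t distribution), it omits the error_i condition, and `sample_ring`, `min_samples` and the returned labels are hypothetical.

```python
import math
from statistics import NormalDist

def sample_ring(access_stream, threshold, confidence=0.95, min_samples=30):
    """Sample queries until the threshold leaves the confidence interval of P(c_i).

    access_stream yields 1 if a sampled query accessed the ring, else 0.
    Returns (estimate, samples_used, decision). decision is None when the
    stream ends before the interval excludes the threshold.
    """
    # Normal quantile as a stand-in for t_{a/2}(n-1); an approximation.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n = hits = 0
    for accessed in access_stream:
        n += 1
        hits += accessed
        if n < max(min_samples, 2):
            continue
        p = hits / n
        s = math.sqrt(p * (1 - p) * n / (n - 1))  # sample std of 0/1 outcomes
        half_width = z * s / math.sqrt(n)
        if threshold < p - half_width:
            return p, n, "sequence file"  # accessed too often: cut from the tree
        if threshold > p + half_width:
            return p, n, "tree"           # accessed rarely: keep in the tree
    return (hits / n if n else 0.0), n, None

print(sample_ring([0] * 100, threshold=0.4))  # (0.0, 30, 'tree')
print(sample_ring([1] * 100, threshold=0.4))  # (1.0, 30, 'sequence file')
```

Raising `confidence` widens the interval, so more queries are sampled before a decision is reached.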

Page 24

Query algorithm of hybrid index

Linearly scan the sequence file for sparse data

Retrieve the dense data on the B+-tree
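The two steps can be sketched for a range query, under stated assumptions: a sorted key list stands in for the B+-tree leaf level, keys follow the iDistance-style mapping C×i + dist(p, Oi), and the function names and sample data are hypothetical. The slides target nearest-neighbor queries; a fixed-radius range query is used here only to keep the sketch short.

```python
import bisect
import math

def key_of(p, centers, C):
    """Assumed iDistance-style key: C*i + dist(p, O_i), clusters numbered from 1."""
    d = [math.dist(p, o) for o in centers]
    nearest = min(range(len(centers)), key=d.__getitem__)
    return C * (nearest + 1) + d[nearest]

def build_leaf_level(points, centers, C):
    """Sorted keys and their points; a stand-in for the B+-tree leaves."""
    entries = sorted((key_of(p, centers, C), p) for p in points)
    return [k for k, _ in entries], [p for _, p in entries]

def hybrid_range_query(q, r, keys, pts, centers, C, sequence_file):
    """Return every point within distance r of q from both parts of the index."""
    # Sparse data: linear scan of the sequence file.
    found = {p for p in sequence_file if math.dist(p, q) <= r}
    # Dense data: one key-range lookup per cluster. By the triangle inequality,
    # a match in cluster i has dist(p, O_i) within r of dist(q, O_i).
    for i, o in enumerate(centers, start=1):
        dq = math.dist(q, o)
        lo = bisect.bisect_left(keys, C * i + max(dq - r, 0.0))
        hi = bisect.bisect_right(keys, C * i + dq + r)
        # Candidates are verified by true distance; the set removes duplicates.
        found.update(p for p in pts[lo:hi] if math.dist(p, q) <= r)
    return found

centers = [(0.0, 0.0), (10.0, 10.0)]
dense = [(1.0, 0.0), (0.0, 2.0), (10.0, 11.0), (9.0, 10.0)]
sparse = [(5.0, 5.0), (20.0, 20.0)]
keys, pts = build_leaf_level(dense, centers, 100.0)
print(sorted(hybrid_range_query((0.0, 0.0), 2.5, keys, pts, centers, 100.0, sparse)))
# [(0.0, 2.0), (1.0, 0.0)]
```

The sequence file is scanned on every query, which is why only rarely useful tree branches are worth moving into it.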

Page 25

Thanks!