Reaching the Top-k of the Skyline: An Efficient Indexed Algorithm for Top-k Skyline Queries

Marlene Goncalves and María-Esther Vidal

Universidad Simón Bolívar, Caracas, Venezuela

{mgoncalves,mvidal}@usb.ve

Page 2

Motivating Example

«There are two Open Faculty Positions»

«Candidates will be evaluated in terms of:

Degree, Publications, Experience»

«Criteria to select the best Candidates: higher academic degree,

maximum number of publications and maximum years of experience»

«Ties will be broken by using the GPA»

Solutions: Skyline and Top-k

Page 3

Motivation

Query: Candidates with the best academic degree, number of publications and experience

Id  Degree   Publications  Experience  GPA
1   Post Dr  9             2           3.75
2   Post Dr  10            1           4
3   PhD      12            2           3.75
4   MsC      13            4           3.6
5   BEng     7             3           4.5
6   BEng     6             3           3.5
7   BEng     5             1           4

Answer: No candidate is better than all the others in all criteria simultaneously.

Page 4


Skyline

Query: Select the candidates with better degree, number of publications and experience

Id  Degree   Publications  Experience  GPA
1   Post Dr  9             2           3.75
2   Post Dr  10            1           4
3   PhD      12            2           3.75
4   MsC      13            4           3.6
5   BEng     7             3           4.5
6   BEng     6             3           3.5
7   BEng     5             1           4

User Criteria (Equally Important!): Multicriteria Function

• Degree: Maximum

• Publications: Maximum

• Experience: Maximum

Skyline selects candidates 1,2,3 and 4.

i.e., multi-criteria induce a partial order, and ties need to be broken
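The Skyline of the example can be checked with a small sketch. It assumes that degrees are ordered BEng < MsC < PhD < Post Dr (the slides do not state the order explicitly), and the names `DEGREE_RANK`, `dominates` and `skyline` are illustrative only.

```python
# Assumed degree order (not stated explicitly on the slides).
DEGREE_RANK = {"BEng": 0, "MsC": 1, "PhD": 2, "Post Dr": 3}

# id -> (degree, publications, experience), taken from the example table.
CANDIDATES = {
    1: ("Post Dr", 9, 2),
    2: ("Post Dr", 10, 1),
    3: ("PhD", 12, 2),
    4: ("MsC", 13, 4),
    5: ("BEng", 7, 3),
    6: ("BEng", 6, 3),
    7: ("BEng", 5, 1),
}


def criteria(cid):
    """Numeric vector of the three criteria, all to be maximized."""
    degree, pubs, exp = CANDIDATES[cid]
    return (DEGREE_RANK[degree], pubs, exp)


def dominates(a, b):
    """a dominates b: at least as good everywhere, strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(ids):
    """Candidates that no other candidate dominates (naive pairwise check)."""
    return sorted(
        c for c in ids
        if not any(dominates(criteria(o), criteria(c)) for o in ids)
    )


print(skyline(CANDIDATES))  # [1, 2, 3, 4]
```

Candidates 5, 6 and 7 are all dominated by candidate 4, while candidates 1 to 4 are pairwise incomparable, which is the partial order mentioned above.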

Page 5

Top-k

Select the two candidates with the best GPA

Id  Degree   Publications  Experience  GPA
5   BEng     7             3           4.5
2   Post Dr  10            1           4
7   BEng     5             1           4
1   Post Dr  9             2           3.75
3   PhD      12            2           3.75
4   MsC      13            4           3.6
6   BEng     6             3           3.5

User Criteria (Score Function!)

• GPA: Maximum

Top-k identifies candidates 5 and 2, but these candidates do not necessarily have the best academic merit
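A minimal sketch of the Top-k selection on the example data follows. The tie-breaking rule (smaller id first) is an assumption made here so that the output is deterministic; the slides do not say how candidates 2 and 7, both with GPA 4, are ordered.

```python
import heapq

# id -> GPA, taken from the example table.
GPA = {1: 3.75, 2: 4.0, 3: 3.75, 4: 3.6, 5: 4.5, 6: 3.5, 7: 4.0}


def top_k(scores, k):
    """Return the k ids with the highest score.

    Ties are broken in favour of the smaller id (an assumption of this sketch).
    """
    return heapq.nlargest(k, scores, key=lambda cid: (scores[cid], -cid))


print(top_k(GPA, 2))  # [5, 2]
```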

Page 6

Preference based Queries

Select the two candidates with the highest GPA among the candidates with the best degree, number of publications and experience.

– Cases:

• Skyline produces the candidates with better degree, number of publications and Experience

– The Skyline may be huge, and post-processing over the Skyline is required to select the k answers.

• Top-k identifies the two candidates with better GPA

– False answers

– Loss of results

Top-k selects two candidates with good GPA

Skyline selects four candidates in equality of conditions

So…

A combined approach is required!!

Page 7

Top-k Skyline

Query: Select the two candidates with the highest GPA among the candidates that have the best degree, number of publications and experience

Id  Degree   Publications  Experience  GPA
1   Post Dr  9             2           3.75
2   Post Dr  10            1           4
3   PhD      12            2           3.75
4   MsC      13            4           3.6
5   BEng     7             3           4.5
6   BEng     6             3           3.5
7   BEng     5             1           4

Answer: The two candidates with the highest value of the score function among the candidates preselected by the multicriteria function

Top-k Skyline selects candidates 1 and 2 with the highest GPAs among the ones with similar academic records
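The combined query can be sketched as the naive composition of both operators: build the whole Skyline first, then keep the k best by GPA. This is only a baseline for illustration, not an efficient evaluation strategy. The degree order (BEng < MsC < PhD < Post Dr) and the tie-break on the smaller id are assumptions of this sketch.

```python
import heapq

# Assumed degree order (not stated explicitly on the slides).
DEGREE_RANK = {"BEng": 0, "MsC": 1, "PhD": 2, "Post Dr": 3}

# id -> (degree, publications, experience, GPA), from the example table.
CANDIDATES = {
    1: ("Post Dr", 9, 2, 3.75),
    2: ("Post Dr", 10, 1, 4.0),
    3: ("PhD", 12, 2, 3.75),
    4: ("MsC", 13, 4, 3.6),
    5: ("BEng", 7, 3, 4.5),
    6: ("BEng", 6, 3, 3.5),
    7: ("BEng", 5, 1, 4.0),
}


def criteria(cid):
    """Multicriteria vector (degree rank, publications, experience)."""
    degree, pubs, exp, _gpa = CANDIDATES[cid]
    return (DEGREE_RANK[degree], pubs, exp)


def dominates(a, b):
    """a dominates b: at least as good everywhere, strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def top_k_skyline(k):
    """Naive baseline: full Skyline first, then the k highest GPAs."""
    sky = [
        c for c in CANDIDATES
        if not any(dominates(criteria(o), criteria(c)) for o in CANDIDATES)
    ]
    # Ties on GPA are broken by the smaller id (assumption of this sketch).
    return heapq.nlargest(k, sky, key=lambda c: (CANDIDATES[c][3], -c))


print(top_k_skyline(2))  # [2, 1]
```

Candidates 1 and 3 tie at GPA 3.75, so the second position depends on the tie-breaking rule; with the rule assumed here the result matches the candidates named on the slide.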

Page 8

Outline

Related Work

Our Approach

Top-k Skyline Evaluation

Experimental Study

Conclusions and Future Work

Page 9

Related Work

• Multi-criteria-based approaches: SKYLINE

– BNL, SFS, LESS

– Poor ranking capabilities

– Answers can be huge!

• Score-based approaches: Top-k

– MPro, Upper, TA, FA, NRA

– High ranking capabilities

– Answers may be incomplete

• Combined approaches: Top-k Skyline

– BMORTKS, BDTKS

– Metrics: Skyline Frequency

Neither Skyline nor Top-k provides both high expressivity and high ranking capabilities.

Existing Top-k Skyline techniques completely build the Skyline.

Techniques to efficiently evaluate ranking approaches are required.

Page 10

Our Challenge

• Efficient implementation of the Top-k Skyline operator: build the Top-k Skyline set while minimizing the non-necessary probes.

A probe p of functions m or f is necessary if and only if p is evaluated on an object o that belongs to the Top-k Skyline.

Id  Degree   Publications  Experience  GPA
1   Post Dr  9             2           3.75
2   Post Dr  10            1           4
3   PhD      12            2           3.75
4   MsC      13            4           3.6
5   BEng     7             3           4.5
6   BEng     6             3           3.5
7   BEng     5             1           4

Non-Necessary Probes (evaluations of the multi-criteria or score function)!

Goal: Identify only the elements of the Skyline that belong to the answer

Page 11

Top-k Skyline Evaluation

Indexed Solutions

– BDTKS (Basic Distributed Top-k Skyline)

– BMORTKS (Basic Multi-Objective Retrieval for Top-k Skyline)

– TKSI (Top-K SkyIndex)

Page 12

BDTKS

Top-k Skyline Evaluation

Query: Select the two candidates with the highest GPA among the candidates that have the best degree, number of publications and experience.

Index 1           Index 2             Index 3
Id  Degree        Id  Publications    Id  Experience
1   Post Dr       4   13              4   4
2   Post Dr       3   12              5   3
3   PhD           2   10              6   3
4   MsC           1   9               1   2
5   BEng          5   7               3   2
6   BEng          6   6               2   1
7   BEng          7   5               7   1

Final Object!

Page 13

BDTKS

Top-k Skyline Evaluation

Query: Select the two candidates with the highest GPA among the candidates that have the best degree, number of publications and experience

Id  Degree   Publications  Experience  GPA
1   Post Dr  9             2           3.75
2   Post Dr  10            1           4
3   PhD      12            2           3.75
4   MsC      13            4           3.6

Partial scanning of the database (until the final object is found).

But BDTKS completely builds the Skyline.
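The scan described on these two slides can be sketched as follows. This is a simplified reading, not the authors' implementation. It assumes that the indexes are sorted from best to worst and read round-robin, and that the final object is the first object seen in every index. It also assumes that degrees are encoded as ranks (BEng=0, MsC=1, PhD=2, Post Dr=3). The name `bdtks_sketch` and the tie handling are illustrative.

```python
# id -> (degree rank, publications, experience, GPA); the degree encoding
# BEng=0, MsC=1, PhD=2, Post Dr=3 is an assumption of this sketch.
ROWS = {
    1: (3, 9, 2, 3.75), 2: (3, 10, 1, 4.0), 3: (2, 12, 2, 3.75),
    4: (1, 13, 4, 3.6), 5: (0, 7, 3, 4.5), 6: (0, 6, 3, 3.5),
    7: (0, 5, 1, 4.0),
}


def dominates(a, b):
    """a dominates b: at least as good everywhere, strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def bdtks_sketch(rows, m, k):
    """Scan m sorted indexes until a final object appears, then rank."""
    if not rows:
        return [], None, set()
    # One index per criterion, best value first (ties ordered by id).
    indexes = [sorted(rows, key=lambda o: (-rows[o][i], o)) for i in range(m)]
    pos = [0] * m
    seen = {}  # id -> set of indexes in which the object has been seen
    final = None
    while final is None:
        for i in range(m):
            oid = indexes[i][pos[i]]
            pos[i] += 1
            seen.setdefault(oid, set()).add(i)
            if len(seen[oid]) == m:  # seen in every index: final object
                final = oid
                break
    # Also read objects that tie with the final object in some index,
    # since the final object does not strictly beat them there.
    for i in range(m):
        while pos[i] < len(rows) and rows[indexes[i][pos[i]]][i] == rows[final][i]:
            seen.setdefault(indexes[i][pos[i]], set()).add(i)
            pos[i] += 1
    # The whole Skyline is built among the seen objects, then ranked by score.
    sky = [c for c in seen
           if not any(dominates(rows[o][:m], rows[c][:m]) for o in seen)]
    answer = sorted(sky, key=lambda c: (-rows[c][m], c))[:k]
    return answer, final, set(seen)


answer, final, seen = bdtks_sketch(ROWS, 3, 2)
print(answer, final, sorted(seen))  # [2, 1] 4 [1, 2, 3, 4, 5, 6]
```

On the example, candidate 4 is the first object seen in all three indexes, candidate 7 is never read, and the Skyline {1, 2, 3, 4} is still computed in full before the two best GPAs are selected.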

Page 14

BMORTKS

Top-k Skyline Evaluation

Query: Select the two candidates with the highest GPA among the candidates that have the best degree, number of publications and experience.

Index 1           Index 2             Index 3
Id  Degree        Id  Publications    Id  Experience
1   Post Dr       4   13              4   4
2   Post Dr       3   12              5   3
3   PhD           2   10              6   3
4   MsC           1   9               1   2
5   BEng          5   7               3   2
6   BEng          6   6               2   1
7   BEng          7   5               7   1

Virtual object (last score seen in each index), as the indexes are scanned:

PostDr,?,? → PostDr,13,? → PostDr,13,4 → PostDr,13,4 → PostDr,12,4 → PostDr,12,3 → PhD,12,3 → PhD,10,3 → PhD,10,3 → MsC,10,3 → MsC,9,3

Page 15

BMORTKS

Top-k Skyline Evaluation

Query: Select the two candidates with the highest GPA among the candidates that have the best degree, number of publications and experience

Id  Degree   Publications  Experience  GPA
1   Post Dr  9             2           3.75
2   Post Dr  10            1           4
3   PhD      12            2           3.75
4   MsC      13            4           3.6

Partial scanning of the database (until a seen object dominates the final object).

But BMORTKS also completely builds the Skyline.
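The virtual-object idea can be sketched as follows. This is a simplified reading, not the authors' implementation. It assumes round-robin access to indexes sorted from best to worst, a dominance check after every access, and degrees encoded as ranks (BEng=0, MsC=1, PhD=2, Post Dr=3). The name `bmortks_sketch` is illustrative, and the stopping point of this sketch may differ from the one animated on the slide.

```python
# id -> (degree rank, publications, experience, GPA); the degree encoding
# BEng=0, MsC=1, PhD=2, Post Dr=3 is an assumption of this sketch.
ROWS = {
    1: (3, 9, 2, 3.75), 2: (3, 10, 1, 4.0), 3: (2, 12, 2, 3.75),
    4: (1, 13, 4, 3.6), 5: (0, 7, 3, 4.5), 6: (0, 6, 3, 3.5),
    7: (0, 5, 1, 4.0),
}


def dominates(a, b):
    """a dominates b: at least as good everywhere, strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def bmortks_sketch(rows, m, k):
    """Scan until a seen object dominates the virtual object, then rank."""
    indexes = [sorted(rows, key=lambda o: (-rows[o][i], o)) for i in range(m)]
    pos = [0] * m
    virtual = [None] * m  # last value seen in each index
    seen = set()
    done = False
    while not done and any(p < len(rows) for p in pos):
        for i in range(m):
            if pos[i] >= len(rows):
                continue
            oid = indexes[i][pos[i]]
            pos[i] += 1
            seen.add(oid)
            virtual[i] = rows[oid][i]
            # Every unseen object is at most as good as the virtual object,
            # so a seen object that dominates it dominates all unseen ones.
            if None not in virtual and any(
                dominates(rows[s][:m], virtual) for s in seen
            ):
                done = True
                break
    # The whole Skyline is built among the seen objects, then ranked by score.
    sky = [c for c in seen
           if not any(dominates(rows[o][:m], rows[c][:m]) for o in seen)]
    answer = sorted(sky, key=lambda c: (-rows[c][m], c))[:k]
    return answer, virtual, seen


answer, virtual, seen = bmortks_sketch(ROWS, 3, 2)
print(answer, virtual, sorted(seen))  # [2, 1] [1, 10, 3] [1, 2, 3, 4, 5, 6]
```

Each check against the virtual object is an extra evaluation of the multicriteria function on top of the comparisons among real objects.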

Page 16

TKSI (Top-K SkyIndex)

Top-k Skyline Evaluation

Index 1 to Index 4: one index per attribute (GPA, Degree, Publications, Experience)

Id  GPA     Id  Degree     Id  Publications    Id  Experience
5   4.5     1   Post Dr    4   13              4   4
2   4       2   Post Dr    3   12              5   3
7   4       3   PhD        2   10              6   3
1   3.75    4   MsC        1   9               1   2
3   3.75    5   BEng       5   7               3   2
4   3.6     6   BEng       6   6               2   1
6   3.5     7   BEng       7   5               7   1

Partial scanning of the database (until k incomparable objects are found).

TKSI partially builds the Skyline, and minimizes the non-necessary probes.
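The score-first strategy can be illustrated with a simplified sketch; it is not the authors' TKSI algorithm. The sketch reads the GPA index from best to worst and, for each object, looks for a dominator in a single multicriteria index, stopping as soon as k non-dominated objects are found. The degree encoding (BEng=0, MsC=1, PhD=2, Post Dr=3), the tie-break on the smaller id and the name `tksi_sketch` are assumptions.

```python
# id -> (degree rank, publications, experience, GPA); the degree encoding
# BEng=0, MsC=1, PhD=2, Post Dr=3 is an assumption of this sketch.
ROWS = {
    1: (3, 9, 2, 3.75), 2: (3, 10, 1, 4.0), 3: (2, 12, 2, 3.75),
    4: (1, 13, 4, 3.6), 5: (0, 7, 3, 4.5), 6: (0, 6, 3, 3.5),
    7: (0, 5, 1, 4.0),
}


def dominates(a, b):
    """a dominates b: at least as good everywhere, strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def tksi_sketch(rows, m, k):
    """Return the Top-k Skyline and how many objects were read by score."""
    # Score index: best score first (ties by smaller id).
    by_score = sorted(rows, key=lambda o: (-rows[o][m], o))
    # One multicriteria index (on the first criterion), best value first.
    index0 = sorted(rows, key=lambda o: -rows[o][0])
    answer, examined = [], 0
    for oid in by_score:
        if len(answer) == k:
            break  # k Skyline objects found: the rest is never read
        examined += 1
        dominated = False
        for other in index0:
            # A dominator must be at least as good on the first criterion,
            # so the scan can stop at the first strictly worse entry.
            if rows[other][0] < rows[oid][0]:
                break
            if dominates(rows[other][:m], rows[oid][:m]):
                dominated = True
                break
        if not dominated:
            answer.append(oid)
    return answer, examined


print(tksi_sketch(ROWS, 3, 2))  # ([2, 1], 4)
```

On the example only four of the seven objects are read from the score index (5, 2, 7 and 1), and the Skyline members 3 and 4 are never examined because they cannot be part of the answer for k = 2.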

Page 17

Experimental Study

Dataset and Queries

– 100,000 random data items:

• Value domain: floats between 0 and 1

• Data distribution: Uniform, Gaussian and Mixed

– Sixty random queries. Multi-criteria dimensions range from 2 to 6.

Platform

– SunFire V440, SunOS 5.10, two SPARCv9 processors at 1,281 MHz, 16 GB of RAM and four 73 GB Ultra320 SCSI disks.

– Java 1.5 and Oracle 9i.
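A dataset of this shape could be generated along the following lines. The slides give only the value domain and the names of the distributions, so everything else is an assumption of this sketch: the Gaussian mean (0.5) and standard deviation (0.15), clamping to [0, 1], and reading "Mixed" as alternating uniform and Gaussian columns. The function `make_dataset` is hypothetical.

```python
import random


def clamp(x):
    """Keep a value inside the [0, 1] domain used in the study."""
    return min(1.0, max(0.0, x))


def make_dataset(n, dims, distribution, seed=0):
    """Generate n tuples with `dims` float attributes in [0, 1].

    Assumptions of this sketch: Gaussian(0.5, 0.15) clamped to [0, 1];
    "mixed" alternates uniform (even columns) and Gaussian (odd columns).
    """
    rng = random.Random(seed)

    def value(column):
        kind = distribution
        if kind == "mixed":
            kind = "uniform" if column % 2 == 0 else "gaussian"
        if kind == "uniform":
            return rng.random()
        if kind == "gaussian":
            return clamp(rng.gauss(0.5, 0.15))
        raise ValueError(f"unknown distribution: {distribution}")

    return [tuple(value(c) for c in range(dims)) for _ in range(n)]


# Example: 1,000 tuples with 4 attributes (use n=100_000 for the full size).
data = make_dataset(1_000, 4, "mixed")
print(len(data), len(data[0]))  # 1000 4
```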

Page 18

Experimental Study

Average Skyline Size & Probes

Data Distribution  Average Skyline Size (60 queries)
Uniform            2405
Gaussian           2477
Mixed              2539

Skyline size can be up to 2.6% of the input data!

Probes

BDTKS       BMORTKS
23,749,796  27,201,877

Probes on the virtual object increase the number of probes of the multi-criteria function!

Page 19

BDTKS and TKSI

Experimental Study

[Figure: four charts with x-axis categories BDTKS, k=1, k=10, k=50, k=100, k=500 and k=1000. Y-axes: Log(#Probes), Log(#Access), Log(#Seen Objects) and Log(Time (sec)).]

BDTKS executes fewer probes and requires less evaluation time than BMORTKS.

For small k, TKSI outperforms BDTKS!

Page 20

Conclusions and Future Work

TKSI builds the Skyline only until it has computed the k objects of the answer.

Our experimental results show that TKSI executed fewer probes and consumed less evaluation time.

In the future, we plan to extend TKSI to Web data sources and to incorporate TKSI into an existing DBMS.

Thanks!

Q&A