55
Approaching Join Index yet another one join algorithm Mikhail Khludnev principal engineer

Approaching Join Index - Lucene/Solr Revolution 2014

Embed Size (px)

Citation preview

Approaching Join Indexyet another one join algorithm

Mikhail Khludnevprincipal engineer

PRIVILEGED AND CONFIDENTIAL

• Grid Dynamics is a Silicon Valley-based leading provider of scalable, next-generation

commerce technology solutions

• Record of outperformance with Tier 1 retail clients

• Fortune 1000 client relationships

About Me

● principal engineer at Grid Dynamics

● spoke at few last LuceneRevolutions

● contributed BlockJoin query parser for Solr - {!parent}

● blogged about it at http://blog.griddynamics.com/

● tried to fix threads at DataImportHandler

http://google.com/+MikhailKhludnev

You are expected to know

● how Lucene searches/filters

● how it counts facets

● that there are segments

● what is DocValues

● why to join

● RDBMS joins: nested loop join, sort-merge join and hash join.

I’m expected to know

● query-time join

● index-time join

● yet another one join

Lucene/Solr Is Strong

● searching

○ filtering

● analytics

○ facets

○ pivots

○ stats

Lucene/Solr Is Weak in

● robust joins

○ multiple entities

○ relations

Joins in General

Joins in General

Joins in General

PK=FK

Joins in General

PK=FK

Joins in General

PK=FK

children

Joins in General

1:M

parents

Executing Join Query

q

Executing Join Query

q

Executing Join Query

q

Executing Join Query

qfq

Executing Join Query

qfq

Join in General

parents ∩ join-relation ∩ children

Join in General

parents ∩ join-relation ∩ children

● input

● output

● enumeration order

JoinUtil

q

“25”

“17”

“17”

“25”

“25”

“56”

“56”

“56”

“25”

“4”

“61”

FK[doc#]

JoinUtil

q

FK[doc#]

“17”

“17”

“25”

“25”

“56”

“56”

“56”

“25”

“4”

“61”

“25”

“25”

“17”

...

JoinUtil

q

PK

“1” →△…

“17”→△“25”→△...

“25”

“17”

...

“25”

“17”

“17”

“25”

“25”

“56”

“56”

“56”

“25”

“4”

“61”

FK[doc#]

JoinUtil

q

“1” →△…

“17”→△“25”→△...

“25”

“17”

...

fq

“25”

“17”

“17”

“25”

“25”

“56”

“56”

“56”

“25”

“4”

“61”

FK[doc#]PK

Block Join

doc#

Block Join

doc#

Block Join

doc#1

0

0

1

0

0

1

0

0

1

0

Block Join

doc#1

0

0

1

0

0

1

0

0

1

0

q

Block Join

doc#1

0

0

1

0

0

1

0

0

1

0

qfq

Block Join

doc#1

0

0

1

0

0

1

0

0

1

0

qfq

Comparison

JoinUtil BlockJoin

searching slow fast

reindexing

Comparison

JoinUtil BlockJoin

searching slow fast

reindexing fast slow

Comparison

doc#

JoinUtil BlockJoin

searching slow fast

reindexing fast slow

Comparison

doc#

JoinUtil BlockJoin

searching slow fast

reindexing fast slow

Comparison

doc#

JoinUtil BlockJoin

searching slow fast

reindexing fast slow

Comparison

JoinUtil BlockJoin

searching slow < ? < fast

reindexing fast > ? > slow

Join Index

q

doc#[doc#]

3

6

0

3

10

0

6

3

Join Index

q

doc#[doc#]

3

6

0

3

10

0

6

3

Join Index

q

doc#[doc#]

fq 3

6

0

3

10

0

6

3

Join Index

q

fq

doc#[doc#]

2

4

1

6

5 10

9

8

Join Index

q

doc#[doc#]

fq

2

4

1

6

5 10

9

8

Join Index

q

doc#[doc#]

fq

2

4

1

6

5 10

9

8

Benchmarking 2.9 M docs

https://github.com/m-khl/lucene-solr/tree/dvjoin-benchmark

Latency, ms

the bigger the worse

10

27

14

BlockJoin

(i-time)

Join Index

JoinUtil

(q-time)

doc#[doc#]

Indexing is still a problem

2

4

3

6

1

0

3

10

0

6

3

5 10

9

8

Further Plans

● incremental join-index update

● perhaps just calculate and cache it

● or put to dedicated index

● join in both directions

● calculate optimal execution plan of segments enumeration

Summary

● Joins in General

● JoinUtil vs Block-join

● {!scorejoin } - SOLR-6234

● updatable DocValues

● opportunities for improving query-time joins:

○ eliminate term enum

○ choose lower cardinality side for enumeration

References

● Searching relational content with Lucene's BlockJoinQuery

http://blog.mikemccandless.com/

● Solr Experience: search parent-child relations. Part I

Solr block-join support

http://blog.griddynamics.com/

● https://wiki.apache.org/solr/Join

● http://www.slideshare.net/martijnvg/document-relations

● https://issues.apache.org/jira/browse/SOLR-6234 {!scorejoin }

● Updatable DocValues Under the Hood

http://shaierera.blogspot.com/

● Subject: How to openIfChanged the most recent merge?

at: [email protected]

Thanks for Joining us!

http://goo.gl/aYsxiD

Off scope

Joins’ Zoo in Lucene

True Joins

● query-time join

○ JoinUtil

○ {!join }

○ {!scorejoin } - SOLR-6234

● index-time join aka block-join

{!parent}

Joins’ Zoo in Lucene

Workarounds

● term positions/SpanQueries

● FieldCollapsing/Grouping

● term decoration

○ spatial

● multivalue fields

True Joins

● query-time join

○ JoinUtil

○ {!join },

○ {!scorejoin } - SOLR-6234

● index-time join aka block-join

{!parent}

Joins’ Zoo in Lucene

Workarounds

● term positions/SpanQueries

● FieldCollapsing/Grouping

● term decoration

○ spatial

● multivalue fields

True Joins

● query-time join

○ JoinUtil

○ {!join },

○ {!scorejoin } - SOLR-6234

● index-time join aka block-join

{!parent}

Two phase update problem

Subject: How to openIfChanged the most recent merge?

at: [email protected]

JoinUtil

● query-time

● indexing is fast

● searching is slow, why?

○ expensive term enum

○ single enumeration order

BlockJoin

● index-time

● reindexing whole block is as expensive as mandatory

● searching is darn fast, however

○ can’t reorder child docs