Transcript
Page 1: Grouping and Joining in Lucene/Solr

Grouping & Joining

Martijn van [email protected] Committer & PMC Member

Thursday, May 17, 2012

Page 2: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Overview

Grouping & Joining

‣ Background

‣ Joining

‣ Result grouping

‣ Conclusion

2

Thursday, May 17, 2012

Page 3: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Lucene’s model

Background

‣ Lucene is document based.

‣ Lucene doesn’t store information about relations between documents.

‣ Data often holds relations.

‣ Good free text search over relational data.

3

Thursday, May 17, 2012

Page 4: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Example

Background

‣ Product

‣ Name

‣ Description

‣ Product-item

‣ Color

‣ Size

‣ Price

‣ Goal: Show the most applicable product based on product-item criteria.

4

Thursday, May 17, 2012

Page 5: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Common Lucene solutions

Background

‣ Compound documents.

‣May result in documents with many fields.

‣ Subsequent searches.

‣May cause a lot network overhead.

‣ Non Lucene based approach:

‣ If free text search isn’t very important use a relational database.

5

Thursday, May 17, 2012

Page 6: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Example domain

Background

‣ Compound Product & Product-items document.

‣ Each product-item has its own field prefix.

6

Thursday, May 17, 2012

Page 7: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Different solutions

Background

‣ Lucene offers solutions to have a 'relational' like search.

‣ Parent child

‣ Grouping & joining aren't naturally supported.

‣ All the solutions do increase the search time.

‣ Some scenarios grouping and joining isn't the right solution.

7

Thursday, May 17, 2012

Page 8: Grouping and Joining in Lucene/Solr

Joining

Modelling relations

Thursday, May 17, 2012

Page 9: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Introduction

Joining

‣ Support for parent child like search from Lucene 3.4

‣ Not a SQL join.

‣ The parent and each children are stored as documents.

‣ Two types:

‣ Index time join

‣ Query time join

9

Thursday, May 17, 2012

Page 10: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Index time join

Joining

‣ Two block join queries:

‣ ToParentBlockJoinQuery

‣ ToChildBlockJoinQuery

‣ One Lucene collector:

‣ ToParentBlockJoinCollector

‣ Index time join requires block indexing.

10

Thursday, May 17, 2012

Page 11: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Block indexing

Joining

‣ Atomically adding documents.

‣ A block of documents.

‣ Each document gets sequentially assigned Lucene document id.

‣ IndexWriter#addDocuments(docs);

11

Thursday, May 17, 2012

Page 12: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Block indexing

Joining

‣ Index doesn't record blocks.

‣ Segment merging doesn’t re-order documents in a segment.

‣ App is responsible for identifying block documents.

‣Marking the last document in a block.

‣ Adding a document to a block requires you to reindex the whole block.

‣ Removing a document from a block doesn’t requires reindexing a block.

12

Thursday, May 17, 2012

Page 13: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Example domain

Joining

‣ Parent is the last document in a block.

13

Thursday, May 17, 2012

Page 14: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Block indexing

Joining

14

Marking parent documents

Thursday, May 17, 2012

Page 15: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Block indexing

Joining

15

Add block

Add block

Thursday, May 17, 2012

Page 16: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

‣ Parent filter marks the parent documents.

‣ Child query is executed in the parent space.

‣ ToChildBlockJoinQuery works in the opposite direction.

ToParentBlockJoinQuery

Joining

16

Thursday, May 17, 2012

Page 17: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Query time joining

Joining

‣ Query time joining is executed in two phases and is field based:

‣ fromField

‣ toField

‣ Doesn’t require block indexing.

17

Thursday, May 17, 2012

Page 18: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Query time joining

Joining

‣ First phase collects all the terms in the fromField for the documents that match with the original query.

‣ Currently doesn’t take the score from original query into account.

‣ The second phase returns the documents that match with the collected terms from the previous phase in the toField.

‣ Two different implementations:

‣ JoinUtil - Lucene (≥ 3.6)

‣ Join query parser - Solr (trunk)18

Thursday, May 17, 2012

Page 19: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Query time joining - Indexing

Joining

19

Referrer the product id.

Thursday, May 17, 2012

Page 20: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Query time joining - Indexing

Joining

20

Thursday, May 17, 2012

Page 21: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Query time joining

Joining

21

‣ Result will contain one product.

‣ Possible to join over two indices.

Thursday, May 17, 2012

Page 22: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Final thoughts

Joining

‣ Joining module has good solutions to model parent child relations.

‣ Use block join if you care about scoring.

‣ Frequent updates can be problematic.

‣ Use query time join for parent child filtering.

‣ Query time join is slower than index time join.

‣Mostly a Lucene feature only.

‣ All code is annotated as experimental.22

Thursday, May 17, 2012

Page 23: Grouping and Joining in Lucene/Solr

Result grouping

Previously known as Field Collapsing.

Thursday, May 17, 2012

Page 24: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Introduction

Result grouping

‣ Group matching documents that share a common property.

‣ Search hit represents a group.

‣ Facet counts & total hit count represent groups.

‣ Per group collect information

‣Most relevant document.

‣ Top three documents.

‣ Aggregated counts24

Thursday, May 17, 2012

Page 25: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Usages

Result grouping

‣ Group documents by a shared property

‣ Product-item by product id (Parent child)

‣ Collapse similar looking documents

‣ E.g. all results from the Wikipedia domains.

‣ Remove duplicates from the search result.

‣ Based on a field that contains a hash

25

Thursday, May 17, 2012

Page 26: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Example domain

Result grouping

‣ Each Product-item is a document, but includes the product data.

26

Thursday, May 17, 2012

Page 27: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Implementation

Result grouping

‣ Result grouping implemented with Lucene collectors.

‣Module in trunk and a contrib in 3.x versions.

‣ Two pass result grouping.

‣ Grouping by indexed field, function or doc values.

‣ Single pass result grouping.

‣ Requires block indexing.

27

Thursday, May 17, 2012

Page 28: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Two pass implementation

Result grouping

‣ First pass collects the top N groups.

‣ Per group: group value + sort value

‣ Second pass collects data for each top group.

‣ The top N documents per group.

‣ Possible other aggregated information.

‣ Second pass search ignores all documents outside topN groups.

28

Thursday, May 17, 2012

Page 29: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Result grouping - Indexing

Result grouping

29

Thursday, May 17, 2012

Page 30: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Result grouping - Searching

Result grouping

30

Thursday, May 17, 2012

Page 31: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Result grouping made easier

Result grouping

31

‣ GroupingSearch

‣ Solr

‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id

‣Many more options:

‣ http://wiki.apache.org/solr/FieldCollapsing

Thursday, May 17, 2012

Page 32: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Parent child result

Result grouping

‣ TopGroups - Equivalent to TopDocs.

‣ Hit count

‣ Group count

‣ Groups

‣ Top documents

‣ Facet and total count can represent groups instead of documents.

‣ But requires more query time.

32

Thursday, May 17, 2012

Page 33: Grouping and Joining in Lucene/Solr

Conclusion

Compare...

Thursday, May 17, 2012

Page 34: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Compare the parent child solutions

Conclusion

‣ Result grouping

‣ + Distributed support & Parent child relation as hit.

‣ - Parent data duplication

‣ - Impact on query time

‣ Joining

‣ + Fast & no data duplication

‣ - Index time join not optimal for updates

‣ - Query time join is limited.34

Thursday, May 17, 2012

Page 35: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Compare the parent child solutions

Conclusion

‣ Compound documents.

‣ + Fast and works out-of-the box with all features.

‣ - Not flexible when it comes to updates.

‣ - Document granularity is set in stone.

35

Thursday, May 17, 2012

Page 36: Grouping and Joining in Lucene/Solr

36

Any questions?

Thursday, May 17, 2012

Page 37: Grouping and Joining in Lucene/Solr

Extra slides

We have time left!

Thursday, May 17, 2012

Page 38: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Future work

Conclusion

‣ Higher level parent-child API.

‣ Needs to cover search & indexing.

‣ Joining

‣ Distributed support.

‣ Represent a hit as a parent child relation in the search result.

‣ Result grouping

‣ Aggregated grouped information like: sum, avg, min, max etc.

38

Thursday, May 17, 2012

Page 39: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

ToParentBlockJoinCollector

Joining

‣ TopGroups contains a group per top N parent document.

‣ Each group contains a parent and child documents.

39

Thursday, May 17, 2012

Page 40: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Groups & facet counts

Result grouping

‣ Faceting and result grouping are different features.

‣ But are often used together!

‣ Facet counts can be based on:

‣ Found documents.

‣ Found groups.

‣ Combination of facet value and group.

‣ All options are supported in Solr.40

Thursday, May 17, 2012

Page 41: Grouping and Joining in Lucene/Solr

Searchworkings.org - The online search community

Doc values

Result grouping

‣ Doc values / Column Stride values

‣ Prevents the creation of expensive data structures in FieldCache.

‣ Inverted index is meant for free text search.

‣ All grouping collectors have doc values based implementations!

41

Thursday, May 17, 2012