Searchworkings.org - The online search community
Overview
Grouping & Joining
‣ Background
‣ Joining
‣ Result grouping
‣ Conclusion
2
Thursday, May 17, 2012
Searchworkings.org - The online search community
Lucene’s model
Background
‣ Lucene is document based.
‣ Lucene doesn’t store information about relations between documents.
‣ Data often holds relations.
‣ Good free text search over relational data.
3
Thursday, May 17, 2012
Searchworkings.org - The online search community
Example
Background
‣ Product
‣ Name
‣ Description
‣ Product-item
‣ Color
‣ Size
‣ Price
‣ Goal: Show the most applicable product based on product-item criteria.
4
Thursday, May 17, 2012
Searchworkings.org - The online search community
Common Lucene solutions
Background
‣ Compound documents.
‣May result in documents with many fields.
‣ Subsequent searches.
‣May cause a lot network overhead.
‣ Non Lucene based approach:
‣ If free text search isn’t very important use a relational database.
5
Thursday, May 17, 2012
Searchworkings.org - The online search community
Example domain
Background
‣ Compound Product & Product-items document.
‣ Each product-item has its own field prefix.
6
Thursday, May 17, 2012
Searchworkings.org - The online search community
Different solutions
Background
‣ Lucene offers solutions to have a 'relational' like search.
‣ Parent child
‣ Grouping & joining aren't naturally supported.
‣ All the solutions do increase the search time.
‣ Some scenarios grouping and joining isn't the right solution.
7
Thursday, May 17, 2012
Joining
Modelling relations
Thursday, May 17, 2012
Searchworkings.org - The online search community
Introduction
Joining
‣ Support for parent child like search from Lucene 3.4
‣ Not a SQL join.
‣ The parent and each children are stored as documents.
‣ Two types:
‣ Index time join
‣ Query time join
9
Thursday, May 17, 2012
Searchworkings.org - The online search community
Index time join
Joining
‣ Two block join queries:
‣ ToParentBlockJoinQuery
‣ ToChildBlockJoinQuery
‣ One Lucene collector:
‣ ToParentBlockJoinCollector
‣ Index time join requires block indexing.
10
Thursday, May 17, 2012
Searchworkings.org - The online search community
Block indexing
Joining
‣ Atomically adding documents.
‣ A block of documents.
‣ Each document gets sequentially assigned Lucene document id.
‣ IndexWriter#addDocuments(docs);
11
Thursday, May 17, 2012
Searchworkings.org - The online search community
Block indexing
Joining
‣ Index doesn't record blocks.
‣ Segment merging doesn’t re-order documents in a segment.
‣ App is responsible for identifying block documents.
‣Marking the last document in a block.
‣ Adding a document to a block requires you to reindex the whole block.
‣ Removing a document from a block doesn’t requires reindexing a block.
12
Thursday, May 17, 2012
Searchworkings.org - The online search community
Example domain
Joining
‣ Parent is the last document in a block.
13
Thursday, May 17, 2012
Searchworkings.org - The online search community
Block indexing
Joining
14
Marking parent documents
Thursday, May 17, 2012
Searchworkings.org - The online search community
Block indexing
Joining
15
Add block
Add block
Thursday, May 17, 2012
Searchworkings.org - The online search community
‣ Parent filter marks the parent documents.
‣ Child query is executed in the parent space.
‣ ToChildBlockJoinQuery works in the opposite direction.
ToParentBlockJoinQuery
Joining
16
Thursday, May 17, 2012
Searchworkings.org - The online search community
Query time joining
Joining
‣ Query time joining is executed in two phases and is field based:
‣ fromField
‣ toField
‣ Doesn’t require block indexing.
17
Thursday, May 17, 2012
Searchworkings.org - The online search community
Query time joining
Joining
‣ First phase collects all the terms in the fromField for the documents that match with the original query.
‣ Currently doesn’t take the score from original query into account.
‣ The second phase returns the documents that match with the collected terms from the previous phase in the toField.
‣ Two different implementations:
‣ JoinUtil - Lucene (≥ 3.6)
‣ Join query parser - Solr (trunk)18
Thursday, May 17, 2012
Searchworkings.org - The online search community
Query time joining - Indexing
Joining
19
Referrer the product id.
Thursday, May 17, 2012
Searchworkings.org - The online search community
Query time joining - Indexing
Joining
20
Thursday, May 17, 2012
Searchworkings.org - The online search community
Query time joining
Joining
21
‣ Result will contain one product.
‣ Possible to join over two indices.
Thursday, May 17, 2012
Searchworkings.org - The online search community
Final thoughts
Joining
‣ Joining module has good solutions to model parent child relations.
‣ Use block join if you care about scoring.
‣ Frequent updates can be problematic.
‣ Use query time join for parent child filtering.
‣ Query time join is slower than index time join.
‣Mostly a Lucene feature only.
‣ All code is annotated as experimental.22
Thursday, May 17, 2012
Result grouping
Previously known as Field Collapsing.
Thursday, May 17, 2012
Searchworkings.org - The online search community
Introduction
Result grouping
‣ Group matching documents that share a common property.
‣ Search hit represents a group.
‣ Facet counts & total hit count represent groups.
‣ Per group collect information
‣Most relevant document.
‣ Top three documents.
‣ Aggregated counts24
Thursday, May 17, 2012
Searchworkings.org - The online search community
Usages
Result grouping
‣ Group documents by a shared property
‣ Product-item by product id (Parent child)
‣ Collapse similar looking documents
‣ E.g. all results from the Wikipedia domains.
‣ Remove duplicates from the search result.
‣ Based on a field that contains a hash
25
Thursday, May 17, 2012
Searchworkings.org - The online search community
Example domain
Result grouping
‣ Each Product-item is a document, but includes the product data.
26
Thursday, May 17, 2012
Searchworkings.org - The online search community
Implementation
Result grouping
‣ Result grouping implemented with Lucene collectors.
‣Module in trunk and a contrib in 3.x versions.
‣ Two pass result grouping.
‣ Grouping by indexed field, function or doc values.
‣ Single pass result grouping.
‣ Requires block indexing.
27
Thursday, May 17, 2012
Searchworkings.org - The online search community
Two pass implementation
Result grouping
‣ First pass collects the top N groups.
‣ Per group: group value + sort value
‣ Second pass collects data for each top group.
‣ The top N documents per group.
‣ Possible other aggregated information.
‣ Second pass search ignores all documents outside topN groups.
28
Thursday, May 17, 2012
Searchworkings.org - The online search community
Result grouping - Indexing
Result grouping
29
Thursday, May 17, 2012
Searchworkings.org - The online search community
Result grouping - Searching
Result grouping
30
Thursday, May 17, 2012
Searchworkings.org - The online search community
Result grouping made easier
Result grouping
31
‣ GroupingSearch
‣ Solr
‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id
‣Many more options:
‣ http://wiki.apache.org/solr/FieldCollapsing
Thursday, May 17, 2012
Searchworkings.org - The online search community
Parent child result
Result grouping
‣ TopGroups - Equivalent to TopDocs.
‣ Hit count
‣ Group count
‣ Groups
‣ Top documents
‣ Facet and total count can represent groups instead of documents.
‣ But requires more query time.
32
Thursday, May 17, 2012
Conclusion
Compare...
Thursday, May 17, 2012
Searchworkings.org - The online search community
Compare the parent child solutions
Conclusion
‣ Result grouping
‣ + Distributed support & Parent child relation as hit.
‣ - Parent data duplication
‣ - Impact on query time
‣ Joining
‣ + Fast & no data duplication
‣ - Index time join not optimal for updates
‣ - Query time join is limited.34
Thursday, May 17, 2012
Searchworkings.org - The online search community
Compare the parent child solutions
Conclusion
‣ Compound documents.
‣ + Fast and works out-of-the box with all features.
‣ - Not flexible when it comes to updates.
‣ - Document granularity is set in stone.
35
Thursday, May 17, 2012
36
Any questions?
Thursday, May 17, 2012
Extra slides
We have time left!
Thursday, May 17, 2012
Searchworkings.org - The online search community
Future work
Conclusion
‣ Higher level parent-child API.
‣ Needs to cover search & indexing.
‣ Joining
‣ Distributed support.
‣ Represent a hit as a parent child relation in the search result.
‣ Result grouping
‣ Aggregated grouped information like: sum, avg, min, max etc.
38
Thursday, May 17, 2012
Searchworkings.org - The online search community
ToParentBlockJoinCollector
Joining
‣ TopGroups contains a group per top N parent document.
‣ Each group contains a parent and child documents.
39
Thursday, May 17, 2012
Searchworkings.org - The online search community
Groups & facet counts
Result grouping
‣ Faceting and result grouping are different features.
‣ But are often used together!
‣ Facet counts can be based on:
‣ Found documents.
‣ Found groups.
‣ Combination of facet value and group.
‣ All options are supported in Solr.40
Thursday, May 17, 2012
Searchworkings.org - The online search community
Doc values
Result grouping
‣ Doc values / Column Stride values
‣ Prevents the creation of expensive data structures in FieldCache.
‣ Inverted index is meant for free text search.
‣ All grouping collectors have doc values based implementations!
41
Thursday, May 17, 2012
Recommended