Numeric Range Queries in Lucene and Solr

DESCRIPTION

This presentation covers the core Lucene/Solr internals used in numeric range queries. It includes several examples and briefly explains the algorithm discovered by Uwe.

Page 1: Numeric Range Queries in Lucene and Solr

Numeric Range Queries in Lucene and Solr

[email protected]

Page 2: Numeric Range Queries in Lucene and Solr

Agenda:

● What is RangeQuery

● Which field type to use for Numerics

● Range stuff under the hood (run!)

● NumericRangeQuery

● Useful links

Page 4: Numeric Range Queries in Lucene and Solr

Range Queries:

A range query is a type of query that matches all documents where some value lies between a lower and an upper boundary.

Give me:
● Jeans with a price from 200 to 300$
● A car with a length from 5 to 10 m
● ...

Page 5: Numeric Range Queries in Lucene and Solr

Range Queries:

In Solr, a range query is as simple as:

q = field:[100 TO 200]

We will talk about numeric range queries, but you can use range queries for text too:

q = field:[A TO Z]

Page 6: Numeric Range Queries in Lucene and Solr

Agenda:

● What is RangeQuery

● Which field type to use for Numerics

● Range stuff under the hood (run!)

● NumericRangeQuery

● Useful links (relax)

Page 7: Numeric Range Queries in Lucene and Solr

Which field type?

Which field type should we use for “range” fields in the schema (let’s stick with int)?

● solr.IntField
● or maybe solr.SortableIntField
● or maybe solr.TrieIntField

Page 8: Numeric Range Queries in Lucene and Solr

Which field type?

Let’s assume we have:
● 11 documents, id: 1, 2, 3, ..., 11
● each doc has a single-valued “int” price field
● each document’s id is the same as its price
● q = *:*

"numFound": 11,
"docs": [
  { "id": 1, "price_field": 1 },
  { "id": 2, "price_field": 2 },
  ...
  { "id": 11, "price_field": 11 }
]

Page 9: Numeric Range Queries in Lucene and Solr

Which field type - solr.IntField

q = price_field:[1 TO 10]

Page 10: Numeric Range Queries in Lucene and Solr

Which field type - solr.IntField

q = price_field:[1 TO 10]

"numFound": 2, "start": 0, "docs": [ { "price_field": 1 }, { "price_field": 10 } ]

Page 11: Numeric Range Queries in Lucene and Solr
Page 12: Numeric Range Queries in Lucene and Solr

Which field type - solr.IntField

solr.IntField stores and indexes the text value verbatim and hence doesn’t correctly support range queries, since the lexicographic ordering isn’t equal to the numeric ordering. In term order, only the bracketed values fall inside [1 TO 10]:

[1,10],11,2,3,4,5,6,7,8,9

Interestingly, “sort by” works fine: a clever comparator knows that the values are ints!
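
The ordering problem can be reproduced without Solr. The following is a minimal sketch in plain Java, assuming the terms are compared as ordinary strings; the class and method names are made up for illustration and are not part of Lucene or Solr.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class LexicographicRangeDemo {
    // Decimal strings for from..to, sorted as text (the way verbatim terms sort).
    static List<String> sortedTerms(int from, int to) {
        List<String> terms = new ArrayList<>();
        for (int i = from; i <= to; i++) {
            terms.add(Integer.toString(i));
        }
        Collections.sort(terms);
        return terms;
    }

    // Terms whose text lies in [lower, upper] when compared as strings.
    static List<String> stringRange(List<String> terms, String lower, String upper) {
        List<String> hits = new ArrayList<>();
        for (String t : terms) {
            if (t.compareTo(lower) >= 0 && t.compareTo(upper) <= 0) {
                hits.add(t);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = sortedTerms(1, 11);
        System.out.println(terms);                          // [1, 10, 11, 2, 3, 4, 5, 6, 7, 8, 9]
        System.out.println(stringRange(terms, "1", "10"));  // [1, 10]
    }
}
```

Only "1" and "10" sort between "1" and "10" as text, which matches the numFound of 2 seen for solr.IntField.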

Page 13: Numeric Range Queries in Lucene and Solr

Which field type - solr.SortableIntField

● q = price_field:[1 TO 10]
  ○ "numFound": 10
● “Sortable”, in fact, refers to the notion of making the numbers have a correctly sorted order. It’s not actually about “sort by”!
● Processed and compared as strings, using a tricky string encoding: NumberUtils.int2sortableStr(...)
● Deprecated and will be removed in 5.X
● What should I use then?
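
To see why a string encoding can make text order agree with numeric order, here is a small sketch. It does NOT reproduce the actual format of NumberUtils.int2sortableStr; the encoding below (flip the sign bit, print 8 fixed-width hex digits) is a hypothetical stand-in chosen only to illustrate the idea.

```java
final class SortableEncodingSketch {
    // Hypothetical encoding, not Solr's: flip the sign bit so negative values
    // sort first, then print 8 zero-padded hex digits so every term has equal width.
    static String encode(int value) {
        return String.format("%08x", value ^ 0x80000000);
    }

    // Inverse of encode(): parse the hex digits and flip the sign bit back.
    static int decode(String term) {
        return (int) Long.parseLong(term, 16) ^ 0x80000000;
    }

    public static void main(String[] args) {
        System.out.println(encode(2));   // 80000002
        System.out.println(encode(10));  // 8000000a
        // As text, "80000002" sorts before "8000000a", just like 2 < 10.
        System.out.println(encode(2).compareTo(encode(10)) < 0); // true
    }
}
```

With such an encoding a plain string range over the terms for [1 TO 10] matches all ten documents, at the cost of comparing every value as a string.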

Page 14: Numeric Range Queries in Lucene and Solr

Which field type - solr.TrieIntField

● q = price_field:[1 TO 10]
  ○ "numFound": 10

● Recommended as replacement for IntField and SortableIntField in javadoc

● Default for primitive fields in reference schema

● Said to be fast for range queries (actually depends on precision step)

● Tricky and, btw wtf is precision step?

Page 15: Numeric Range Queries in Lucene and Solr

Agenda:

● What is RangeQuery

● Which field type to use for Numerics

● Range stuff under the hood (run!)

● NumericRangeQuery

● Useful links

Page 16: Numeric Range Queries in Lucene and Solr

Under the hood - Index

Page 17: Numeric Range Queries in Lucene and Solr

Under the hood - Index

NumericTokenStream is where half of the magic happens!
● precision step = 1
● value = 11
● Let’s see how it will be indexed!

00000000 00000000 00000000 00001011

Page 18: Numeric Range Queries in Lucene and Solr

Under the hood - Index

Field with precisionStep=1

Page 19: Numeric Range Queries in Lucene and Solr

Under the hood - Index

shift=0   00001011   11
shift=1   00001010   10 = 5 << 1
shift=2   00001000   8 = 2 << 2
shift=3   00001000   8 = 1 << 3
shift=4   00000000   0 = 0 << 4
shift=5   00000000   0 = 0 << 5
continue…

Page 20: Numeric Range Queries in Lucene and Solr

Under the hood - Index

11111111 11111111 11111111 11111111

The algorithm requires indexing all 32/precisionStep terms.

So, for “11” we have 11, 10, 8, 8, 0, 0, 0, 0, 0, …, 0

How many terms for an integer?
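
The list of terms for “11” can be reproduced with a few lines of Java. This is a sketch of the arithmetic only, assuming a non-negative 32-bit value; the real NumericTokenStream also encodes the shift into each term (so 8 at shift=2 and 8 at shift=3 are distinct terms), which this sketch leaves out. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TriePrefixTerms {
    // For each level, drop the lowest `shift` bits: (value >>> shift) << shift.
    static List<Integer> prefixTerms(int value, int precisionStep) {
        List<Integer> terms = new ArrayList<>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            terms.add((value >>> shift) << shift);
        }
        return terms;
    }

    public static void main(String[] args) {
        List<Integer> terms = prefixTerms(11, 1);
        System.out.println(terms.subList(0, 6)); // [11, 10, 8, 8, 0, 0]
        System.out.println(terms.size());        // 32 terms for precisionStep = 1
        System.out.println(prefixTerms(11, 8).size()); // 4 terms for precisionStep = 8
    }
}
```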

Page 21: Numeric Range Queries in Lucene and Solr

Under the hood - Index

Okay! We indexed 32 tokens for the field. (TermDictionary! Postings!) Where is the trick?

Stay tuned!

Page 22: Numeric Range Queries in Lucene and Solr

Under the hood - Query

Page 23: Numeric Range Queries in Lucene and Solr

Under the hood - Query

Sub-classes of FieldType could override #getRangeQuery(...) to provide their own range query implementation.

If not, then you will likely get:

MultiTermQuery rangeQuery = TermRangeQuery.newStringRange(...)

TrieField overrides it. And here comes...

Page 24: Numeric Range Queries in Lucene and Solr

Agenda:

● What is RangeQuery

● Which field type to use for Numerics

● Range stuff under the hood (run!)

● NumericRangeQuery

● Useful links

Page 25: Numeric Range Queries in Lucene and Solr

Numeric Range Query (Decimal)

● Decimal example, precisionStep = ten
● q = price:[423 TO 642]

Page 26: Numeric Range Queries in Lucene and Solr

Numeric Range Query (Binary)

● precisionStep = 1
● q = price:[3 TO 12]

shift=0:  0 1 2 3 4 5 6 7 8 9 10 11 12 13

Page 27: Numeric Range Queries in Lucene and Solr

Numeric Range Query (Binary)

● precisionStep = 1
● q = price:[3 TO 12]

shift=0:  0 1 2 3 4 5 6 7 8 9 10 11 12 13
shift=1:  0 1 2 3 4 5 6

Page 28: Numeric Range Queries in Lucene and Solr

Numeric Range Query (Binary)

● precisionStep = 1
● q = price:[3 TO 12]

shift=0:  0 1 2 3 4 5 6 7 8 9 10 11 12 13
shift=1:  0 1 2 3 4 5 6
shift=2:  0 1 2 3
shift=3:  0 1
shift=4:  0
shift=5:  0
shift=6:  0
...

Page 29: Numeric Range Queries in Lucene and Solr

Numeric Range Query (Binary)

● precisionStep = 1
● q = price:[3 TO 12]

shift=0:  0 1 2 3 4 5 6 7 8 9 10 11 12 13
shift=1:  0 1 2 3 4 5 6
shift=2:  0 1 2 3
shift=3:  0 1
shift=4:  0

Page 30: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

So, the question is: how do we create a query for the algorithm?

Page 31: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

Let’s come back to TrieField#getRangeQuery(...)

There are several options:
● field is multiValued, hasDocValues, not indexed
  ○ super#getRangeQuery
● field hasDocValues, not indexed
  ○ new ConstantScoreQuery(FieldCacheRangeFilter.newIntRange(...))
● otherwise, ta-da
  ○ NumericRangeQuery.newIntRange(...)

Page 32: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

NumericRangeQuery extends MultiTermQuery, which is:

An abstract Query that matches documents containing a subset of terms provided by a FilteredTermsEnum enumeration.

This query cannot be used directly (it is abstract); you must subclass it and define getTermsEnum(Terms, AttributeSource) to provide a FilteredTermsEnum that iterates through the terms to be matched.

Page 33: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

Let’s understand how #getTermsEnum works. It returns new NumericRangeTermsEnum(...)

The main part is: NumericUtils.splitIntRange(...)

Page 34: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

The algorithm relies heavily on bit masks:

for (int shift = 0; noRanges(); shift += precisionStep):
    diff = 1L << (shift + precisionStep);
    mask = ((1L << precisionStep) - 1L) << shift;

diff is the distance between neighboring nodes on the upper level (diff = 2 for shift = 0).
mask checks whether a node on the current level has lower or upper nodes to collect (nodes 1 and 3 give hasLower, nodes 0 and 2 give hasUpper).

shift=0:  0 1 2 3
shift=1:  0 1

Page 35: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

hasLower = (minBound & mask) != 0L;
hasUpper = (maxBound & mask) != mask;

if (hasLower)
    addRange(builder, valSize, minBound, minBound | mask, shift);

if (hasUpper)
    addRange(builder, valSize, maxBound & ~mask, maxBound, shift);

Page 36: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

hasLower = (minBound & mask) != 0L;
hasUpper = (maxBound & mask) != mask;

nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask;
nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;

Page 37: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

// If we are in the lowest precision or the next precision is not available.
addRange(builder, valSize, minBound, maxBound, shift);
// exit the split recursion loop (FOR)
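
The fragments from the last few slides can be assembled into one runnable loop. The following is a sketch modeled on those fragments, not a copy of NumericUtils: the SubRange class, the exact exit condition, and the restriction to non-negative values are assumptions made for illustration. Each sub-range is reported as prefix values (bound >>> shift) on its level.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitRangeSketch {
    // One range of prefix terms on a given trie level (hypothetical helper type).
    static final class SubRange {
        final long min;
        final long max;
        final int shift;

        SubRange(long min, long max, int shift) {
            this.min = min;
            this.max = max;
            this.shift = shift;
        }

        @Override
        public String toString() {
            return min + ".." + max + "@shift" + shift;
        }
    }

    // Splits [minBound, maxBound] into per-level sub-ranges; assumes non-negative bounds.
    static List<SubRange> split(long minBound, long maxBound, int precisionStep, int valSize) {
        List<SubRange> out = new ArrayList<>();
        if (minBound > maxBound) {
            return out;
        }
        for (int shift = 0; ; shift += precisionStep) {
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift;
            boolean hasLower = (minBound & mask) != 0L;
            boolean hasUpper = (maxBound & mask) != mask;
            long nextMin = (hasLower ? minBound + diff : minBound) & ~mask;
            long nextMax = (hasUpper ? maxBound - diff : maxBound) & ~mask;
            // Assumed exit test: no higher level left, or the inner range is empty or wrapped.
            boolean lastLevel = shift + precisionStep >= valSize
                    || nextMin > nextMax
                    || nextMin < minBound
                    || nextMax > maxBound;
            if (lastLevel) {
                out.add(new SubRange(minBound >>> shift, maxBound >>> shift, shift));
                break;
            }
            if (hasLower) {
                out.add(new SubRange(minBound >>> shift, (minBound | mask) >>> shift, shift));
            }
            if (hasUpper) {
                out.add(new SubRange((maxBound & ~mask) >>> shift, maxBound >>> shift, shift));
            }
            minBound = nextMin;
            maxBound = nextMax;
        }
        return out;
    }

    public static void main(String[] args) {
        // q = price:[3 TO 12], precisionStep = 1, 32-bit values
        System.out.println(split(3, 12, 1, 32)); // [3..3@shift0, 12..12@shift0, 1..2@shift2]
    }
}
```

For [3 TO 12] the sketch yields the terms 3 and 12 on the lowest level plus the prefixes 1 and 2 at shift = 2, which cover 4..7 and 8..11.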

Page 38: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

● shift = 0
● diff = 0b00000010 = 2
● mask = 0b00000001 = 1
● hasLower = ((3 & 1) != 0) = true
● hasUpper = ((12 & 1) != 1) = true
  ○ addRange 3..(3 | 1) = 3..3
  ○ addRange (12 & ~1)..12 = 12..12
● nextMin = (3 + 2) & ~1 = 4
● nextMax = (12 - 2) & ~1 = 10

shift=0:  0 1 2 3 4 5 6 7 8 9 10 11 12 13

Page 39: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

shift=0:  0 1 2 3 4 5 6 7 8 9 10 11 12 13
shift=1:  0 1 2 3 4 5 6

● min:4; max:10
● shift = 1
● diff = 0b00000100 = 4
● mask = 0b00000010 = 2
● hasLower = ((4 & 2) != 0) = false
● hasUpper = ((10 & 2) != 2) = false
● nextMin = 4 & ~2 = 4 (= min)
● nextMax = 10 & ~2 = 8

Page 40: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

shift=0:  0 1 2 3 4 5 6 7 8 9 10 11 12 13
shift=1:  0 1 2 3 4 5 6
shift=2:  0 1 2 3

● min:4; max:8
● shift = 2
● diff = 0b00001000 = 8
● mask = 0b00000100 = 4
● hasLower = ((4 & 4) != 0) = true
● hasUpper = ((8 & 4) != 4) = true
● nextMin = (4 + 8) & ~4 = 8
● nextMax = (8 - 8) & ~4 = 0
● nextMin > nextMax => END: add range 1..2 at shift = 2

Page 41: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

TestNumericUtils#testSplitIntRange

assertIntRangeSplit(lower, upper, precisionStep, expectBounds, shifts)

assertIntRangeSplit(3, 12, 1, true, Arrays.asList(
    -2147483645, -2147483645, // 3,3
    -2147483636, -2147483636, // 12,12
    536870913, 536870914),    // 1, 2 for shift == 2
    Arrays.asList(0, 0, 2));
// Crappy unsigned int conversions are done in the asserts
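
The odd-looking expected bounds are consistent with flipping the sign bit of each value (so that signed ints order like unsigned ones) and then shifting. A small sketch of that arithmetic, assuming this is the conversion the comment refers to; the helper names are hypothetical:

```java
final class SortableIntBounds {
    // Flip the sign bit (assumed conversion behind the expected values in the test).
    static int flip(int value) {
        return value ^ 0x80000000;
    }

    // Prefix of the flipped value on the level given by `shift`.
    static int prefix(int value, int shift) {
        return flip(value) >>> shift;
    }

    public static void main(String[] args) {
        System.out.println(flip(3));       // -2147483645
        System.out.println(flip(12));      // -2147483636
        System.out.println(prefix(4, 2));  // 536870913, i.e. node 1 at shift == 2
        System.out.println(prefix(8, 2));  // 536870914, i.e. node 2 at shift == 2
    }
}
```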

Page 42: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

So, NumericRangeTermsEnum generates and remembers all ranges to match.

Page 43: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

Basically, a TermsEnum is an iterator to seek or step through terms in some order. In our case the order is:

(shift = 0):       0 1 2 3 4 5 6 7 8 9 10 11 12 13
Then (shift = 1):  0 1 2 3 4 5 6
Then (shift = 2):  0 1 2 3
...

Page 44: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

Actually we have a FilteredTermsEnum:

1. Only the red terms (highlighted on the original slide) are accepted by our enumerator
2. If a term is not accepted, we advance:
   FilteredTermsEnum#nextSeekTerm(currentTerm)
   TermsEnum#seekCeil(termToSeek)

The seek term depends on currentTerm and the generated ranges.
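
The seek-and-accept behaviour can be emulated with a sorted set. This is only an analogy under stated assumptions: a TreeSet of ints stands in for the term dictionary of a single level, ceiling() plays the role of seekCeil, higher() plays the role of next(), and the ranges are assumed to be sorted and non-overlapping. None of the names below are Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class SeekingEnumSketch {
    // Returns the indexed terms that fall inside any of the {lo, hi} ranges,
    // jumping straight to each range start instead of scanning every term.
    static List<Integer> accepted(TreeSet<Integer> indexed, int[][] ranges) {
        List<Integer> hits = new ArrayList<>();
        for (int[] range : ranges) {
            Integer term = indexed.ceiling(range[0]);   // emulates seekCeil(rangeStart)
            while (term != null && term <= range[1]) {  // term accepted
                hits.add(term);
                term = indexed.higher(term);            // emulates next()
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<Integer> indexed = new TreeSet<>();
        for (int price = 1; price <= 11; price++) {
            indexed.add(price); // lowest-level terms of the 11 example documents
        }
        // Lowest-level ranges for q = price:[3 TO 12]: 3..3 and 12..12
        System.out.println(accepted(indexed, new int[][] {{3, 3}, {12, 12}})); // [3]
    }
}
```

Only 3 is accepted on this level because no document has price 12; the remaining matches come from the prefix terms on the higher level.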

Page 45: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

OK, now we have a TermsEnum for the MultiTermQuery, and the enum is able to seek through only those terms that match the appropriate sub-ranges.

The question is how to convert TermsEnum to Query!?

Page 46: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

The last trick is the rewrite() method of MultiTermQuery (rewrite is always called on a query before performing a search):

public final Query rewrite(IndexReader reader) {
    return rewriteMethod.rewrite(reader, this);
}

Oh, “rewriteMethod”, how interesting… It defines how the query is rewritten.

Page 47: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

There are plenty of different rewrite methods, but the most interesting for us are:
● CONSTANT_SCORE_*
  ○ BOOLEAN_QUERY_REWRITE
  ○ FILTER_REWRITE
  ○ AUTO_REWRITE_DEFAULT

Page 48: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

BOOLEAN_QUERY_REWRITE

1. Collect terms (TermCollector) by using #getTermsEnum(...)
2. For each term, create a TermQuery
3. Return a BooleanQuery with all TermQuery instances as leaves

Page 49: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

FILTER_REWRITE

1. Get termsEnum by using #getTermsEnum(...)
2. Create a FixedBitSet
3. Get a DocsEnum for each term
4. Iterate over docs and bitSet.set(docid);
5. Return a ConstantScoreQuery over the filter (bitSet)
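
The five steps above can be sketched with standard-library types. This is an illustration only: java.util.BitSet stands in for FixedBitSet, a map from term to doc ids stands in for the postings, and the list of matched terms stands in for the terms enum. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FilterRewriteSketch {
    // Sets one bit per document that contains at least one of the matched terms.
    static BitSet collect(Map<Integer, int[]> postings, List<Integer> matchedTerms, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (Integer term : matchedTerms) {
            int[] docs = postings.get(term);
            if (docs == null) {
                continue; // term matched by the range but not present in the index
            }
            for (int doc : docs) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> postings = new HashMap<>();
        postings.put(3, new int[] {2});
        postings.put(4, new int[] {5, 7});
        postings.put(9, new int[] {1});
        BitSet bits = collect(postings, Arrays.asList(3, 4), 10);
        System.out.println(bits); // {2, 5, 7}
    }
}
```

Every matching document gets the same constant score, so no per-term scoring work is needed; the bit set is the whole result.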

Page 50: Numeric Range Queries in Lucene and Solr

Numeric Range Query (How?)

AUTO_REWRITE_DEFAULT

If the number of documents to be visited in the postings exceeds some percentage of maxDoc() for the index, then FILTER_REWRITE is used; otherwise BOOLEAN_QUERY_REWRITE is used.
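
The decision rule can be written down in a few lines. This is a sketch of the rule as stated on the slide; the percentage is passed in as a parameter because the slide does not give the value, and the enum and method names are hypothetical.

```java
final class AutoRewriteSketch {
    enum Rewrite { BOOLEAN_QUERY_REWRITE, FILTER_REWRITE }

    // docCountPercent is a caller-supplied threshold (an assumption, not Lucene's default).
    static Rewrite choose(long docsToVisit, int maxDoc, double docCountPercent) {
        double cutoff = maxDoc * docCountPercent / 100.0;
        return docsToVisit > cutoff ? Rewrite.FILTER_REWRITE : Rewrite.BOOLEAN_QUERY_REWRITE;
    }

    public static void main(String[] args) {
        // With a 0.1% threshold on a 500k-document index the cutoff is 500 documents.
        System.out.println(choose(10, 500_000, 0.1));    // BOOLEAN_QUERY_REWRITE
        System.out.println(choose(1_000, 500_000, 0.1)); // FILTER_REWRITE
    }
}
```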

Page 51: Numeric Range Queries in Lucene and Solr

Agenda:

● ..

● I promised. Precision Step!

● ...

Page 52: Numeric Range Queries in Lucene and Solr

Precision step

So, what is the precision step and how does it affect performance?

● Defines how many terms to index for each value
  ○ Lower step values mean more precisions and consequently more terms in the index
  ○ indexedTermsPerValue = bitsPerVal / pStep
  ○ Lower-precision terms are non-unique, so the term dictionary doesn’t grow much; however, the postings file does
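
The terms-per-value formula is easy to tabulate. In this sketch the division is rounded up, which is an assumption for steps that do not divide the bit width evenly (shifts 0, 6, ..., 30 give 6 terms for a 32-bit value with step 6); the class name is hypothetical.

```java
final class PrecisionStepMath {
    // indexedTermsPerValue = ceil(bitsPerValue / precisionStep)
    static int indexedTermsPerValue(int bitsPerValue, int precisionStep) {
        return (bitsPerValue + precisionStep - 1) / precisionStep;
    }

    public static void main(String[] args) {
        for (int step : new int[] {1, 4, 6, 8}) {
            System.out.println("int, precisionStep=" + step + ": "
                    + indexedTermsPerValue(32, step) + " terms per value");
        }
    }
}
```

For a 32-bit int this gives 32, 8, 6 and 4 terms per value for steps 1, 4, 6 and 8.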

Page 53: Numeric Range Queries in Lucene and Solr

Precision step

So, what is the precision step and how does it affect performance?

● ...
  ○ A smaller precision step means fewer terms to match, which optimizes query speed
  ○ But more terms to seek in the index
  ○ You can index with a lower precision step value and test search speed using a multiple of the original step value
  ○ The ideal step can only be found by testing

Page 54: Numeric Range Queries in Lucene and Solr

Precision step (Results)

According to the NumericRangeQuery javadoc:

● Opteron64 machine, Java 1.5, 8 bit precision step
● 500k docs index
● TermRangeQuery in BooleanRewriteMode took about 30-40 seconds
● TermRangeQuery in FilterRewriteMode took about 5 seconds
● NumericRangeQuery took < 100ms

Page 55: Numeric Range Queries in Lucene and Solr

Agenda:

● What is RangeQuery

● Which field type to use for Numerics

● Range stuff under the hood (run!)

● NumericRangeQuery

● Useful links

Page 57: Numeric Range Queries in Lucene and Solr