Transcript
Page 1: An Introduction to Basics of Search and Relevancy with Apache Solr

Introduction to basics of Search and Relevancy with Apache Solr

FEATURING: Mark Bennett, CTO

Page 2: An Introduction to Basics of Search and Relevancy with Apache Solr

Lucid Imagination, Inc. 12/2/2009

Agenda

• Prerequisites: Browser Tricks

• Web “Command Line”

• The DisMax Parser

• Boosting Formula

• Explaining “Explain”

• Check Your Index!

• Q & A

• Resources / About NIE

Page 3: An Introduction to Basics of Search and Relevancy with Apache Solr

Prerequisite: Some Browser Tricks

Page 4: An Introduction to Basics of Search and Relevancy with Apache Solr

Browsers Matter – install them all!

Firefox:

• Default XML Rendering

• (also some versions of IE)

• Lots of Plugins

IE and Safari:

• Better “Explain” copy & paste

• maintains line breaks

• Better table copy and paste

Page 5: An Introduction to Basics of Search and Relevancy with Apache Solr

Larger Firefox “Command Line”

Customize the Firefox URL box as a command line in 3 easy steps

1. Toolbar: Right Click

2. Customize… Add New Toolbar

3. URL bar -> click and drag onto the new toolbar

Page 6: An Introduction to Basics of Search and Relevancy with Apache Solr

Turn off Solr HTTP Caching

• Change in solrconfig.xml

• Disable the http304 section

• Turn it back on before you deploy!

Page 7: An Introduction to Basics of Search and Relevancy with Apache Solr

Understanding Solr’s “Web Command Line”

Page 8: An Introduction to Basics of Search and Relevancy with Apache Solr

The “Web Command Line”

CLI CONCEPT                  SOLR EQUIVALENT
• Command Prompt             URL bar
• -o or --foo bar            ? or & and =
• (spaces)                   +
• some punctuation           %nn
• output                     XML or HTML
• Command line “adapter”     Curl

• Script files can call URLs

• Not built into Windows – try cygwin
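The mapping above can be tried out in a few lines of Python. This is a minimal sketch using only the standard library; the helper name `to_solr_url` and the query text are made up for illustration, and the base URL is the example one used elsewhere in these slides.

```python
from urllib.parse import urlencode

# Example base URL from the slides; adjust host, port and path for your install.
BASE_URL = "http://localhost:8983/solr/select"

def to_solr_url(params):
    """Turn a dict of parameters into a Solr 'web command line'.

    urlencode joins the pairs with '&' and '=', turns spaces into '+',
    and percent-encodes other punctuation as %nn.
    """
    return BASE_URL + "?" + urlencode(params)

# Hypothetical query, chosen only because it contains spaces and an '&'.
url = to_solr_url({"q": "search & relevancy"})
print(url)
```

Pasting the printed URL into the browser's URL bar is the equivalent of typing a command at a prompt.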

Page 9: An Introduction to Basics of Search and Relevancy with Apache Solr

Solr “Command Line”

• Typical Base URL

• http://localhost:8983/solr/select?...

• Basic Input (not counting dismax)

• q = query, fq = filter query

• df = default field

• qt = query type (standard / dismax)

• Controlling Output (lots more!!!)

• debugQuery = true

• wt = “what type” (actually “writer type”)

• standard/XML, xslt (with tr=), javabin, json…

• fl = *,score (which fields)
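As a rough illustration of how these parameters combine, the sketch below assembles a select URL and then parses it back to show what Solr would receive. The function name `select_url` is hypothetical, and the particular choice of `wt=json` is just one of the writer types listed above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def select_url(q, base="http://localhost:8983/solr/select", **options):
    """Build a select URL from a query string plus optional Solr parameters."""
    params = {"q": q}
    params.update(options)
    return base + "?" + urlencode(params)

# q plus three of the output-control parameters from the slide.
url = select_url("solr", fl="*,score", wt="json", debugQuery="true")
print(url)

# Decode the query string again to see the parameters as name/value pairs.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```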

Page 10: An Introduction to Basics of Search and Relevancy with Apache Solr

Example: search for “solr”

http://localhost:8983/solr/select?q=solr&debugQuery=true

With Firefox you get XML output you can expand and collapse

With MSIE* and Safari, not so much

* Some versions

Page 11: An Introduction to Basics of Search and Relevancy with Apache Solr

Detailed Debug & Explain Output

http://localhost:8983/solr/select?q=solr&debugQuery=true

<str name="parsedquery">text:solr</str> …

<lst name="explain">

<str name="SOLR1000">

0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:

1.4142135 = tf(termFreq(text:solr)=2)

3.6026897 = idf(docFreq=1, numDocs=26)

0.125 = fieldNorm(field=text, doc=13)

</str>

</lst>
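The "product of" line can be checked by hand. The sketch below multiplies the three factors copied from the explain output above. It also checks that the tf value equals the square root of the term frequency, which is what one would expect if Lucene's default similarity is in use (an assumption; the slides do not name the similarity).

```python
import math

# Factors copied from the explain output for document SOLR1000.
tf = 1.4142135          # tf(termFreq(text:solr)=2)
idf = 3.6026897         # idf(docFreq=1, numDocs=26)
field_norm = 0.125      # fieldNorm(field=text, doc=13)

# fieldWeight is reported as the product of the three factors.
field_weight = tf * idf * field_norm
print(field_weight)

# Assumption: tf is the square root of the raw term frequency (here 2).
tf_from_term_freq = math.sqrt(2)
print(tf_from_term_freq)
```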

Page 12: An Introduction to Basics of Search and Relevancy with Apache Solr

A look at the DisMax query parser

Page 13: An Introduction to Basics of Search and Relevancy with Apache Solr

Solr DisMax: Defined

• What is it?

• Dis-joint text (Multiple fields); the name comes from Lucene’s DisjunctionMaxQuery

• Max-imum match (score)

• How do you get it?

• Configured in:

• solrconfig.xml and schema.xml

• Called with:

• qt=dismax

• Adjusted with:

• mm, bf, qf, pf, qs, ps, tie

Page 14: An Introduction to Basics of Search and Relevancy with Apache Solr

Solr DisMax: Pros and Cons

General Benefits

• Multiple Fields

• Multiple Relevancy Rules

• Great for Freshness / Popularity

Issues to be Aware of

• Tie-in between schema.xml & solrconfig.xml

• Trouble with some CJK (Chinese, Japanese, Korean)

• Limited wildcard / field / range support

• Difficult to customize and debug

• Trouble with shingles

• Understand mm!

Page 15: An Introduction to Basics of Search and Relevancy with Apache Solr

About the “dis” and the “max”

Distributed across multiple fields

• Break up the query into words

• Each word becomes a clause against each field

• Like an OR but with extra credit

Takes the Maximum of each set

• Word 1 had highest score in Title

• Word 2 very dense in the doc body

• Adds in a tie breaker if the word matches in multiple fields
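The "max plus tie breaker" idea can be sketched in a few lines. This is a simplified model, not Solr's implementation: the function name `dismax_score`, the field names and the per-field scores are all hypothetical, and adding the per-word results together ignores other factors Lucene applies.

```python
def dismax_score(field_scores, tie=0.0):
    """Score one query word across several fields, DisMax style.

    field_scores: scores of the word in each field where it matched.
    tie: tie-breaker weight; 0.0 gives a pure max, 1.0 gives a plain sum.
    """
    scores = list(field_scores)
    if not scores:
        return 0.0
    best = max(scores)
    others = sum(scores) - best
    return best + tie * others

# Hypothetical scores: word 1 scores best in the title, word 2 in the body.
word1 = {"title": 0.9, "body": 0.2}
word2 = {"title": 0.1, "body": 0.7}

total = dismax_score(word1.values(), tie=0.1) + dismax_score(word2.values(), tie=0.1)
print(total)
```

Each word contributes its best field score, plus a small fraction of what it scored in the other fields.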

Page 16: An Introduction to Basics of Search and Relevancy with Apache Solr

Coming soon: Extended DisMax

Improvements

• Flexible case Boolean ops: AND/and, OR/or

• Auto-escape punctuation & -> \&, etc.

• Improved Proximity Boosting (via word bigrams)

• Other changes in stop words, relevancy calc, URL arguments

How to get it

• Post 1.4 patch, planned for 1.5

• Details + Patch in JIRA: SOLR-1553

http://issues.apache.org/jira/browse/SOLR-1553

• TBD: change URL option qt=edismax (or qt=dismax)

Page 17: An Introduction to Basics of Search and Relevancy with Apache Solr

Boosting Formulas

Page 18: An Introduction to Basics of Search and Relevancy with Apache Solr

Boost Functions in Dismax

High Level Feature

• Numeric functions for scoring

• sum(), product(), sqrt(), log(), etc.

• Boost on recent dates, user popularity

Good Combination: Reverse-Ordinal & Reciprocal

• Position in index: ord(); the reverse position is rord()

• Larger y for smaller x: recip()

How to get it

• URL parameter bf = “boost function”

• Configured in solrconfig.xml

• See http://wiki.apache.org/solr/FunctionQuery
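To make ord() and rord() concrete, here is a small model of the idea in Python. It treats ord as the 1-based position of a value among the distinct sorted values of a field and rord as the same position counted from the other end. The function name and the sample values are hypothetical; Solr computes these positions over the indexed terms of the field.

```python
def ord_and_rord(values):
    """Map each distinct value to its (ord, rord) pair, both 1-based."""
    ordered = sorted(set(values))
    n = len(ordered)
    return {value: (i + 1, n - i) for i, value in enumerate(ordered)}

# Hypothetical popularity values for four documents.
ranks = ord_and_rord([7, 1, 10, 6])
for value, (position, reverse_position) in sorted(ranks.items()):
    print(value, position, reverse_position)
```

The largest value gets the highest ord and an rord of 1, which is why rord() pairs well with recip() when larger (or more recent) values should receive the bigger boost.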

Page 19: An Introduction to Basics of Search and Relevancy with Apache Solr

“Freshness”: Boosting Recent Dates

Linear: m x + c        Reciprocal: a / (m x + c)

Date         Position ord()   N-Position rord()   Linear (x,m,c)   recip(x,m,a,c)
1/1/2000     1                120                 1120             0.89286
2/1/2000     2                119                 1119             0.89366
3/1/2000     3                118                 1118             0.89445
…            …                …                   …                …
1/1/2005     61               60                  1060             0.94340
…            …                …                   …                …
1/1/2009     109              12                  1012             0.98814
2/1/2009     110              11                  1011             0.98912
3/1/2009     111              10                  1010             0.99010
4/1/2009     112              9                   1009             0.99108
5/1/2009     113              8                   1008             0.99206
6/1/2009     114              7                   1007             0.99305
7/1/2009     115              6                   1006             0.99404
8/1/2009     116              5                   1005             0.99502
9/1/2009     117              4                   1004             0.99602
10/1/2009    118              3                   1003             0.99701
11/1/2009    119              2                   1002             0.99800
12/1/2009    120              1                   1001             0.99900

WIKI EXAMPLE: recip( rord(creationDate), 1, 1000, 1000 )

slope m = 1

numerator a = 1000

intercept c = 1000 (aka "b")

[Chart: y-axis from 0.880 to 1.000]
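The table values can be reproduced with two one-line functions. This sketch assumes, as the table does, 120 monthly dates so that rord = 120 - ord + 1; the function names simply mirror the column headings.

```python
def linear(x, m, c):
    """Linear function m*x + c."""
    return m * x + c

def recip(x, m, a, c):
    """Reciprocal function a / (m*x + c): larger result for smaller x."""
    return a / (m * x + c)

N = 120  # number of monthly dates in the table (1/1/2000 to 12/1/2009)

# Recompute a few rows using the wiki example's m=1, a=1000, c=1000.
rows = []
for position in (1, 2, 3, 61, 109, 120):
    rord = N - position + 1
    rows.append((position, rord, linear(rord, 1, 1000), recip(rord, 1, 1000, 1000)))

for position, rord, lin, rec in rows:
    print(position, rord, lin, f"{rec:.5f}")
```

With these settings the oldest date still gets about 0.89 and the newest about 1.0, so the freshness boost is gentle; a smaller a and c (or a larger m) would spread the values further apart.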

Page 20: An Introduction to Basics of Search and Relevancy with Apache Solr

Sifting through Solr’s “Explain” output

Page 21: An Introduction to Basics of Search and Relevancy with Apache Solr

DisMax Example for “solr”

INPUT:

http://localhost:8983/solr/select?q=solr&debugQuery=true&qt=dismax

DEBUG OUTPUT: (1 OF 2)

<str name="parsedquery">

+DisjunctionMaxQuery((id:solr^10.0 | text:solr^0.5 | cat:solr^1.4 | manu:solr^1.1 | name:solr^1.2 | features:solr | sku:solr^1.5)~0.01) DisjunctionMaxQuery((manu_exact:solr^1.9 | features:solr^1.1 | text:solr^0.2 | manu:solr^1.4 | name:solr^1.5)~0.01) FunctionQuery((top(ord(popularity)))^0.5) FunctionQuery((1000.0/(1.0*float(top(rord(price)))+1000.0))^0.3)

</str>
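The parsed query shows which fields and boosts the dismax handler was configured with. As a sketch, the same settings could be passed explicitly on the URL. Reading the first DisjunctionMaxQuery as qf, the second as pf, and the ~0.01 suffix as tie is an interpretation of the debug output, not something the slide states.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://localhost:8983/solr/select"

# Boosts read off the two DisjunctionMaxQuery clauses in the parsed query.
# Assumption: the first clause corresponds to qf and the second to pf.
QF = "id^10.0 text^0.5 cat^1.4 manu^1.1 name^1.2 features sku^1.5"
PF = "manu_exact^1.9 features^1.1 text^0.2 manu^1.4 name^1.5"

params = {
    "q": "solr",
    "qt": "dismax",
    "debugQuery": "true",
    "qf": QF,
    "pf": PF,
    "tie": "0.01",  # assumed to be the ~0.01 suffix in the parsed query
}

dismax_url = BASE_URL + "?" + urlencode(params)
print(dismax_url)
```

Changing a boost in the URL and re-running with debugQuery=true is a quick way to see how the parsed query responds before touching solrconfig.xml.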

Page 22: An Introduction to Basics of Search and Relevancy with Apache Solr

DisMax explain output for a single word query

<lst name="explain"><str name="SOLR1000">
0.74609417 = (MATCH) sum of:
  0.4476144 = (MATCH) max plus 0.01 times others of:
    0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
      0.04119147 = queryWeight(text:solr^0.5), product of:
        0.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
      0.09885953 = queryWeight(name:solr^1.2), product of:
        1.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
    0.03710002 = (MATCH) weight(features:solr in 13), product of:
      0.08238294 = queryWeight(features:solr), product of:
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.44520026 = (MATCH) weight(sku:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(sku:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      3.6026897 = (MATCH) fieldWeight(sku:solr in 13), product of:
        1.0 = tf(termFreq(sku:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        1.0 = fieldNorm(field=sku, doc=13)
  0.22311316 = (MATCH) max plus 0.01 times others of:
    0.040810023 = (MATCH) weight(features:solr^1.1 in 13), product of:
      0.09062123 = queryWeight(features:solr^1.1), product of:
        1.1 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.01049347 = (MATCH) weight(text:solr^0.2 in 13), product of:
      0.016476588 = queryWeight(text:solr^0.2), product of:
        0.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.22260013 = (MATCH) weight(name:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(name:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
  0.06860119 = (MATCH) FunctionQuery(top(ord(popularity))), product of:
    6.0 = ord(popularity)=6
    0.5 = boost
    0.022867065 = queryNorm
  0.0067654043 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(price)))+1000.0)), product of:
    0.9861933 = 1000.0/(1.0*float(rord(price)=14)+1000.0)
    0.3 = boost
    0.022867065 = queryNorm
</str></lst>
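The top-level score is the sum of four clauses: two "max plus 0.01 times others" DisMax clauses and two boost functions. The sketch below recomputes the two function-query contributions from their factors and adds everything up, using only numbers copied from the explain output; the variable names are made up for readability.

```python
query_norm = 0.022867065

# The two "max plus 0.01 times others" clauses, as reported in the output.
first_dismax_clause = 0.4476144
second_dismax_clause = 0.22311316

# FunctionQuery(top(ord(popularity))): ord value * boost * queryNorm.
popularity_boost = 6.0 * 0.5 * query_norm

# FunctionQuery(1000.0/(1.0*rord(price)+1000.0)): recip value * boost * queryNorm.
price_recip = 1000.0 / (1.0 * 14 + 1000.0)
price_boost = price_recip * 0.3 * query_norm

total = first_dismax_clause + second_dismax_clause + popularity_boost + price_boost
print(popularity_boost, price_boost, total)
```

Seen this way, the two boost functions together contribute roughly a tenth of the final score for this document; most of it comes from the field matches.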

Page 23: An Introduction to Basics of Search and Relevancy with Apache Solr

“Explain” example:

...

0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:

0.04119147 = queryWeight(text:solr^0.5), product of:

0.5 = boost

3.6026897 = idf(docFreq=1, numDocs=26)

0.022867065 = queryNorm

0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:

1.4142135 = tf(termFreq(text:solr)=2)

3.6026897 = idf(docFreq=1, numDocs=26)

0.125 = fieldNorm(field=text, doc=13)

0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:

0.09885953 = queryWeight(name:solr^1.2), product of:

1.2 = boost

3.6026897 = idf(docFreq=1, numDocs=26)

0.022867065 = queryNorm

1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:

1.0 = tf(termFreq(name:solr)=1)

3.6026897 = idf(docFreq=1, numDocs=26)

0.5 = fieldNorm(field=name, doc=13)

0.03710002 = (MATCH) weight(features:solr in 13), product of:

0.08238294 = queryWeight(features:solr), product of:

3.6026897 = idf(docFreq=1, numDocs=26)

0.022867065 = queryNorm

0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:

1.0 = tf(termFreq(features:solr)=1)

3.6026897 = idf(docFreq=1, numDocs=26)

0.125 = fieldNorm(field=features, doc=13)

...

tf(termFreq(text:solr)=2)

idf(docFreq=1, numDocs=26)
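Each weight(...) entry in the excerpt above is a product of a queryWeight and a fieldWeight, and each queryWeight is itself boost × idf × queryNorm. A short check with the numbers copied from the excerpt (features has no explicit boost, so 1.0 is assumed):

```python
IDF = 3.6026897
QUERY_NORM = 0.022867065

# (field, boost, fieldWeight, reported weight), copied from the excerpt.
clauses = [
    ("text", 0.5, 0.6368716, 0.026233677),
    ("name", 1.2, 1.8013449, 0.17808011),
    ("features", 1.0, 0.45033622, 0.03710002),  # boost of 1.0 assumed
]

results = {}
for field, boost, field_weight, reported in clauses:
    query_weight = boost * IDF * QUERY_NORM
    weight = query_weight * field_weight
    results[field] = (query_weight, weight, reported)
    print(field, query_weight, weight, reported)
```

The name field wins here mainly because of its fieldNorm (0.5 versus 0.125), not because of its boost, which is the kind of thing the explain output makes visible.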

Page 24: An Introduction to Basics of Search and Relevancy with Apache Solr

Solr’s XSLT “debugger”

http://localhost:8983/solr/select?

q=solr

&debugQuery=true

&wt=xslt

&tr=example.xsl

&fl=*,score

&qt=dismax

Page 25: An Introduction to Basics of Search and Relevancy with Apache Solr

Another way to view Explain data

• Solr 1.4 has Solritas

• Various features, including toggle explain display

• “Some assembly required…”

http://www.lucidimagination.com/blog/2009/11/04/solritas-solr-1-4s-hidden-gem/

Page 26: An Introduction to Basics of Search and Relevancy with Apache Solr

Checking your Index and IDF

Page 27: An Introduction to Basics of Search and Relevancy with Apache Solr

Checking what got Indexed

Bad Index = Bad Search

• Check Upper / lower case and Punctuation

• Bad Fields / Meta Data = Bad Facets, Filters, Sorting

Use built-in Schema Browser:

• Check each field

• Common words =

• IDF “Inverse Document Frequency”
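To see why the number of documents containing a word matters, here is a sketch of an IDF calculation. It assumes the classic Lucene default-similarity formula, 1 + ln(numDocs / (docFreq + 1)); other similarity implementations compute IDF differently. Under this formula the 3.6026897 shown in the explain output for docFreq=1 corresponds to a document count of 27 rather than the 26 printed there; the slides do not say why, so the counts below are illustrative only.

```python
import math

def idf(doc_freq, num_docs):
    """Classic Lucene-style IDF (assumed formula): 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# A rare word (in 1 document) versus increasingly common words.
for doc_freq in (1, 5, 13, 26):
    print(doc_freq, round(idf(doc_freq, 27), 4))
```

A word that appears in nearly every document ends up with an IDF close to 1, so it adds very little to the score.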

Page 28: An Introduction to Basics of Search and Relevancy with Apache Solr

Check IDF w/ the Schema Browser

Start at the Admin Screen:

http://localhost:8983/solr/admin

Schema Browser

• select a field

• change # to see more

Page 29: An Introduction to Basics of Search and Relevancy with Apache Solr

New Idea Engineering

About NIE

Page 30: An Introduction to Basics of Search and Relevancy with Apache Solr

NIE Resources

Search Dev Newsgroup: www.SearchDev.org

Newsletter & Whitepapers: www.ideaeng.com/current

Blogs: EnterpriseSearchBlog.com

SearchComponentsOnline.com

Page 31: An Introduction to Basics of Search and Relevancy with Apache Solr

Finish Line / Q & A

Review & Questions

Mark Bennett [email protected]

main 408-446-3460

cell 408-829-6513

Page 32: An Introduction to Basics of Search and Relevancy with Apache Solr

Q & A

These slides and a recorded presentation are available at

bit.ly/SolrRelevancy