
An Introduction to Basics of Search and Relevancy with Apache Solr


DESCRIPTION

The open source Apache Solr search engine provides powerful, versatile search application development technology so you can take full control of your search needs. Solr’s rich interfaces, its convenient server packaging of the underlying Apache Lucene search libraries as web services, and its near-limitless customizability let you tailor search to your application. From e-commerce to content management and endless variations in between, Solr is the right tool at the right time to turn the ever-growing volume and variety of data and documents to the advantage of your business.

http://www.lucidimagination.com/blog/2009/12/01/webinar-an-introduction-to-basics-of-search-and-relevancy-with-apache-solr/


Page 1: An Introduction to Basics of Search and Relevancy with Apache Solr

Introduction to basics of Search and Relevancy with Apache Solr

Mark Bennett, CTO

FEATURING:

Page 2: An Introduction to Basics of Search and Relevancy with Apache Solr

Lucid Imagination, Inc. 12/2/2009

Agenda

• Prerequisites: Browser Tricks

• Web “Command Line”

• The DisMax Parser

• Boosting Formula

• Explaining “Explain”

• Check Your Index!

• Q & A

• Resources / About NIE

Page 3: An Introduction to Basics of Search and Relevancy with Apache Solr

Prerequisite: Some Browser Tricks

Page 4: An Introduction to Basics of Search and Relevancy with Apache Solr

Browsers Matter – install them all!

Firefox:

• Default XML Rendering (also some versions of IE)

• Lots of Plugins

IE and Safari:

• Better “Explain” copy & paste (maintains line breaks)

• Better table copy and paste

Page 5: An Introduction to Basics of Search and Relevancy with Apache Solr

Larger Firefox “Command Line”

Customize the Firefox URL box as a command line in 3 easy steps:

1. Toolbar: Right Click

2. Customize… Add New Toolbar

3. URL bar -> CLICK and DRAG

Page 6: An Introduction to Basics of Search and Relevancy with Apache Solr

Turn off Solr HTTP Caching

• Change in solrconfig.xml

• Disable HTTP 304 ("Not Modified") responses in the <httpCaching> section, e.g. with never304="true"

• Turn it back on before you deploy!

Page 7: An Introduction to Basics of Search and Relevancy with Apache Solr

Understanding Solr’s “Web Command Line”

Page 8: An Introduction to Basics of Search and Relevancy with Apache Solr

The “Web Command Line”

CLI CONCEPT                 SOLR EQUIVALENT

Command Prompt              URL bar

-o or --foo bar             ? or & and =

(spaces)                    +

some punctuation            %nn

output                      XML or HTML

Command line “adapter”      Curl

• Script files can call URLs

• Not built into Windows – try cygwin

Page 9: An Introduction to Basics of Search and Relevancy with Apache Solr


Solr “Command Line”

• Typical Base URL

• http://localhost:8983/solr/select?...

• Basic Input (not counting dismax)

• q = query, fq = filter query

• df = default field

• qt = query type (standard / dismax)

• Controlling Output (lots more!!!)

• debugQuery = true

• wt = “what type” (actually “writer type”)

• standard/XML, xslt (with tr=), javabin, json…

• fl = *,score (which fields)
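The parameters above can also be assembled in a script rather than typed into the URL bar. The following is a minimal sketch in Python, assuming the example base URL from this slide; the helper name solr_url and the query text "apache solr" are illustrative, not part of Solr. It also shows the encoding described earlier: spaces become + and punctuation becomes %nn.

```python
from urllib.parse import urlencode

# Base URL taken from the slide; adjust host and port for your own server.
BASE_URL = "http://localhost:8983/solr/select"


def solr_url(query, **params):
    """Return a select URL for `query` plus any extra request parameters.

    `solr_url` is a hypothetical helper name used only for this sketch.
    """
    # urlencode turns spaces into "+" and other punctuation into %nn escapes.
    return BASE_URL + "?" + urlencode({"q": query, **params})


# Ask for all stored fields plus the score, with debug output, as JSON.
url = solr_url("apache solr", fl="*,score", debugQuery="true", wt="json")
print(url)
```

The printed URL can be pasted into the browser or passed to curl.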

Page 10: An Introduction to Basics of Search and Relevancy with Apache Solr

Example: search for “solr”

http://localhost:8983/solr/select?q=solr&debugQuery=true

With Firefox you get XML output you can expand and collapse

With MSIE* and Safari, not so much

* Some versions

Page 11: An Introduction to Basics of Search and Relevancy with Apache Solr

Detailed Debug & Explain Output

http://localhost:8983/solr/select?q=solr&debugQuery=true

<str name="parsedquery">text:solr</str> …

<lst name="explain">

<str name="SOLR1000">

0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:

1.4142135 = tf(termFreq(text:solr)=2)

3.6026897 = idf(docFreq=1, numDocs=26)

0.125 = fieldNorm(field=text, doc=13)

</str>

</lst>
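The "product of" line in the explain output can be checked by hand. This is a small sketch in Python that multiplies the three factors shown above; it assumes tf is the square root of the term frequency, which matches the 1.4142135 reported for termFreq=2. The function name field_weight is just a label for this sketch.

```python
import math


def field_weight(term_freq, idf, field_norm):
    """Multiply the three factors listed under fieldWeight in the explain output."""
    # Assumption: tf is sqrt(termFreq), consistent with tf=1.4142135 for termFreq=2.
    tf = math.sqrt(term_freq)
    return tf * idf * field_norm


# Values copied from the explain output for document SOLR1000.
score = field_weight(term_freq=2, idf=3.6026897, field_norm=0.125)
print(score)
```

The result agrees with the 0.6368716 at the top of the explain entry to within rounding.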

Page 12: An Introduction to Basics of Search and Relevancy with Apache Solr

A look at the DisMax query parser

Page 13: An Introduction to Basics of Search and Relevancy with Apache Solr

Solr DisMax: Defined

• What is it?

• Dis-junction: text matched across multiple fields

• Max-imum: highest match score wins

• How do you get it?

• Configured in:

• solrconfig.xml and schema.xml

• Called with:

• qt=dismax

• Adjusted with:

• mm, bf, qf, pf, qs, ps, tie
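These parameters are normally set in solrconfig.xml, but they can also be passed on the URL while experimenting. The sketch below builds such a request in Python. The qf, pf, bf and tie values are read off the parsed query shown later in this deck; the mm value is purely an illustrative assumption.

```python
from urllib.parse import urlencode

params = {
    "q": "solr",
    "qt": "dismax",
    # Query fields and boosts, as they appear in the deck's parsed query.
    "qf": "text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4",
    # Phrase fields and boosts, likewise taken from the parsed query.
    "pf": "text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9",
    # Boost functions matching the two FunctionQuery clauses in the parsed query.
    "bf": "ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3",
    "tie": "0.01",
    "mm": "2",  # hypothetical minimum-match setting, for illustration only
    "debugQuery": "true",
}

url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

Changing one value at a time and comparing the debug output is a practical way to see what each parameter contributes.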

Page 14: An Introduction to Basics of Search and Relevancy with Apache Solr

Solr DisMax: Pros and Cons

General Benefits

• Multiple Fields

• Multiple Relevancy Rules

• Great for Freshness / Popularity

Issues to be Aware of

• Tie-in between schema.xml & solrconfig.xml

• Trouble with some CJK (Chinese, Japanese, Korean)

• Limited wildcard / field / range support

• Difficult to customize and debug

• Trouble with shingles

• Understand mm!

Page 15: An Introduction to Basics of Search and Relevancy with Apache Solr

About the “dis” and the “max”

Distributed across multiple fields

• Break up the query into words

• Each part becomes a field clause

• Like an OR but with extra credit

Takes the Maximum of each set

• Word 1 had highest score in Title

• Word 2 very dense in the doc body

• Adds in Tie breaker if in multiple fields
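The "max plus tie breaker" idea can be sketched in a few lines of Python. This is not Solr's code: the function name dismax_score and the per-field scores are made up for illustration, and the only thing borrowed from Solr is the rule "best field score plus tie times the other field scores".

```python
def dismax_score(field_scores, tie=0.0):
    """Score one query word: best field score plus `tie` times the rest."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


# Hypothetical per-field scores, in the order [title, body].
word1 = dismax_score([0.9, 0.2], tie=0.1)  # word 1 scores highest in the title
word2 = dismax_score([0.1, 0.7], tie=0.1)  # word 2 is dense in the body

# The per-word results are then added together, as in a Boolean OR.
total = word1 + word2
print(word1, word2, total)
```

With tie=0 only the best field counts; with tie=1 the fields are simply summed.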

Page 16: An Introduction to Basics of Search and Relevancy with Apache Solr

Coming soon: Extended DisMax

Improvements

• Flexible case Boolean ops: AND/and, OR/or

• Auto-escape punctuation & -> \&, etc.

• Improved Proximity Boosting (via word bigrams)

• Other changes in stop words, relevancy calc, URL arguments

How to get it

• Post 1.4 patch, planned for 1.5

• Details + Patch in JIRA: SOLR-1553

http://issues.apache.org/jira/browse/SOLR-1553

• TBD: change URL option qt=edismax (or qt=dismax)

Page 17: An Introduction to Basics of Search and Relevancy with Apache Solr

Boosting Formulas

Page 18: An Introduction to Basics of Search and Relevancy with Apache Solr

Boost Functions in Dismax

High Level Feature

• Numeric functions for scoring

• sum(), product(), sqrt(), log(), etc.

• Boost on recent dates, user popularity

Good Combination: Reverse-Ordinal & Reciprocal

• Position in index : ord(), reverse is: rord()

• Larger y for smaller x: recip()

How to get it

• URL parameter bf = “boost function”

• Configured in solrconfig.xml

• See http://wiki.apache.org/solr/FunctionQuery

Page 19: An Introduction to Basics of Search and Relevancy with Apache Solr

“Freshness”: Boosting Recent Dates

Linear (x,m,c) = m x + c

recip(x,m,a,c) = a / (m x + c)

Date         Position   N-Position   Linear     recip
             ord()      rord()       (x,m,c)    (x,m,a,c)
1/1/2000     1          120          1120       0.89286
2/1/2000     2          119          1119       0.89366
3/1/2000     3          118          1118       0.89445
…            …          …            …          …
1/1/2005     61         60           1060       0.94340
…            …          …            …          …
1/1/2009     109        12           1012       0.98814
2/1/2009     110        11           1011       0.98912
3/1/2009     111        10           1010       0.99010
4/1/2009     112        9            1009       0.99108
5/1/2009     113        8            1008       0.99206
6/1/2009     114        7            1007       0.99305
7/1/2009     115        6            1006       0.99404
8/1/2009     116        5            1005       0.99502
9/1/2009     117        4            1004       0.99602
10/1/2009    118        3            1003       0.99701
11/1/2009    119        2            1002       0.99800
12/1/2009    120        1            1001       0.99900

WIKI EXAMPLE: recip( rord(creationDate), 1, 1000, 1000 )

slope m = 1

numerator a = 1000

intercept c = 1000 (aka "b")

[Chart: recip value by date; y-axis from 0.880 to 1.000]
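The recip column can be reproduced directly from the formula a / (m x + c). The sketch below is plain Python arithmetic, not a call into Solr; it uses the wiki example's parameters (m=1, a=1000, c=1000) and three rord() positions from the table.

```python
def recip(x, m, a, c):
    """Reciprocal boost: a / (m*x + c). Smaller x gives a larger result."""
    return a / (m * x + c)


# rord() positions from the table: oldest document, a middle one, the newest.
for rord_position in (120, 60, 1):
    boost = recip(rord_position, m=1, a=1000, c=1000)
    print(rord_position, round(boost, 5))
```

Because a and c are equal, the newest document gets a boost just under 1.0 and older documents fall away slowly; a smaller c would make the curve steeper.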

Page 20: An Introduction to Basics of Search and Relevancy with Apache Solr

Sifting through Solr’s “Explain” output

Page 21: An Introduction to Basics of Search and Relevancy with Apache Solr

DisMax Example for “solr”

INPUT:

http://localhost:8983/solr/select?q=solr&debugQuery=true&qt=dismax

DEBUG OUTPUT: (1 OF 2)

<str name="parsedquery">
+DisjunctionMaxQuery((id:solr^10.0 | text:solr^0.5 | cat:solr^1.4 | manu:solr^1.1 | name:solr^1.2 | features:solr | sku:solr^1.5)~0.01)
DisjunctionMaxQuery((manu_exact:solr^1.9 | features:solr^1.1 | text:solr^0.2 | manu:solr^1.4 | name:solr^1.5)~0.01)
FunctionQuery((top(ord(popularity)))^0.5)
FunctionQuery((1000.0/(1.0*float(top(rord(price)))+1000.0))^0.3)
</str>

Page 22: An Introduction to Basics of Search and Relevancy with Apache Solr

DisMax explain output for a single word query

<lst name="explain"><str name="SOLR1000">

0.74609417 = (MATCH) sum of:
  0.4476144 = (MATCH) max plus 0.01 times others of:
    0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
      0.04119147 = queryWeight(text:solr^0.5), product of:
        0.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
      0.09885953 = queryWeight(name:solr^1.2), product of:
        1.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
    0.03710002 = (MATCH) weight(features:solr in 13), product of:
      0.08238294 = queryWeight(features:solr), product of:
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.44520026 = (MATCH) weight(sku:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(sku:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      3.6026897 = (MATCH) fieldWeight(sku:solr in 13), product of:
        1.0 = tf(termFreq(sku:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        1.0 = fieldNorm(field=sku, doc=13)
  0.22311316 = (MATCH) max plus 0.01 times others of:
    0.040810023 = (MATCH) weight(features:solr^1.1 in 13), product of:
      0.09062123 = queryWeight(features:solr^1.1), product of:
        1.1 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.01049347 = (MATCH) weight(text:solr^0.2 in 13), product of:
      0.016476588 = queryWeight(text:solr^0.2), product of:
        0.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.22260013 = (MATCH) weight(name:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(name:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
  0.06860119 = (MATCH) FunctionQuery(top(ord(popularity))), product of:
    6.0 = ord(popularity)=6
    0.5 = boost
    0.022867065 = queryNorm
  0.0067654043 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(price)))+1000.0)), product of:
    0.9861933 = 1000.0/(1.0*float(rord(price)=14)+1000.0)
    0.3 = boost
    0.022867065 = queryNorm

</str></lst>
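The top-level number in this explain output is the sum of four parts: two "max plus 0.01 times others" groups and two function-query boosts. The Python sketch below recomputes each part from the values printed above. The helper name max_plus_tie is made up for this sketch; the numbers are copied from the slide.

```python
def max_plus_tie(scores, tie):
    """Best score plus `tie` times the sum of the remaining scores."""
    best = max(scores)
    return best + tie * (sum(scores) - best)


# First group: weights for text, name, features and sku.
first_clause = max_plus_tie(
    [0.026233677, 0.17808011, 0.03710002, 0.44520026], tie=0.01
)

# Second group: weights for features, text and name.
second_clause = max_plus_tie([0.040810023, 0.01049347, 0.22260013], tie=0.01)

# Function queries: function value * boost * queryNorm.
popularity_boost = 6.0 * 0.5 * 0.022867065
price_boost = 0.9861933 * 0.3 * 0.022867065

total = first_clause + second_clause + popularity_boost + price_boost
print(first_clause, second_clause, popularity_boost, price_boost, total)
```

Each printed value agrees with the corresponding line of the explain output to within rounding, which makes this a handy cross-check when a score looks surprising.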

Page 23: An Introduction to Basics of Search and Relevancy with Apache Solr

“Explain” example:

...

0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:

0.04119147 = queryWeight(text:solr^0.5), product of:

0.5 = boost

3.6026897 = idf(docFreq=1, numDocs=26)

0.022867065 = queryNorm

0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:

1.4142135 = tf(termFreq(text:solr)=2)

3.6026897 = idf(docFreq=1, numDocs=26)

0.125 = fieldNorm(field=text, doc=13)

0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:

0.09885953 = queryWeight(name:solr^1.2), product of:

1.2 = boost

3.6026897 = idf(docFreq=1, numDocs=26)

0.022867065 = queryNorm

1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:

1.0 = tf(termFreq(name:solr)=1)

3.6026897 = idf(docFreq=1, numDocs=26)

0.5 = fieldNorm(field=name, doc=13)

0.03710002 = (MATCH) weight(features:solr in 13), product of:

0.08238294 = queryWeight(features:solr), product of:

3.6026897 = idf(docFreq=1, numDocs=26)

0.022867065 = queryNorm

0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:

1.0 = tf(termFreq(features:solr)=1)

3.6026897 = idf(docFreq=1, numDocs=26)

0.125 = fieldNorm(field=features, doc=13)

...

tf (termFreq(text:solr)=2)

idf (docFreq=1, numDocs=26)
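Each weight(...) entry is the product of a queryWeight and a fieldWeight, and the queryWeight is itself boost times idf times queryNorm. The Python sketch below redoes that arithmetic for the text field using the numbers printed above; the function names are labels for this sketch only.

```python
def query_weight(boost, idf, query_norm):
    """Query-side factor: boost * idf * queryNorm."""
    return boost * idf * query_norm


def term_weight(q_weight, f_weight):
    """Final per-field weight: queryWeight * fieldWeight."""
    return q_weight * f_weight


# Values copied from the text:solr^0.5 entry in the explain output.
qw = query_weight(boost=0.5, idf=3.6026897, query_norm=0.022867065)
weight = term_weight(qw, f_weight=0.6368716)
print(qw, weight)
```

Note that idf appears on both sides of the product, so a rare term is rewarded twice.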

Page 24: An Introduction to Basics of Search and Relevancy with Apache Solr

Solr’s XSLT “debugger”

http://localhost:8983/solr/select?

q=solr

&debugQuery=true

&wt=xslt

&tr=example.xsl

&fl=*,score

&qt=dismax

Page 25: An Introduction to Basics of Search and Relevancy with Apache Solr

Another way to view Explain data

• Solr 1.4 has Solritas

• Various features, including toggle explain display

• “Some assembly required…”

http://www.lucidimagination.com/blog/2009/11/04/solritas-solr-1-4s-hidden-gem/

Page 26: An Introduction to Basics of Search and Relevancy with Apache Solr

Checking your Index and IDF

Page 27: An Introduction to Basics of Search and Relevancy with Apache Solr

Checking what got Indexed

Bad Index = Bad Search

• Check Upper / lower case and Punctuation

• Bad Fields / Meta Data = Bad Facets, Filters, Sorting

Use built-in Schema Browser:

• Check each field

• Common words =

• IDF “Inverse Document Frequency”
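To see why common words matter, it helps to compute IDF for a rare and a common term. The Python sketch below assumes the classic Lucene formula, 1 + ln(numDocs / (docFreq + 1)); that formula is an assumption here, not something stated on the slides. With it, the 3.6026897 seen in the explain output corresponds to docFreq=1 and a document count of 27, so the count Lucene used may differ by one from the numDocs=26 it prints (possibly a deleted document, but that is a guess).

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency, assuming 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


rare = idf(doc_freq=1, num_docs=27)     # term found in a single document
common = idf(doc_freq=20, num_docs=27)  # hypothetical term found in most documents
print(rare, common)
```

A term that appears in nearly every document contributes very little to the score, which is why the Schema Browser's top-terms list is worth checking for each field.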

Page 28: An Introduction to Basics of Search and Relevancy with Apache Solr

Check IDF w/ the Schema Browser

Start at the Admin Screen:

Schema Browser

• select a field

• change # to see more

http://localhost:8983/solr/admin

Page 29: An Introduction to Basics of Search and Relevancy with Apache Solr


New Idea Engineering

About NIE

Page 30: An Introduction to Basics of Search and Relevancy with Apache Solr

NIE Resources

Search Dev Newsgroup: www.SearchDev.org

Newsletter & Whitepapers: www.ideaeng.com/current

Blogs: EnterpriseSearchBlog.com

SearchComponentsOnline.com

Page 31: An Introduction to Basics of Search and Relevancy with Apache Solr

Finish Line / Q & A

Review & Questions

Mark Bennett [email protected]

main 408-446-3460

cell 408-829-6513

Page 32: An Introduction to Basics of Search and Relevancy with Apache Solr

Q & A

These slides and a recorded presentation are available at

bit.ly/SolrRelevancy