Searching with Solr Tom Hill [email protected] eBig Java SIG, June 18th, 2008

An Introduction to Solr


DESCRIPTION

A brief introduction to using Apache Solr for implementing search for your website. Download the ppt to see comments which add more detail. Presented at eBig Java SIG, Oakland, CA. June 2008


Page 1: An Introduction to Solr

Searching with Solr

Tom Hill

[email protected]

eBig Java SIG, June 18th, 2008

Page 2: An Introduction to Solr

Tonight's Talk

•Tonight's Talk should run about 1 1/2 hours

•About Solr

•Background & overview

•Installing & Bringing Up Solr

•Rest Interface & Java Client

•Configuring Solr

Page 3: An Introduction to Solr

Why Implement Search?

•Does your site need search?

•Do you need to implement it, or is Google enough?

• Just text or Structured Data?

• Do you need to control ranking?

Page 4: An Introduction to Solr

What is Solr?

•Web application for text search

•A wrapper around Apache Lucene

•Lucene is a library (.jar file)

•Solr is a web app (.war file)

•Written at CNet, now at Apache

Page 5: An Introduction to Solr

What is Lucene?

•Text search library in Java

•Fast, feature rich.

•Written by Doug Cutting

•Active Apache development community

•Versions also in C++, C#, Ruby, Python, Delphi, Lisp, etc...

Page 6: An Introduction to Solr

Why Solr?

•Reliable

•Fast

•Supported

•Open Source

•Tunable Scoring

Page 7: An Introduction to Solr

Solr Versions

•Current Version is 1.2

• A year old

• 1.3 is coming "sometime"

•Large number of features in HEAD

• Use the latest from subversion for new projects

Page 8: An Introduction to Solr

Alternatives to Solr

•Just Use Google

•Use Lucene

•Use Your Database

•Commercial Libraries

•Write your own

Page 9: An Introduction to Solr

What Solr is Not

•A replacement for a relational database

•An embedded database*

•Fully cross platform :-(

• Replication depends on unix FS

• Admin scripts are bash (minor)

Page 10: An Introduction to Solr

Solr Sites

•CNet (Reviews & Products)

•Internet Archive (Collections)

•Netflix (Movies)

•Zvents (Events)

•StripSearch.ws (Comics)

•And many more

Page 11: An Introduction to Solr

Features

Here's a quick look at some of the features of Solr, as implemented on Zvents.com

Page 12: An Introduction to Solr
Page 13: An Introduction to Solr

Faceted Navigation

•Groups the results by category

•Can do multiple facets at once

•Returns matching counts

Page 14: An Introduction to Solr

Additional Constraints

Page 15: An Introduction to Solr

Synonyms, etc.

Page 16: An Introduction to Solr

Solr Overview

Page 17: An Introduction to Solr

Simple Webapp

[Diagram: Web Servers [1..n], Database Master, Database Slaves [0..n], Solr Master, Solr Slaves [0..n]]

Page 18: An Introduction to Solr

Scaling Solr

•Master/Slave architecture

•Writes to master/reads to slaves

•Replication: Periodic transfers, not continuous

•Rsync
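The write-to-master, read-from-slaves split can be sketched on the client side as a small router. This is only an illustration: the class name, method names and host URLs are hypothetical, and a real deployment would more likely put a load balancer in front of the slaves.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: picks a Solr URL depending on the kind of request. */
final class SolrRouter {
    private final String masterUrl;
    private final List<String> slaveUrls;
    private final AtomicInteger next = new AtomicInteger();

    SolrRouter(String masterUrl, List<String> slaveUrls) {
        this.masterUrl = masterUrl;
        this.slaveUrls = List.copyOf(slaveUrls);
    }

    /** Adds, deletes and commits always go to the master. */
    String urlForUpdate() {
        return masterUrl + "/update";
    }

    /** Queries rotate across the slaves; with no slaves, fall back to the master. */
    String urlForQuery() {
        if (slaveUrls.isEmpty()) {
            return masterUrl + "/select";
        }
        int i = Math.floorMod(next.getAndIncrement(), slaveUrls.size());
        return slaveUrls.get(i) + "/select";
    }

    public static void main(String[] args) {
        // Host names are made up for the example.
        SolrRouter router = new SolrRouter("http://master:8983/solr",
                List.of("http://slave1:8983/solr", "http://slave2:8983/solr"));
        System.out.println(router.urlForUpdate());
        System.out.println(router.urlForQuery());
        System.out.println(router.urlForQuery());
    }
}
```

Because replication is periodic, a document written to the master is not visible through `urlForQuery()` until the next snapshot has been pulled by the slaves.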

Page 19: An Introduction to Solr

Updates

•Updates flush caches, bad for performance

•Master is therefore much slower than slaves

• So send all queries to slaves

•Depends on your update rates

Page 20: An Introduction to Solr

Solr's Data Model

• Solr maintains a collection of documents

• A document is a collection of fields & values

• A field can occur multiple times in a document

• Documents are immutable.

• They can be deleted, and a new version added, however.

Page 21: An Introduction to Solr

Querying

•HTTP request

• http://localhost:8080/comix/select/?q=java
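Since a query is just an HTTP GET, the only real work on the client is URL-encoding the parameters. A minimal sketch using only the JDK follows; the class name `SelectUrl` is made up for illustration, and the base URL is the one from the slide.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that builds a /select URL from request parameters. */
final class SelectUrl {
    static String build(String baseUrl, Map<String, String> params) {
        StringJoiner queryString = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            // Both names and values are form-encoded (space becomes '+').
            queryString.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return baseUrl + "/select/?" + queryString;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "title:\"The Right Way\" AND text:go");
        params.put("indent", "on");
        System.out.println(build("http://localhost:8080/comix", params));
    }
}
```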

Page 22: An Introduction to Solr

Solr Query Syntax

•Lucene Query Syntax + a bit

•paris

•city:paris

•title:"The Right Way" AND text:go

•id:[* TO *]

Page 23: An Introduction to Solr

Solr Query Syntax II

•-inStock:false

•te?t

•theat*

•te*t

•test~
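Characters such as `?`, `*`, `~` and `:` are query syntax, so text typed by a user may need escaping before it is spliced into a query for the standard handler. The following is a hand-rolled sketch, not a Solr or Lucene API; the class name and the exact character list are assumptions based on the Lucene query syntax.

```java
/** Hypothetical helper: backslash-escapes Lucene query syntax characters. */
final class QueryEscaper {
    // Characters treated as special by the query parser (assumed list).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    static String escape(String userInput) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < userInput.length(); i++) {
            char c = userInput.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Without escaping, "te?t" is a single-character wildcard query.
        System.out.println(escape("te?t"));
        System.out.println(escape("city:paris"));
    }
}
```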

Page 24: An Introduction to Solr

Using Solr

•Getting data into Solr

•Getting data out of Solr

Page 25: An Introduction to Solr

Getting Data Into Solr

•POST it.

<add>
  <doc>
    <field name="employeeId">05991</field>
    <field name="office">Bridgewater</field>
    <field name="skills">Perl</field>
    <field name="skills">Java</field>
  </doc>
  [<doc> ... </doc>[<doc> ... </doc>]]
</add>
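If you are not using a client library, the `<add>` message can be assembled by hand. A minimal sketch, assuming field values are plain strings and that a multi-valued field is simply repeated; the class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that renders one document as a Solr <add> message. */
final class AddXml {
    /** Escapes the characters that are unsafe in XML text and attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String addDoc(Map<String, List<String>> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, List<String>> e : fields.entrySet()) {
            // A field with several values is emitted once per value.
            for (String value : e.getValue()) {
                sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
                  .append(escape(value)).append("</field>");
            }
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("employeeId", List.of("05991"));
        fields.put("office", List.of("Bridgewater"));
        fields.put("skills", List.of("Perl", "Java"));
        System.out.println(addDoc(fields));
    }
}
```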

Page 28: An Introduction to Solr

Committing

•Nothing shows up in the index until you commit

•You can just POST <commit/> to http://host:port/solr/update
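A commit is an ordinary POST with `<commit/>` as the body. The sketch below builds that request with the JDK's `java.net.http` API (Java 11 or later, so much newer than the Solr version discussed here); the class name is hypothetical and the URL is whatever your update endpoint is.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that builds the HTTP request for a Solr commit. */
final class CommitRequest {
    static HttpRequest build(String updateUrl) {
        return HttpRequest.newBuilder(URI.create(updateUrl))
                .header("Content-Type", "text/xml")
                .POST(HttpRequest.BodyPublishers.ofString("<commit/>"))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://localhost:8983/solr/update");
        // Only prints the request; nothing is sent over the network here.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send it, pass the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and check that the response status is 200.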

Page 29: An Introduction to Solr

Getting Data Out

http://localhost:8080/comix/select/?q=data&indent=on

<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="indent">on</str>
    <str name="q">data</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0">
  <doc>
    <str name="id">strip.3136</str>
    <str name="release_date">1992-05-07</str>
    <date name="timestamp">2008-02-28T10:06:01.682Z</date>
    <str name="type">strip</str>
  </doc>
</result>
</response>
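The XML response is regular enough to read with the JDK's DOM parser. A minimal sketch that pulls out `numFound` and the `id` of each returned document; the class name is hypothetical, and it assumes the unique key is a `<str name="id">` element as in the response shown here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical holder for the parts of a Solr XML response we care about. */
final class SolrXmlResponse {
    final int numFound;
    final List<String> ids;

    private SolrXmlResponse(int numFound, List<String> ids) {
        this.numFound = numFound;
        this.ids = ids;
    }

    static SolrXmlResponse parse(String xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element result = (Element) dom.getElementsByTagName("result").item(0);
            int numFound = Integer.parseInt(result.getAttribute("numFound"));
            List<String> ids = new ArrayList<>();
            // Only <str> elements inside <result> are inspected, so request params are skipped.
            NodeList strs = result.getElementsByTagName("str");
            for (int i = 0; i < strs.getLength(); i++) {
                Element str = (Element) strs.item(i);
                if ("id".equals(str.getAttribute("name"))) {
                    ids.add(str.getTextContent());
                }
            }
            return new SolrXmlResponse(numFound, ids);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse Solr response", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<response><result name=\"response\" numFound=\"2\" start=\"0\">"
                + "<doc><str name=\"id\">strip.3136</str></doc></result></response>";
        SolrXmlResponse r = parse(xml);
        System.out.println(r.numFound + " found, ids on this page: " + r.ids);
    }
}
```

Note that `numFound` is the total number of matches, which can be larger than the number of documents actually returned on the page.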

Page 32: An Introduction to Solr

Getting Data Out

JSON format

http://localhost:8080/comix/select/?q=data&indent=on

{
 "responseHeader":{
  "status":0,
  "QTime":1,
  "params":{
   "wt":"json",
   "rows":["1", "1"],
   "start":"0",
   "indent":"on",
   "q":"data",
   "version":"2.2"}},
 "response":{"numFound":2,"start":0,"docs":[
  {
   "feature_id":"3",
   "release_date":"1992-05-07",
   "id":"strip.3136",
   "timestamp":"2008-02-28T10:06:01.682Z"}]
 }}

Page 33: An Introduction to Solr

Debug Query Option

•Add &debugQuery=on to request params

•Returns parsed form of query

<str name="rawquerystring">c.i.a</str>
<str name="querystring">c.i.a</str>
<str name="parsedquery">PhraseQuery(text:"c i a")</str>
<str name="parsedquery_toString">text:"c i a"</str>

Page 34: An Introduction to Solr

Debug Query Option II

•Add &debugQuery=on to request params

•Returns scoring information

<str name="id=strip.2781,internal_docid=29854">
2.6219895 = (MATCH) fieldWeight(text:calvin in 29854), product of:
  1.0 = tf(termFreq(text:calvin)=1)
  2.6219895 = idf(docFreq=6222)
  1.0 = fieldNorm(field=text, doc=29854)
</str>
<str name="id=strip.4078,internal_docid=31151">
2.6219895 = (MATCH) fieldWeight(text:calvin in 31151), product of:
  1.0 = tf(termFreq(text:calvin)=1)
  2.6219895 = idf(docFreq=6222)
  1.0 = fieldNorm(field=text, doc=31151)
</str>
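The explain output says the field weight is the product of tf, idf and fieldNorm. The sketch below reproduces that arithmetic. The `tf` and `idf` formulas are my understanding of Lucene's default similarity of that era (square root of the term frequency, and one plus the log of numDocs over docFreq plus one); treat them as assumptions. The class name is made up.

```java
/** Hypothetical re-implementation of the arithmetic shown in the explain output. */
final class ClassicScore {
    /** Assumed default: square root of the raw term frequency. */
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    /** Assumed default: rarer terms (low docFreq) get a higher weight. */
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** The product reported as fieldWeight in the debug output. */
    static double fieldWeight(double tf, double idf, double fieldNorm) {
        return tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        // numDocs is not in the output, so the idf value is taken from it directly.
        System.out.println(fieldWeight(tf(1), 2.6219895, 1.0));
    }
}
```

Both documents score identically because the term occurs once in each and both have a fieldNorm of 1.0, so only the shared idf contributes.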

Page 35: An Introduction to Solr

Deleting Data

•POST a delete command, by id or by query:

<delete><id>35</id></delete>
<delete><query>city:paris</query></delete>

Page 36: An Introduction to Solr

Command Line Control

curl http://localhost:8983/solr/update -H "Content-type: text/xml" --data-binary '<commit/>'

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">20</int>
</lst>
</response>

Page 37: An Introduction to Solr

Solr in 3 minutes!

•Download Solr from Apache

•Untar

•"ant example"

•Start the example app

•Load data into Solr

•Query

Page 38: An Introduction to Solr

Solr in Ten Minutes

•In $CATALINA_HOME/conf/Catalina/localhost/foo.xml

<Context docBase="/var/solr/apache-solr-1.2.0.war" debug="0" crossContext="true" >
  <Environment name="solr/home" type="java.lang.String" value="/var/solr" override="true" />
</Context>

•Copy Solr's example/solr dir to /var/solr

•Edit schema.xml and solrconfig.xml

•Load data into Solr

Page 39: An Introduction to Solr

Directory Layout

•${solr.home}/conf

•schema.xml

•solrconfig.xml

•${solr.home}/data

•${solr.home}/logs

•${solr.home}/bin

Page 40: An Introduction to Solr

Java Solr Client

•Called SolrJ

•Not in Solr 1.2.

• I grabbed it from HEAD in svn

• Works with Solr 1.2

•Add/Delete/Query/Commit/Optimize

Page 41: An Introduction to Solr

Adding Docs w/SolrJ

Given Map<String, String> fields;

CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);

SolrInputDocument doc = new SolrInputDocument();
for (Map.Entry<String, String> e : fields.entrySet()) {
    doc.addField(e.getKey(), e.getValue());
}

UpdateResponse res = server.add(doc);

Page 42: An Introduction to Solr

Deleting Docs w/SolrJ

CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);

UpdateResponse res;

res = server.deleteById("100");
res = server.deleteByQuery("city:paris");

Page 43: An Introduction to Solr

Simple Query

CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);

SolrQuery query = new SolrQuery();

query.setQuery("dance");

QueryResponse rsp = server.query(query);

Page 44: An Introduction to Solr

More Interesting Query

CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);

SolrQuery query = new SolrQuery();
query.setQuery("dance");
query.setFacet(true);
query.addFacetField("city");
query.setFacetMinCount(1);
query.addSortField("price", SolrQuery.ORDER.asc);
QueryResponse rsp = server.query(query);

Page 45: An Introduction to Solr

Query Responses

QueryResponse qr = server.query(query);
SolrDocumentList docs = qr.getResults();
List<FacetField> lf = qr.getFacetFields();
for (FacetField ff : lf) {
    String fieldName = ff.getName();
    List<FacetField.Count> lc = ff.getValues();
    for (FacetField.Count c : lc) {
        String countName = c.getName();
        long count = c.getCount();
    }
}

Page 46: An Introduction to Solr

Other Commands

•Commit

• server.commit()

•Optimize

• server.optimize()

•Not too complicated!

Page 47: An Introduction to Solr

Request Handlers

•Request handlers define how the query is processed.

•Two main types

•StandardRequestHandler

•DisMaxRequestHandler

•You can implement your own

•Changing in Solr 1.3

Page 48: An Introduction to Solr

"Standard" Request Handler

•Accepts Solr Query Syntax

•I tend to use it for my queries, not user queries.

Page 49: An Introduction to Solr

DisMaxRequestHandler

•Recommended for user queries

•Allows simple user keywords to be applied to multiple fields, with weighting.

•Boost Functions

•Boost Queries

Page 50: An Introduction to Solr

Boost Functions

•Allow you to influence scoring at run time

•Computationally Expensive!

•Really useful for tuning scoring

•linear(x,2,4) returns 2*x+4

• x is a field
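To make the arithmetic concrete, here is a small sketch of two function shapes. `linear` follows the definition on the slide. `recip`, which appears in the `bf` parameter of the DisMax configuration later in this deck, is written as a/(m*x+b), which is my understanding of Solr's definition and should be treated as an assumption. In Solr, x would be a per-document field value; here it is just a number.

```java
/** Hypothetical stand-ins for two Solr boost function shapes. */
final class BoostFunctions {
    /** linear(x,m,c) = m*x + c, as on the slide. */
    static double linear(double x, double m, double c) {
        return m * x + c;
    }

    /** recip(x,m,a,b) = a / (m*x + b): large x gives a small boost (assumed definition). */
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    public static void main(String[] args) {
        System.out.println(linear(3, 2, 4));
        System.out.println(recip(1000, 1, 1000, 1000));
    }
}
```

The function is evaluated for every matching document at query time, which is where the computational cost comes from.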

Page 51: An Introduction to Solr

The Solr Schema

•schema.xml

•Defines types used in this webapp

•Defines the fields and their types

•Defines "copyFields"

•READ THE EXAMPLE SCHEMA.XML

Page 52: An Introduction to Solr

Types

•Types define processing for a field

•How the words are split (Whitespace? Punctuation? CIA != C.I.A.)

•Stemming

•Case Folding, etc

•Predefined date, int, float, etc

Page 53: An Introduction to Solr

Analysis: Index and Query Time

•Types have two modes

•Index Time

•Query Time

Page 54: An Introduction to Solr

Simple Text Field

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
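What the index-time analyzer above does to a piece of text can be approximated in a few lines. This is a toy model, not Solr code: it splits on whitespace and drops stopwords case-insensitively, and it assumes the stopword set is given in lower case. The class name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical model of a whitespace tokenizer followed by a stop filter. */
final class SimpleAnalyzer {
    static List<String> analyze(String text, Set<String> lowerCaseStopwords) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for blank input
            }
            // ignoreCase="true": compare in lower case, but keep the token as written.
            if (lowerCaseStopwords.contains(token.toLowerCase(Locale.ROOT))) {
                continue;
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Right Way", Set.of("the", "a")));
        System.out.println(analyze("C.I.A. files", Set.of("the", "a")));
    }
}
```

Note that nothing in this chain lower-cases or strips punctuation, so "C.I.A." stays one token and will not match "CIA" unless a further filter is added.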

Page 55: An Introduction to Solr

Analysis & Facets

•Make sure to use an untokenized field for faceting.

•"San Jose" != "San" "Jose"
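The reason is easy to see if you count facet values by hand. The toy sketch below counts the same city values twice, once as whole strings (what an untokenized field gives you) and once split on whitespace (what a tokenized text field gives you). The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of faceting on whole values versus tokens. */
final class FacetCounts {
    /** Counts each field value as a single term, like an untokenized field. */
    static Map<String, Integer> byWholeValue(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    /** Counts each whitespace-separated token, like a tokenized text field. */
    static Map<String, Integer> byToken(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            for (String token : v.trim().split("\\s+")) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> cities = List.of("San Jose", "San Francisco", "San Jose");
        System.out.println(byWholeValue(cities));
        System.out.println(byToken(cities));
    }
}
```

With tokens, "San" becomes a facet value of its own, which is rarely what a user browsing by city expects.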

Page 56: An Introduction to Solr

Fields

•Elements of a document

•Both predefined & dynamic

•Fields may occur multiple times

•May be indexed and/or stored

Page 57: An Introduction to Solr

Example Fields

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true"/>
<field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>

Page 58: An Introduction to Solr

Copy Fields

•Two main uses

•To analyze a field in two different ways

•To concatenate fields

Page 59: An Introduction to Solr

The Solr Config File

•solrconfig.xml

•Defines request handlers, defaults, caches, etc.

•Read the example solrconfig.xml

Page 60: An Introduction to Solr

Configuring DisMax

•Parameter defaults set in solrconfig.xml

•Can be overridden in each request

•Except for params labeled invariant

Page 61: An Introduction to Solr

DisMax Config Example

<requestHandler name="dismax" class="solr.DisMaxRequestHandler" >
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
    </str>
    <str name="pf">
      text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9
    </str>
    ...
</requestHandler>

Page 62: An Introduction to Solr

DisMax Config Example

<requestHandler name="dismax" class="solr.DisMaxRequestHandler" >
    ...
    <str name="bf">
      ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3
    </str>
    <str name="fl">
      id,name,price,score
    </str>
    ...
</requestHandler>

Page 63: An Introduction to Solr

DisMax Config Example

<requestHandler name="dismax" class="solr.DisMaxRequestHandler" >
    ...
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
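The `mm` (minimum should match) value is the least obvious parameter here. Unescaped it reads `2<-1 5<-2 6<90%`. The sketch below is my reading of how such a spec is evaluated, not Solr's own code: each `n<value` rule applies only when there are more than n optional clauses, a negative value means "all but that many", and percentages are rounded down. The class name is hypothetical.

```java
/** Hypothetical evaluator for a DisMax-style minimum-should-match spec. */
final class MinShouldMatch {
    static int calculate(int clauseCount, String spec) {
        int result = clauseCount; // with no applicable rule, every clause is required
        for (String part : spec.trim().split("\\s+")) {
            if (part.contains("<")) {
                String[] pair = part.split("<", 2);
                int upperBound = Integer.parseInt(pair[0]);
                if (clauseCount <= upperBound) {
                    break; // this rule and all later ones do not apply
                }
                result = apply(clauseCount, pair[1]);
            } else {
                result = apply(clauseCount, part);
            }
        }
        return Math.max(0, Math.min(clauseCount, result));
    }

    private static int apply(int clauseCount, String value) {
        if (value.endsWith("%")) {
            int percent = Integer.parseInt(value.substring(0, value.length() - 1));
            int calc = clauseCount * percent / 100; // integer division truncates
            return calc < 0 ? clauseCount + calc : calc;
        }
        int n = Integer.parseInt(value);
        return n < 0 ? clauseCount + n : n;
    }

    public static void main(String[] args) {
        for (int clauses = 1; clauses <= 10; clauses++) {
            System.out.println(clauses + " clauses: "
                    + calculate(clauses, "2<-1 5<-2 6<90%") + " must match");
        }
    }
}
```

Under this reading, one- and two-word queries require every word, three to five words allow one miss, six words allow two, and longer queries require 90% of the words.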

Page 64: An Introduction to Solr

Wrap Up

Page 65: An Introduction to Solr

Resources

• Solr http://lucene.apache.org/solr/

• wiki, mailing list, jira (bugs/features)

• Lucene http://lucene.apache.org/

Page 66: An Introduction to Solr

Lucene In Action

Page 67: An Introduction to Solr

Building Search Applications with Lucene, lingpipe and Gate

Manu Konchady

Page 68: An Introduction to Solr

Other Presentations

• Yonik Seeley's Solr & Lucene

• http://people.apache.org/~yonik/presentations/

• Slideshare.net

• Search for solr, or search for lucene

Page 69: An Introduction to Solr

Thanks!

Thanks for coming.

Feel free to email me if you have questions about Solr

Tom [email protected]

Page 70: An Introduction to Solr

Extra Slides

Things I didn't have time for in the presentation. Some of them are unfinished.

Page 71: An Introduction to Solr

Search Engines are not the Same as Users

•Search engines have different usage patterns than users

Page 73: An Introduction to Solr

Explain

•Just why did the documents come up in that order?

Page 74: An Introduction to Solr

Data Matters

•GIGO (garbage in, garbage out)

•The better the data is, the better the search will be.

Page 75: An Introduction to Solr

Watch Your Caches

•Just like any other app, check your statistics

•What's the hit rate for your caches?

Page 76: An Introduction to Solr

Setting Up Replication

•Run rsyncd on the master

•Run snapshot on the master at intervals

•Run snappuller on the slaves at (different) intervals.

•Scripts don't print errors!

•Check the logs

•Use bash -xv

Page 77: An Introduction to Solr

Autowarming

•Runs after an update to the index

•Updates flush caches

•Runs some queries to populate caches again

•Can be a problem, with frequent updates

•Don't autowarm master, if updating lots

Page 78: An Introduction to Solr

Tour Of Solr's Web UI

Page 79: An Introduction to Solr

Programming Collective Intelligence

A Really Fun Book

Page 80: An Introduction to Solr

Geographic Searching

•Local Lucene & Local Solr

•http://locallucene.wiki.sourceforge.net

•There's also geolucene, but it's not being actively developed, as far as I can tell.

•http://www.gossamer-threads.com/lists/lucene/java-dev/53378

Page 81: An Introduction to Solr

http://localhost:8983/solr/admin/stats.jsp#update

Are there commits pending?

Page 82: An Introduction to Solr

http://localhost:8983/comix/admin/analysis.jsp?name=text&val=wi-fi

Analysis Explanation