50
Solr Search at the Speed of Light JavaZone 2009 September 10 Oslo Erik Hatcher, Lucid Imagination [email protected] 1

Solr: Search at the Speed of Light

Embed Size (px)

DESCRIPTION

Erik Hatcher's JavaZone '09 slides for "Solr: Search at the Speed of Light"

Citation preview

Page 1: Solr: Search at the Speed of Light

SolrSearch at the Speed of Light

JavaZone 2009September 10

OsloErik Hatcher, Lucid Imagination

[email protected]

1

Page 2: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

• Created by Yonik Seeley for CNET

• Contributed to Apache in January 2006

• December 2006: Version 1.1 released

• June 2007: Version 1.2 released

• September 2008: Version 1.3 released

• ~September 2009: Version 1.4

Solr History

http://lucene.apache.org/solr

2

Page 3: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Solr

DB

Data

Search Results

DocumentDocumentDocuments

Solr: Big Picture

3

Page 4: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

• Lucene power exposed over HTTP

• Scalability: caching, replication, distributed search

• Faceting

• And more: spell checking, highlighting, clustering, rich document and DB indexing, "more like this"

Features

4

Page 5: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Lucene

• Fast, scalable search library

• Lucene index structure

• Index contains documents

• documents have fields

• indexed fields have terms

5

Page 6: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Inverted Index

• Commonly used search engine data structure

• Efficient lookup of terms across large number of documents

• Usually stores positional information to enable phrase/proximity queries

From "Taming Text" by Grant Ingersoll and Tom Morton

6

Page 7: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Analysis Process

7

Page 8: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Analyzing the analyzer

The quick brown fox jumps over the lazy dog.

Example phrase

8

Page 9: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

WhitespaceAnalyzer

The quick brown fox jumps over the lazy dog.

[The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]

Simplest built-in analyzer

9

Page 10: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

SimpleAnalyzer

the quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]

Lowercases, splits at non-letter boundaries

10

Page 11: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

StopAnalyzer

The quick brown fox jumps over the lazy dog.

[quick] [brown] [fox] [jumps] [over] [lazy] [dog]

Lowercases and removes stop words

11

Page 12: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

SnowballAnalyzer

The quick brown fox jumps over the lazi dog.

[the] [quick] [brown] [fox] [jump] [over] [the] [lazi] [dog]

Stemming algorithm

12

Page 13: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

What's in a token?

13

Page 14: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Relevance

• Term frequency (TF): number of times a term appears in a document

• Inverse document frequency (IDF): One over number of times term appears in the index (1/df)

• Field length normalization: control affect field length, in number of terms, has on score

• Boost factors: terms, fields, or documents

14

Page 15: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

q1

d1

Θ

Lucene Scoring

15

Page 16: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

• HTTP GET/POST (curl or any other HTTP client)

• JSON

• SolrJ (embedded or HTTP)

• solr-ruby

• python, PHP, solrsharp, XSLT

Solr APIs

16

Page 17: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Solr in ProductionIncoming Search

Requests

Load Balancer

Solr MasterSolr Master

Solr

ShardShard Master

ReplicantReplicant

ReplicantReplicant

Load Balancer

Shard Request Shard Request

1..nshards

ShardShard Master

ReplicantReplicant

ReplicantReplicant

Load Balancer

17

Page 18: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Getting Started:It's This Easy

1.Start Solr

java -jar start.jar

2.Index your data

java -jar post.jar *.xml

3.Search

http://localhost:8983/solr18

Page 19: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Configuration• schema.xml

• field types and fields

• solrconfig.xml

• request handler mappings

• cache settings: filter, query, document

• warming listeners

• HTTP cache settings

• Lucene index parameters

• plugins: spell checking, highlighting

19

Page 20: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

<add><doc> <field name="id">MA147LL/A</field> <field name="name">Apple 60 GB iPod with Video Playback Black</field> <field name="manu">Apple Computer Inc.</field> <field name="cat">electronics</field> <field name="cat">music</field> <field name="features">iTunes, Podcasts, Audiobooks</field> <field name="features">Stores up to 15,000 songs, 25,000 photos, or 150 hours of video</field> <field name="features">2.5-inch, 320x240 color TFT LCD display with LED backlight</field> <field name="features">Up to 20 hours of battery life</field> <field name="features">Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video</field> <field name="features">Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication</field> <field name="includes">earbud headphones, USB cable</field> <field name="weight">5.5</field> <field name="price">399.00</field> <field name="popularity">10</field> <field name="inStock">true</field></doc></add>

Solr add/update XML

20

Page 21: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Indexing Solr XML

• Via curl:curl 'http://localhost:8983/solr/update?commit=true' --data-binary @ipod_video.xml -H 'Content-type:text/xml; charset=utf-8'

• Via Solr's Java-based post tool:java -jar post.jar ipod_video.xml

21

Page 22: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'

Indexing CSV

22

Page 23: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Content Streams

• Allows Solr server to fetch local or remote data itself. Must enable remote streaming in solrconfig.xml

• http://localhost:8983/solr/update?stream.file=<local Solr path to exampledocs>/ipod_video.xml

• &stream.url=<url to content>

• Security warning: allows Solr to fetch arbitrary server-side file or network URL content

23

Page 24: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Indexing Rich Documents

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true&extractOnly=true&wt=ruby&indent=on' -F "[email protected]"

24

Page 25: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

SolrServer solr = new CommonsHttpSolrServer(new URL("http://localhost:8983/solr"));

SolrInputDocument doc = new SolrInputDocument();doc.addField("id", "JAVAZONE_09");doc.addField("title", "JavaZone 2009 SolrJ Example"); solr.add(doc);solr.commit(); // after a batch, not per documentsolr.optimize(); // periodically, when needed

Indexing with SolrJ

25

Page 26: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

solr = Connection.new( 'http://localhost:8983/solr', :autocommit => :on)

solr.add(:id => 123, :title => 'Solr in Action')

solr.optimize # periodically, as needed

Indexing with Ruby

26

Page 27: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Data Import Handler

• Indexes relational database, XML data sources, e-mail, and more

• Supports full and incremental/delta indexing

• Extensible with custom data sources, transformers, etc

• http://wiki.apache.org/solr/DataImportHandler

27

Page 28: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

DB Indexing

http://localhost:8983/solr/db/dataimport?command=full-import

28

Page 29: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

• http://localhost:8983/solr/select?q=query

• &start=50

• &rows=25

• &fq=filter+query

• &facet=on&facet.field=category

Example Search Request

29

Page 30: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Debug Query

• &debugQuery=true is your friend

• Includes parsed query, explanations, and search component timings in response

30

Page 31: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Query Parser

• Controlled by defType parameter

• &defType=lucene (actually a Solr extension of Lucene’s QueryParser)

• &defType=dismax

• Local {!..} override syntax

31

Page 32: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Solr Query Parser

• http://lucene.apache.org/java/2_4_0/queryparsersyntax.html + Solr extensions

• Kitchen sink parser, includes advanced user-unfriendly syntax

• Syntax errors throw parse exceptions back to client

• Example: title:ipod* AND price:[0 TO 100]

32

Page 33: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Dismax Query Parser

• Simplified syntax: loose text “quote phrases” -prohibited +required

• Spreads query terms across query fields (qf) with dynamic boosting per field, implicit phrase construction (pf), boosting function (bf), boosting query (bq), and minimum match (mm)

33

Page 34: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

SolrQuery params = new SolrQuery("author:John");params.setFields("*,score");params.setRows(3);QueryResponse response = server.query(params);for (SolrDocument document : response.getResults()) { System.out.println("Doc: " + document);}

Searching with SolrJ

34

Page 35: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

conn = Connection.new( 'http://localhost:8983/solr')

conn.query('my query') do |hit| puts hit.inspectend

Searching with Ruby

35

Page 36: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

• Delete:

• <delete><id>05991</id></delete>

• <delete> <query>category:Unused</query></delete>

• java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"

• Update: simply <add> doc with same unique key

• Commit: <commit/>

• Optimize: <optimize/>

delete, update, etc

36

Page 37: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

• Counts per subset within results

• Facet on: field terms, queries, date ranges

• &facet=on&facet.field=cat&facet.query=price:[0 TO 100]

• http://wiki.apache.org/solr/SimpleFacetParameters

Faceting

37

Page 38: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

• Not enabled by default, see example config to wire it in

• http://localhost:8983/solr/spell?q=epod&spellcheck=on&spellcheck.build=true

• File or index-based dictionaries

• Supports pluggable distance algorithms: Levenstein and JaroWinkler

• http://wiki.apache.org/solr/SpellCheckComponent

Spell checking

38

Page 39: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

• http://localhost:8983/solr/select?q=ipod&hl=on&hl.fl=manu,name

• http://wiki.apache.org/solr/HighlightingParameters

Highlighting

39

Page 40: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

More Like This

• http://localhost:8983/solr/select?q=ipod&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score,name

• http://wiki.apache.org/solr/MoreLikeThis

40

Page 41: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Scaling: Query Throughput

• Replication

• slaves poll master for index updates

• transfers index files from master to slave

• configuration files can also be transferred

• entirely Java/HTTP-based in Solr 1.4 (prior versions used rsync)

41

Page 42: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Scaling: Collection Size

• Distribution

• Index documents across shards

• query single server with shards parameter

• sends requests to each shard

• aggregates result to a single response

42

Page 43: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Solr-powered UI

• Solritas (from "celeritas"): VelocityResponseWriter

• easily templated output

• SolrJS: jQuery-based widgets

• see http://solrjs.solrstuff.org/

• Blacklight and Flare: RoR plugins

43

Page 44: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Lucene in Action, 2nd Edition

http://www.manning.com/lucene

44

Page 45: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Search at Lucidhttp://search.lucidimagination.com/?q=javazone

45

Page 46: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.!"#$%&'()*$+),$-+.&'+#/

!"#$%&'()*$+),$-+&$0&,12&#-((23#$)4&2+,$,5&-6 78)#12&

!"#2+29:-43&2#-050,2(

!"#$%&,2)(&$+#4"%20&,12&4)3*20,&#-442#,$-+&-6&

!"#2+29:-43&#-(($,,230.&#-+,3$;",-30&)+%&$+64"2+#230&

<"3&($00$-+&$0&,-&023=2&)0&!"#$%#&'#($)*$+,-#..#&-#$6-3&

!"#2+29:-43>;)02%&02)3#1&0-4",$-+0

?248&-"3&#"0,-(230&*2,&,12&(-0,&-",&-6&!"#2+29:-43&> !"#$%&'(

(-0,&@$%245&"02%&-82+&0-"3#2&02)3#1&0-6,@)32&&&

A&BCCD>BCCE

/")$/#$0(#

!"#$%$&'()*+',%-'./$0+'*)1)2',+$'.+,-$3,+42')5'./$'67,#/$'()5.8,+$'9)"%-,.0)%

46

Page 47: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.!"#$%&'()*$+),$-+.&'+#/

Unique

Combination of

Enterprise Search

and Lucene

Expertise

!"#$%&'()*$+),$-+&./#0+$#)1&./)(

! 2-+$3&4//1/56

012),-1&-3&4-51&&

!"#2+264-51&#-(($,,21.&780&(2(921

! 78)+,&'+*/89-116

!"#$%&"'&(')*+,#-#'.&&%'!$/01

!"#2+264-51&#-(($,,21.&0:)$1.&780

! :8$3&;),#0/86

0-;$+%2&"'&(')*+,#-#'3-'4,%3&-1'5&&6

!"#2+264-51&#-(($,,21.&780&(2(921

! <)83&<$11/8

!"#2+264-51&#-(($,,21.&780&(2(921

! 4)($&4$8/+

<",#:6=$>)&#-(($,,21.&780&(2(921

! =+%8>/?&@$1)1/#3$&

!"#2+26<",#:6?)%--@&#-(($,,21.&780&(2(921&

! A-"*&B",,$+*6&C=%D$9-8E

012),-1&-3&!"#2+2.&<",#:&A&?)%--@

B&CDDE;CDDF

! <)8#&F8/11/+9,/$+6

0-;3-"+%21.&0=G64H7.&<-1,:21+&!$*:,

H7&42)1#:.&0=G.&I5J2K$21

! @8$)+&G$+3/8,-+6

L2K25-@2%&M2901)N521.&,:2&N29OJ&3$1J,&#-(@12:2+J$K2&J2)1#:&2+*$+2&

71$+#$@)5&P1#:$,2#,&),&PF

! 4$(-+&H-9/+,0)16

4-5",$-+J&)1#:$,2#,.&<-1,:21+&!$*:,

! I)5&;$116

4-5",$-+J&P1#:$,2#,.&M255J&Q)1*-

! H5)+&<#F$+1/56

!"#2+264-51&#-(($,,21.&&780&(2(921

! B08$9&;-9,/,,/86&C=%D$9-8E

!"#2+264-51&#-(($,,21.&&780&(2(921

82(921&P@)#:2&4-3,N)12&Q-"+%),$-+

47

Page 48: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

!"#$%&'()*$+),$-+&."/$+0//&1-%02

!"#$$%&#$$'

()*+,-,./+"0+,/.1)2+,*.3.+4"5./*,.67*.1)/

& 8,++"&

2199+,:.;<""=7--1,*>" ?,;.).)@>" 21)/7<*.)@"

3)2"04)%%&567

89*:)%0

;:00

<-=+2-)%

!"#0+0

A7:.4"B9;@.);*.1) 21)3.4+)*.;<

>9)#?0@-:*

!"#$%$&'()*+',%-'./$0+'*)1)2',+$'.+,-$3,+42')5'./$'67,#/$'()5.8,+$'9)"%-,.0)%

48

Page 49: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.

Thank you

http://www.lucidimagination.com49

Page 50: Solr: Search at the Speed of Light

© 2008-2009 Lucid Imagination, Inc.50