Erik Hatcher's JavaZone '09 slides for "Solr: Search at the Speed of Light"
Solr: Search at the Speed of Light
JavaZone 2009, September 10, Oslo
Erik Hatcher, Lucid Imagination
© 2008-2009 Lucid Imagination, Inc.
Solr History
• Created by Yonik Seeley for CNET
• Contributed to Apache in January 2006
• December 2006: Version 1.1 released
• June 2007: Version 1.2 released
• September 2008: Version 1.3 released
• ~September 2009: Version 1.4
http://lucene.apache.org/solr
Solr: Big Picture
[Diagram: DB, Data, Documents, Solr, Search Results]
Features
• Lucene power exposed over HTTP
• Scalability: caching, replication, distributed search
• Faceting
• And more: spell checking, highlighting, clustering, rich document and DB indexing, "more like this"
Lucene
• Fast, scalable search library
• Lucene index structure
• Index contains documents
• documents have fields
• indexed fields have terms
Inverted Index
• Commonly used search engine data structure
• Efficient lookup of terms across large number of documents
• Usually stores positional information to enable phrase/proximity queries
From "Taming Text" by Grant Ingersoll and Tom Morton
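A minimal sketch of the idea in Python, not Lucene's actual index format: each term maps to the documents that contain it, and token positions are kept so a phrase query can check adjacency. The function names and sample documents are illustrative assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Map term -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, phrase):
    """Return sorted ids of documents where the phrase terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can match the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # Term i of the phrase must sit exactly i positions after the start.
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

# Hypothetical sample documents.
docs = {
    1: "the quick brown fox",
    2: "the brown quick fox",
    3: "a lazy dog",
}
index = build_index(docs)
print(sorted(index["quick"]))               # ids of documents containing "quick"
print(phrase_match(index, "quick brown"))   # ids where the two terms are adjacent
```

Term lookup is a single dictionary access regardless of collection size, which is the property the slide refers to; the stored positions are what make the phrase check possible.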
Analysis Process
Analyzing the analyzer
The quick brown fox jumps over the lazy dog.
Example phrase
WhitespaceAnalyzer
The quick brown fox jumps over the lazy dog.
[The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
Simplest built-in analyzer
SimpleAnalyzer
The quick brown fox jumps over the lazy dog.
[the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
Lowercases, splits at non-letter boundaries
StopAnalyzer
The quick brown fox jumps over the lazy dog.
[quick] [brown] [fox] [jumps] [over] [lazy] [dog]
Lowercases and removes stop words
SnowballAnalyzer
The quick brown fox jumps over the lazy dog.
[the] [quick] [brown] [fox] [jump] [over] [the] [lazi] [dog]
Stemming algorithm
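The first three analyzers can be approximated in a few lines of Python. This is a rough sketch of their behavior on the example phrase, not Lucene's implementation: the stop list is a small illustrative assumption, the letter matching is ASCII-only, and stemming is left out.

```python
import re

PHRASE = "The quick brown fox jumps over the lazy dog."

# Small illustrative stop list (an assumption, not Lucene's full default set).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def whitespace_analyze(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()

def simple_analyze(text):
    """Lowercase, then keep maximal runs of (ASCII) letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stop_analyze(text):
    """Same as simple_analyze, with stop words dropped."""
    return [tok for tok in simple_analyze(text) if tok not in STOP_WORDS]

for analyze in (whitespace_analyze, simple_analyze, stop_analyze):
    print(analyze.__name__, analyze(PHRASE))
```

Note how the whitespace version keeps "The" and "dog." as distinct terms, so a query for "dog" would not match unless the same analysis is applied at query time.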
What's in a token?
Relevance
• Term frequency (TF): number of times a term appears in a document
• Inverse document frequency (IDF): one over the number of documents in which the term appears (1/df)
• Field length normalization: controls the effect that field length, in number of terms, has on the score
• Boost factors: terms, fields, or documents
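A hedged sketch of how these factors might combine into a per-term weight. The damping formulas below (square root for TF, a logarithm for IDF, inverse square root for length) follow the shape of Lucene's classic default similarity as best recalled, so treat the exact formulas as an assumption; the full Lucene score has further factors that are omitted here.

```python
import math

def tf(freq):
    """Term frequency factor: grows with occurrences, but sub-linearly."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    """Field length normalization: matches in shorter fields count for more."""
    return 1.0 / math.sqrt(num_terms)

def term_score(freq, doc_freq, num_docs, num_terms, boost=1.0):
    """Simplified per-term weight: the product of the four factors."""
    return tf(freq) * idf(doc_freq, num_docs) * length_norm(num_terms) * boost

# Hypothetical numbers: a rare term versus a common one in a 1000-document index.
print(term_score(freq=2, doc_freq=1, num_docs=1000, num_terms=10))
print(term_score(freq=2, doc_freq=500, num_docs=1000, num_terms=10))
```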
Lucene Scoring
[Diagram: q1 and d1 separated by angle Θ]
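The labels q1, d1 and Θ suggest the vector space model, where a query and a document are term vectors and the score relates to the cosine of the angle between them. That reading is an assumption about the missing figure. A minimal sketch using raw term counts as vector components:

```python
import math
from collections import Counter

def cosine(query_terms, doc_terms):
    """Cosine of the angle between two bag-of-words term vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

q1 = ["quick", "fox"]
d1 = ["the", "quick", "brown", "fox"]
print(cosine(q1, d1))  # closer to 1.0 means a smaller angle between q1 and d1
```

In a real engine the components would be weighted (for example by TF and IDF) rather than raw counts.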
Solr APIs
• HTTP GET/POST (curl or any other HTTP client)
• JSON
• SolrJ (embedded or HTTP)
• solr-ruby
• python, PHP, solrsharp, XSLT
Solr in Production
[Diagram: incoming search requests reach the Solr masters through a load balancer; shard requests go to shards 1..n, each with a shard master and replicants behind a load balancer]
Getting Started: It's This Easy
1. Start Solr
java -jar start.jar
2. Index your data
java -jar post.jar *.xml
3. Search
http://localhost:8983/solr
Configuration
• schema.xml
• field types and fields
• solrconfig.xml
• request handler mappings
• cache settings: filter, query, document
• warming listeners
• HTTP cache settings
• Lucene index parameters
• plugins: spell checking, highlighting
Solr add/update XML
<add>
<doc>
  <field name="id">MA147LL/A</field>
  <field name="name">Apple 60 GB iPod with Video Playback Black</field>
  <field name="manu">Apple Computer Inc.</field>
  <field name="cat">electronics</field>
  <field name="cat">music</field>
  <field name="features">iTunes, Podcasts, Audiobooks</field>
  <field name="features">Stores up to 15,000 songs, 25,000 photos, or 150 hours of video</field>
  <field name="features">2.5-inch, 320x240 color TFT LCD display with LED backlight</field>
  <field name="features">Up to 20 hours of battery life</field>
  <field name="features">Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video</field>
  <field name="features">Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication</field>
  <field name="includes">earbud headphones, USB cable</field>
  <field name="weight">5.5</field>
  <field name="price">399.00</field>
  <field name="popularity">10</field>
  <field name="inStock">true</field>
</doc>
</add>
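Messages in this format can be generated rather than hand-written. A small sketch using Python's standard library, assuming a document is given as a dict whose values are either a single value or a list for multi-valued fields; `build_add_xml` is a hypothetical helper, not part of Solr.

```python
import xml.etree.ElementTree as ET

def build_add_xml(doc):
    """Serialize one document (field name -> value or list of values) as <add><doc>."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            # A multi-valued field is simply repeated, as with "cat" in the slide.
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")

xml_body = build_add_xml({
    "id": "MA147LL/A",
    "cat": ["electronics", "music"],
    "price": "399.00",
    "inStock": "true",
})
print(xml_body)
```

Going through an XML library rather than string concatenation means characters such as `&` and `<` in field values are escaped automatically.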
Indexing Solr XML
• Via curl:
curl 'http://localhost:8983/solr/update?commit=true' --data-binary @ipod_video.xml -H 'Content-type:text/xml; charset=utf-8'
• Via Solr's Java-based post tool:
java -jar post.jar ipod_video.xml
Indexing CSV
curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'
Content Streams
• Allows Solr server to fetch local or remote data itself. Must enable remote streaming in solrconfig.xml
• http://localhost:8983/solr/update?stream.file=<local Solr path to exampledocs>/ipod_video.xml
• &stream.url=<url to content>
• Security warning: allows Solr to fetch arbitrary server-side file or network URL content
Indexing Rich Documents
curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true&extractOnly=true&wt=ruby&indent=on' -F "[email protected]"
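Indexing with SolrJ
import java.net.URL;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

SolrServer solr = new CommonsHttpSolrServer(new URL("http://localhost:8983/solr"));
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "JAVAZONE_09");
doc.addField("title", "JavaZone 2009 SolrJ Example");
solr.add(doc);
solr.commit();   // after a batch, not per document
solr.optimize(); // periodically, when needed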
Indexing with Ruby
require 'solr'
solr = Solr::Connection.new('http://localhost:8983/solr', :autocommit => :on)
solr.add(:id => 123, :title => 'Solr in Action')
solr.optimize # periodically, as needed
Data Import Handler
• Indexes relational database, XML data sources, e-mail, and more
• Supports full and incremental/delta indexing
• Extensible with custom data sources, transformers, etc
• http://wiki.apache.org/solr/DataImportHandler
DB Indexing
http://localhost:8983/solr/db/dataimport?command=full-import
Example Search Request
• http://localhost:8983/solr/select?q=query
• &start=50
• &rows=25
• &fq=filter+query
• &facet=on&facet.field=category
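The request above can be assembled programmatically so that values are URL-encoded correctly. A sketch using only Python's standard library; `select_url` is a hypothetical helper, and it only builds the URL without contacting a server.

```python
from urllib.parse import urlencode

def select_url(base, q, start=0, rows=10, fq=None, facet_fields=()):
    """Build a /select URL from the parameters shown on the slide."""
    params = [("q", q), ("start", start), ("rows", rows)]
    if fq:
        params.append(("fq", fq))
    if facet_fields:
        params.append(("facet", "on"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", f) for f in facet_fields)
    return base + "/select?" + urlencode(params)

url = select_url(
    "http://localhost:8983/solr",
    "query",
    start=50,
    rows=25,
    fq="filter query",
    facet_fields=["category"],
)
print(url)
```

Here `start` and `rows` give paging (skip 50 hits, return 25), and the space in the filter query is encoded as `+`, matching the slide.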
Debug Query
• &debugQuery=true is your friend
• Includes parsed query, explanations, and search component timings in response
Query Parser
• Controlled by defType parameter
• &defType=lucene (actually a Solr extension of Lucene’s QueryParser)
• &defType=dismax
• Local {!..} override syntax
Solr Query Parser
• http://lucene.apache.org/java/2_4_0/queryparsersyntax.html + Solr extensions
• Kitchen sink parser, includes advanced user-unfriendly syntax
• Syntax errors throw parse exceptions back to client
• Example: title:ipod* AND price:[0 TO 100]
Dismax Query Parser
• Simplified syntax: loose text “quote phrases” -prohibited +required
• Spreads query terms across query fields (qf) with dynamic boosting per field, implicit phrase construction (pf), boosting function (bf), boosting query (bq), and minimum match (mm)
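As a rough illustration, the dismax parameters named above can be put together as a query string. The field names come from the sample iPod document elsewhere in the deck; the boost values and the `dismax_params` helper are assumptions made for this sketch.

```python
from urllib.parse import urlencode, parse_qs

def dismax_params(user_query, qf, pf=None, mm=None):
    """Encode a dismax request: qf maps field name -> boost, pf lists phrase fields."""
    params = {
        "defType": "dismax",
        "q": user_query,
        # qf is a space-separated list of field^boost pairs.
        "qf": " ".join(f"{field}^{boost}" for field, boost in qf.items()),
    }
    if pf:
        params["pf"] = " ".join(pf)
    if mm:
        params["mm"] = mm
    return urlencode(params)

query_string = dismax_params(
    'ipod "video playback" -nano',
    qf={"name": 2.0, "features": 1.0},  # hypothetical boosts
    pf=["name"],
    mm="2",
)
print(query_string)
print(parse_qs(query_string)["qf"])  # decoded back for readability
```

The user's text goes into `q` unchanged; which fields it is searched against, and how strongly each counts, is controlled separately through `qf`.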
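Searching with SolrJ
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrQuery params = new SolrQuery("author:John");
params.setFields("*,score");
params.setRows(3);
QueryResponse response = server.query(params);
for (SolrDocument document : response.getResults()) {
  System.out.println("Doc: " + document);
}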
Searching with Ruby
require 'solr'
conn = Solr::Connection.new('http://localhost:8983/solr')
conn.query('my query') do |hit|
  puts hit.inspect
end
Delete, update, etc.
• Delete:
• <delete><id>05991</id></delete>
• <delete><query>category:Unused</query></delete>
• java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
• Update: simply <add> a document with the same unique key
• Commit: <commit/>
• Optimize: <optimize/>
Faceting
• Counts per subset within results
• Facet on: field terms, queries, date ranges
• &facet=on&facet.field=cat&facet.query=price:[0 TO 100]
• http://wiki.apache.org/solr/SimpleFacetParameters
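Conceptually, a field facet counts how many matching documents carry each value, and a query facet counts how many also satisfy an extra condition. The sketch below computes both over an in-memory result list; it illustrates the meaning of the counts, not how Solr computes them, and the second and third documents are made up.

```python
from collections import Counter

def facet_field(results, field):
    """Count documents per value of a (possibly multi-valued) field."""
    counts = Counter()
    for doc in results:
        value = doc.get(field, [])
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return counts

def facet_query(results, predicate):
    """Count documents in the result set that satisfy an extra condition."""
    return sum(1 for doc in results if predicate(doc))

results = [
    {"id": "MA147LL/A", "cat": ["electronics", "music"], "price": 399.00},
    {"id": "doc2", "cat": ["electronics"], "price": 19.95},  # hypothetical
    {"id": "doc3", "cat": ["music"], "price": 74.99},        # hypothetical
]
print(facet_field(results, "cat"))                          # like facet.field=cat
print(facet_query(results, lambda d: 0 <= d["price"] <= 100))  # like facet.query=price:[0 TO 100]
```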
Spell checking
• Not enabled by default; see the example config to wire it in
• http://localhost:8983/solr/spell?q=epod&spellcheck=on&spellcheck.build=true
• File or index-based dictionaries
• Supports pluggable distance algorithms: Levenshtein and JaroWinkler
• http://wiki.apache.org/solr/SpellCheckComponent
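To make the distance idea concrete, here is a plain Levenshtein edit distance and a naive suggester that picks the closest dictionary word. It is a textbook sketch, not Solr's spell checker, and the three-word dictionary is a made-up example.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

def suggest(word, dictionary):
    """Return the dictionary entry with the smallest edit distance to word."""
    return min(dictionary, key=lambda candidate: levenshtein(word, candidate))

dictionary = ["ipod", "video", "apple"]  # hypothetical dictionary
print(levenshtein("epod", "ipod"))
print(suggest("epod", dictionary))
```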
Highlighting
• http://localhost:8983/solr/select?q=ipod&hl=on&hl.fl=manu,name
• http://wiki.apache.org/solr/HighlightingParameters
More Like This
• http://localhost:8983/solr/select?q=ipod&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score,name
• http://wiki.apache.org/solr/MoreLikeThis
Scaling: Query Throughput
• Replication
• slaves poll master for index updates
• transfers index files from master to slave
• configuration files can also be transferred
• entirely Java/HTTP-based in Solr 1.4 (prior versions used rsync)
Scaling: Collection Size
• Distribution
• Index documents across shards
• query single server with shards parameter
• sends requests to each shard
• aggregates results into a single response
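The aggregation step can be pictured as a merge of per-shard hit lists that are each already sorted by score. This is a simplified sketch of that idea only, with made-up hits; real distributed search also has to merge facets, highlighting and other response sections.

```python
import heapq
import itertools

def merge_shard_results(shard_results, rows):
    """Merge hit lists (each sorted by descending score) and keep the top `rows`."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit["score"], reverse=True)
    return list(itertools.islice(merged, rows))

# Hypothetical responses from two shards.
shard_a = [{"id": "a1", "score": 0.9}, {"id": "a2", "score": 0.4}]
shard_b = [{"id": "b1", "score": 0.7}, {"id": "b2", "score": 0.1}]

top = merge_shard_results([shard_a, shard_b], rows=3)
print([hit["id"] for hit in top])
```

Because each shard returns its hits in score order, the merge never needs to re-sort the full combined list.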
Solr-powered UI
• Solritas (from "celeritas"): VelocityResponseWriter
• easily templated output
• SolrJS: jQuery-based widgets
• see http://solrjs.solrstuff.org/
• Blacklight and Flare: RoR plugins
Lucene in Action, 2nd Edition
http://www.manning.com/lucene
Search at Lucid
http://search.lucidimagination.com/?q=javazone
Lucid Imagination, Inc.
Who We Are
Lucid Imagination is the commercial entity of the Apache Lucene/Solr ecosystem
Lucid team includes the largest collection of Lucene/Solr committers, contributors and influencers
Our mission is to serve as The Center of Excellence for Lucene/Solr-based search solutions
Help our customers get the most out of Lucene/Solr, the world's most widely used open source search software
Lucene, Solr and their logos are trademarks of the Apache Software Foundation
Lucid Imagination, Inc.: Unique Combination of Enterprise Search and Lucene Expertise
Lucid Imagination Technical Team
• Yonik Seeley: Creator of Solr; Lucene/Solr committer, PMC member
• Grant Ingersoll: Creator of "Lucene Boot Camp"; Lucene/Solr committer, Chair, PMC
• Erik Hatcher: Co-author of "Lucene in Action" book; Lucene/Solr committer, PMC member
• Mark Miller: Lucene/Solr committer, PMC member
• Sami Siren: Nutch/Tika committer, PMC member
• Andrzej Bialecki: Lucene/Nutch/Hadoop committer, PMC member
• Doug Cutting (Advisor): Creator of Lucene, Nutch & Hadoop
• Marc Krellenstein: Co-founder, CTO/SVP, Northern Light; VP Search, CTO, Elsevier
• Brian Pinkerton: Developed WebCrawler, the web's first comprehensive search engine; Principal Architect at A9
• Simon Rosenthal: Solutions architect, Northern Light
• Jay Hill: Solutions Architect, Wells Fargo
• Ryan McKinley: Lucene/Solr committer, PMC member
• Chris Hostetter (Advisor): Lucene/Solr committer, PMC member; Member, Apache Software Foundation
Lucid Imagination Business Model
[Diagram labels: Enterprise Version, Certified Distributions, Free, Commercial Support, Training, Consulting, Value add-ons, Upgrade, Free, Download, Lucene, Apache.org]
Thank you
http://www.lucidimagination.com