Upload
erik-hatcher
View
103
Download
0
Embed Size (px)
DESCRIPTION
"Solr Powered Lucene", presented to the Northern VA Java Users Group on December 16, 2009
Citation preview
Solr Powered Lucene
Erik HatcherLucid Imagination
Northern Virginia Java Users GroupDecember 16, 2009
1
Erik Hatcher
• Member of Technical Staff, Lucid Imagination
• Apache Lucene/Solr Committer
• Member, Apache Software Foundation
• Co-author, Lucene in Action and Java Development with Ant (Manning)
2
A word from our sponsor...
• Pizza!
• Lucid Imagination
• commercial entity exclusively dedicated to Apache Lucene/Solr open source search technology
• Services: Technical Support, Expert Link, Training, Consulting
• Tools: LucidGaze for Lucene and Solr
• Free certified distributions of Solr and Lucene
• more to come...
3
What is Solr?
• Search server
• Built upon Apache Lucene (Java)
• Fast, very
• Scalable, query load and collection size
• Interoperable
• Extensible
4
5
Solr Example• Start Solr
• java -jar start.jar (Apache Solr distro)
• lucidworks start (Lucid certified distro)
• java -jar post.jar *.xml
• HTML view (via Solritas)
• easily enabled in Apache distro
• built-in to Lucid certified distro
6
• Created by Yonik Seeley for CNET
• Contributed to Apache in January 2006
• December 2006: Version 1.
• June 2007: Version 1.2
• September 2008: Version 1.3
• November 2009: Version 1.4
Solr History
7
• Lucene power exposed over HTTP
• Spell checking, highlighting, more-like-this
• Caching
• Replication
• Faceting
• Distributed search
Features
8
• HTTP GET/POST (curl or any other HTTP client)
• JSON
• SolrJ (embedded or HTTP)
• Ruby: solr-ruby, RSolr, etc
• Many others: python, PHP, solrsharp, XSLT
Solr APIs
9
Deployment Architecture
• Scales from:
• single Solr server
• master/replicants(slaves)
• distributed shards
• Each Solr instance can also have multiple cores
10
Lucene Fundamentals
11
• Index
• Document
• Field
• Terms (aka Tokens)
Concepts
12
Inverted Index
• Commonly used search engine data structure
• Efficient lookup of terms across large number of documents
• Usually stores positional information to enable phrase/proximity queries
From "Taming Text" by Grant Ingersoll and Tom Morton
13
14
What's in a token?
15
Lucene Scoring
q1
d1
Θ
16
Relevance
• Term frequency (TF): number of times a term appears in a document
• Inverse document frequency (IDF): One over number of times term appears in the index (1/df)
• Field length normalization: control affect field length, in number of terms, has on score
• Boost factors: terms, fields, or documents
17
Solr Core
• single primary index
• schema.xml / solrconfig.xml
• (optionally) multiple cores per Solr instance, configured in solr.xml
• other configuration and data files
18
• Field types
• Fields
• Unique key (optional*)* I've never seen a case that didn't require a unique identifier per document
• copy fields
• similarity and Solr query parser configuration
schema.xml
19
Schema Analysis
• http://localhost:8983/solr/admin/analysis.jsp
• Document analysis request handler:curl http://localhost:8983/solr/analysis/document --data-binary @ipod_video.xml -H 'Content-type:text/xml; charset=utf-8'
• Field analysis request handler:http://localhost:8983/solr/analysis/field?analysis.fieldtype=text&analysis.fieldvalue=Foo%20Bar&q=foo&analysis.showmatch=true
20
• Lucene indexing parameters
• Cache settings
• Request handler configuration
• HTTP cache settings
• Search components, response writers, query parsers
solrconfig.xml
21
• mini-“servlets”
• SearchHandler extensions chain search components
• Flexible response formatting:
• &wt=[json, ruby, xslt, php, phps, javabin, python,velocity]
Request handlers
22
<add><doc> <field name="id">MA147LL/A</field> <field name="name">Apple 60 GB iPod with Video Playback Black</field> <field name="manu">Apple Computer Inc.</field> <field name="cat">electronics</field> <field name="cat">music</field> <field name="features">iTunes, Podcasts, Audiobooks</field> <field name="features">Stores up to 15,000 songs, 25,000 photos, or 150 hours of video</field> <field name="features">2.5-inch, 320x240 color TFT LCD display with LED backlight</field> <field name="features">Up to 20 hours of battery life</field> <field name="features">Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video</field> <field name="features">Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication</field> <field name="includes">earbud headphones, USB cable</field> <field name="weight">5.5</field> <field name="price">399.00</field> <field name="popularity">10</field> <field name="inStock">true</field></doc></add>
Solr XML
23
Indexing Solr XML
• Via curl:curl 'http://localhost:8983/solr/update?commit=true' --data-binary @ipod_video.xml -H 'Content-type:text/xml; charset=utf-8'
• Via Solr's Java-based post tool:java -jar post.jar ipod_video.xml
24
curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'
Indexing CSV
25
Content Streams
• Allows Solr server to fetch local or remote data itself. Must enable remote streaming in solrconfig.xml
• http://localhost:8983/solr/update
• ?stream.file=<local Solr path to exampledocs>/ipod_video.xml
• ?stream.url=<url to content>
• Security warning: allows Solr to fetch arbitrary server-side file or network URL content
26
SolrServer solr = new CommonsHttpSolrServer( new URL("http://localhost:8983/solr"));
SolrInputDocument doc = new SolrInputDocument();doc.addField("id", "EXAMPLEDOC01");doc.addField("title", "NOVAJUG SolrJ Example"); solr.add(doc);
solr.commit(); // after a batch, not per document
solr.optimize(); // periodically, if/when needed
Indexing with SolrJ
27
solr = Connection.new( 'http://localhost:8983/solr', :autocommit => :on
solr.add(:id => 123, :title => 'Solr in Action')
solr.optimize # periodically, as needed
Indexing with solr-ruby
28
• Delete:
• <delete><id>05991</id></delete>
• <delete> <query>category:Unused</query></delete>
• java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
• Update: simply <add> doc with same unique key
• <commit/> pending documents
• <optimize/> index, squeezes out deleted documents, collapses segments
• <rollback/> to last commit point
delete, update, etc
Update commands via GET: http://localhost:8983/solr/update?stream.body=<commit/>
29
Data Import Handler
• Indexes relational database, XML data, and e-mail sources
• Supports full and incremental/delta indexing
• Highly extensible with custom data sources, transformers, etc
• http://wiki.apache.org/solr/DataImportHandler
30
DIH details
• Put JDBC driver JAR in <solr-home>/lib, configure dataimport request handler
• http://localhost:8983/solr/db/admin/dataimport.jsp - debugging console
• http://localhost:8983/solr/db/dataimport?command=full-import - removes all documents and imports from scratch
31
Solr Cell
• aka ExtractingRequestHandler
• leveraging Tika, extracts and indexes rich documents such as Word, PDF, HTML, and many other types
• curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "[email protected]"
• http://wiki.apache.org/solr/ExtractingRequestHandler
32
• http://localhost:8983/solr/select?q=query
Standard Search Request
33
Debug Query
• &debugQuery=true is your friend
• Includes parsed query, explanations, and search component timings in response
34
• Send GET HTTP requests
• http://localhost:8983/solr/select?q=solr&start=0&rows=10&fl=id,name
• start: zero-based starting result
• rows: number of hits to return
• fl: list of stored fields to return
Searching
35
Query Parser
• Controlled by defType parameter
• &defType=lucene (actually a Solr extension of Lucene’s QueryParser)
• &defType=dismax
• Local {!...} override syntax
36
Solr Query Parser
• http://lucene.apache.org/java/2_9_1/queryparsersyntax.html+ Solr extensions
• Kitchen sink parser, includes advanced user-unfriendly syntax
• Syntax errors throw parse exceptions back to client
• Example: title:ipod* AND price:[0 TO 100]
• http://wiki.apache.org/solr/SolrQuerySyntax
37
Dismax Query Parser
• Simplified syntax: loose text “quote phrases” -prohibited +required
• Spreads query terms across query fields (qf) with dynamic boosting per field, phrase construction (pf), and boosting query and function capabilities (bq and bf)
38
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrQuery params = new SolrQuery("author:John");params.setFields("*,score");params.setRows(3);
QueryResponse response = server.query(params);
for (SolrDocument document : response.getResults()) { System.out.println("Doc: " + document);}
Searching with SolrJ
39
conn = Connection.new( 'http://localhost:8983/solr')
conn.query('my query') do |hit| puts hit.inspectend
Searching with Ruby
40
Built-in search components
• Standard: query, facet, mlt, highlight, stats, debug
• Others: elevation, clustering, term, term vector
41
• Counts per subset within results
• Facet on: field terms, queries, date ranges
• &facet=on&facet.field=cat&facet.query=price:[0 TO 100]
• http://wiki.apache.org/solr/SimpleFacetParameters
Faceting
42
• http://localhost:8983/solr/spell?q=epod&spellcheck=on&spellcheck.build=true
• File or index-based dictionaries
• Supports pluggable distance algorithms: Levenstein and JaroWinkler
• http://wiki.apache.org/solr/
Spell checking
43
• http://localhost:8983/solr/select?q=apple&hl=on&hl.fl=*
• http://wiki.apache.org/solr/HighlightingParameters
Highlighting
44
More Like This
• http://localhost:8983/solr/select?q=ipod&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score,name
• http://wiki.apache.org/solr/MoreLikeThis
45
Query Elevation
• http://localhost:8983/solr/elevate?q=ipod&debugQuery=true&enableElevation=true
• Configure an “elevate.xml” to boost/exclude specific documents
• http://wiki.apache.org/solr/QueryElevationComponent
46
Clustering
• Dynamic grouping of documents into labeled sets
• http://localhost:8983/solr/clustering?q=*:*&rows=10
• http://wiki.apache.org/solr/ClusteringComponent
• Requires additional steps to install (see documentation) with Apache Solr distro
47
Terms
• Enumerates terms from specified fields
• http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=vi
48
Term Vectors
• Details term vector information: term frequency, document frequency, position and offset information
• http://localhost:8983/solr/select/?q=*%3A*&qt=tvrh&tv=true&tv.all=true
49
stats.jsp
• Not technically a “request handler”, outputs only XML
• http://localhost:8983/solr/admin/stats.jsp
• Index stats such as number of documents, searcher open time
• Request handler details, number of requests and errors, average request time,
50
Replication
• Master is polled
• Replicant pulls Lucene index and optionally also Solr configuration files
• Query throughput scaling: replicate and load balance
• http://wiki.apache.org/solr/SolrReplication
51
Distributed Search
• Distribute documents to same-schema shards
• Scaling for when single index becomes too large, or a single query becomes too slow
• http://wiki.apache.org/solr/DistributedSearch
52
What’s new in Solr 1.4?• Java-based replication
• VelocityResponseWriter (Solritas)
• AJAX-Solr
• Logging switched to SLF4J
• Rollback, since last commit
• StatsComponent
• TermVectorComponent
• Configurable Directory provider
53
Lucene 2.9
• IndexReader#reopen()
• Faster filter performance, by 300% in some cases
• Per-segment FieldCache
• Reusable token streams
• Faster numeric/date range queries, thanks to trie
• and tons more, see Lucene 2.9's CHANGES.txt
54
Performance Improvements
• Caching
• Concurrent file access
• Per-segment index updates
• Faceting
• DocSet generation, avoids scoring
• Streaming updates for SolrJ
55
Feature Improvements• Rich document
indexing
• DataImportHandler enhancements
• Smoother replication
• More choices for logging
• Multi-select faceting
• Speedier range queries
• Duplicate detection
• New request handler components
56
• http://wiki.apache.org/solr
• Lucid Imagination
• http://www.lucidimagination.com
• Articles, webinars, blogs, and...
• Search the Lucene ecosystem at:http://search.lucidimagination.com
Resources
57
Shout Out
58
e-book now available!print coming soon
http://www.manning.com/lucene
59
LucidWorks for Solr• Certified Distribution
• Value-added integration
• KStemmer
• Carrot2 clustering
• LucidGaze for Solr
• installer
• Reference Manual
• Solr 1.4 certified distro coming soon!
60
LucidGaze for Solr
• Monitoring tool, captures, stores, and interactively views Solr performance metrics
• requests/second
• time/request
61
62
LucidFind
http://search.lucidimagination.com/?q=novajug
63
64