64
Solr Powered Lucene Erik Hatcher Lucid Imagination [email protected] Northern Virginia Java Users Group December 16, 2009 1

Solr Powered Lucene

Embed Size (px)

DESCRIPTION

"Solr Powered Lucene", presented to the Northern VA Java Users Group on December 16, 2009

Citation preview

Page 1: Solr Powered Lucene

Solr Powered Lucene

Erik HatcherLucid Imagination

[email protected]

Northern Virginia Java Users GroupDecember 16, 2009

1

Page 2: Solr Powered Lucene

Erik Hatcher

• Member of Technical Staff, Lucid Imagination

• Apache Lucene/Solr Committer

• Member, Apache Software Foundation

• Co-author, Lucene in Action and Java Development with Ant (Manning)

2

Page 3: Solr Powered Lucene

A word from our sponsor...

• Pizza!

• Lucid Imagination

• commercial entity exclusively dedicated to Apache Lucene/Solr open source search technology

• Services: Technical Support, Expert Link, Training, Consulting

• Tools: LucidGaze for Lucene and Solr

• Free certified distributions of Solr and Lucene

• more to come...

3

Page 4: Solr Powered Lucene

What is Solr?

• Search server

• Built upon Apache Lucene (Java)

• Fast, very

• Scalable, query load and collection size

• Interoperable

• Extensible

4

Page 5: Solr Powered Lucene

5

Page 6: Solr Powered Lucene

Solr Example• Start Solr

• java -jar start.jar (Apache Solr distro)

• lucidworks start (Lucid certified distro)

• java -jar post.jar *.xml

• HTML view (via Solritas)

• easily enabled in Apache distro

• built-in to Lucid certified distro

6

Page 7: Solr Powered Lucene

• Created by Yonik Seeley for CNET

• Contributed to Apache in January 2006

• December 2006: Version 1.

• June 2007: Version 1.2

• September 2008: Version 1.3

• November 2009: Version 1.4

Solr History

7

Page 8: Solr Powered Lucene

• Lucene power exposed over HTTP

• Spell checking, highlighting, more-like-this

• Caching

• Replication

• Faceting

• Distributed search

Features

8

Page 9: Solr Powered Lucene

• HTTP GET/POST (curl or any other HTTP client)

• JSON

• SolrJ (embedded or HTTP)

• Ruby: solr-ruby, RSolr, etc

• Many others: python, PHP, solrsharp, XSLT

Solr APIs

9

Page 10: Solr Powered Lucene

Deployment Architecture

• Scales from:

• single Solr server

• master/replicants(slaves)

• distributed shards

• Each Solr instance can also have multiple cores

10

Page 11: Solr Powered Lucene

Lucene Fundamentals

11

Page 12: Solr Powered Lucene

• Index

• Document

• Field

• Terms (aka Tokens)

Concepts

12

Page 13: Solr Powered Lucene

Inverted Index

• Commonly used search engine data structure

• Efficient lookup of terms across large number of documents

• Usually stores positional information to enable phrase/proximity queries

From "Taming Text" by Grant Ingersoll and Tom Morton

13

Page 14: Solr Powered Lucene

14

Page 15: Solr Powered Lucene

What's in a token?

15

Page 16: Solr Powered Lucene

Lucene Scoring

q1

d1

Θ

16

Page 17: Solr Powered Lucene

Relevance

• Term frequency (TF): number of times a term appears in a document

• Inverse document frequency (IDF): One over number of times term appears in the index (1/df)

• Field length normalization: control affect field length, in number of terms, has on score

• Boost factors: terms, fields, or documents

17

Page 18: Solr Powered Lucene

Solr Core

• single primary index

• schema.xml / solrconfig.xml

• (optionally) multiple cores per Solr instance, configured in solr.xml

• other configuration and data files

18

Page 19: Solr Powered Lucene

• Field types

• Fields

• Unique key (optional*)* I've never seen a case that didn't require a unique identifier per document

• copy fields

• similarity and Solr query parser configuration

schema.xml

19

Page 20: Solr Powered Lucene

Schema Analysis

• http://localhost:8983/solr/admin/analysis.jsp

• Document analysis request handler:curl http://localhost:8983/solr/analysis/document --data-binary @ipod_video.xml -H 'Content-type:text/xml; charset=utf-8'

• Field analysis request handler:http://localhost:8983/solr/analysis/field?analysis.fieldtype=text&analysis.fieldvalue=Foo%20Bar&q=foo&analysis.showmatch=true

20

Page 21: Solr Powered Lucene

• Lucene indexing parameters

• Cache settings

• Request handler configuration

• HTTP cache settings

• Search components, response writers, query parsers

solrconfig.xml

21

Page 22: Solr Powered Lucene

• mini-“servlets”

• SearchHandler extensions chain search components

• Flexible response formatting:

• &wt=[json, ruby, xslt, php, phps, javabin, python,velocity]

Request handlers

22

Page 23: Solr Powered Lucene

<add><doc> <field name="id">MA147LL/A</field> <field name="name">Apple 60 GB iPod with Video Playback Black</field> <field name="manu">Apple Computer Inc.</field> <field name="cat">electronics</field> <field name="cat">music</field> <field name="features">iTunes, Podcasts, Audiobooks</field> <field name="features">Stores up to 15,000 songs, 25,000 photos, or 150 hours of video</field> <field name="features">2.5-inch, 320x240 color TFT LCD display with LED backlight</field> <field name="features">Up to 20 hours of battery life</field> <field name="features">Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video</field> <field name="features">Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication</field> <field name="includes">earbud headphones, USB cable</field> <field name="weight">5.5</field> <field name="price">399.00</field> <field name="popularity">10</field> <field name="inStock">true</field></doc></add>

Solr XML

23

Page 24: Solr Powered Lucene

Indexing Solr XML

• Via curl:curl 'http://localhost:8983/solr/update?commit=true' --data-binary @ipod_video.xml -H 'Content-type:text/xml; charset=utf-8'

• Via Solr's Java-based post tool:java -jar post.jar ipod_video.xml

24

Page 25: Solr Powered Lucene

curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'

Indexing CSV

25

Page 26: Solr Powered Lucene

Content Streams

• Allows Solr server to fetch local or remote data itself. Must enable remote streaming in solrconfig.xml

• http://localhost:8983/solr/update

• ?stream.file=<local Solr path to exampledocs>/ipod_video.xml

• ?stream.url=<url to content>

• Security warning: allows Solr to fetch arbitrary server-side file or network URL content

26

Page 27: Solr Powered Lucene

SolrServer solr = new CommonsHttpSolrServer( new URL("http://localhost:8983/solr"));

SolrInputDocument doc = new SolrInputDocument();doc.addField("id", "EXAMPLEDOC01");doc.addField("title", "NOVAJUG SolrJ Example"); solr.add(doc);

solr.commit(); // after a batch, not per document

solr.optimize(); // periodically, if/when needed

Indexing with SolrJ

27

Page 28: Solr Powered Lucene

solr = Connection.new( 'http://localhost:8983/solr', :autocommit => :on

solr.add(:id => 123, :title => 'Solr in Action')

solr.optimize # periodically, as needed

Indexing with solr-ruby

28

Page 29: Solr Powered Lucene

• Delete:

• <delete><id>05991</id></delete>

• <delete> <query>category:Unused</query></delete>

• java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"

• Update: simply <add> doc with same unique key

• <commit/> pending documents

• <optimize/> index, squeezes out deleted documents, collapses segments

• <rollback/> to last commit point

delete, update, etc

Update commands via GET: http://localhost:8983/solr/update?stream.body=<commit/>

29

Page 30: Solr Powered Lucene

Data Import Handler

• Indexes relational database, XML data, and e-mail sources

• Supports full and incremental/delta indexing

• Highly extensible with custom data sources, transformers, etc

• http://wiki.apache.org/solr/DataImportHandler

30

Page 31: Solr Powered Lucene

DIH details

• Put JDBC driver JAR in <solr-home>/lib, configure dataimport request handler

• http://localhost:8983/solr/db/admin/dataimport.jsp - debugging console

• http://localhost:8983/solr/db/dataimport?command=full-import - removes all documents and imports from scratch

31

Page 32: Solr Powered Lucene

Solr Cell

• aka ExtractingRequestHandler

• leveraging Tika, extracts and indexes rich documents such as Word, PDF, HTML, and many other types

• curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "[email protected]"

• http://wiki.apache.org/solr/ExtractingRequestHandler

32

Page 33: Solr Powered Lucene

• http://localhost:8983/solr/select?q=query

Standard Search Request

33

Page 34: Solr Powered Lucene

Debug Query

• &debugQuery=true is your friend

• Includes parsed query, explanations, and search component timings in response

34

Page 35: Solr Powered Lucene

• Send GET HTTP requests

• http://localhost:8983/solr/select?q=solr&start=0&rows=10&fl=id,name

• start: zero-based starting result

• rows: number of hits to return

• fl: list of stored fields to return

Searching

35

Page 36: Solr Powered Lucene

Query Parser

• Controlled by defType parameter

• &defType=lucene (actually a Solr extension of Lucene’s QueryParser)

• &defType=dismax

• Local {!...} override syntax

36

Page 37: Solr Powered Lucene

Solr Query Parser

• http://lucene.apache.org/java/2_9_1/queryparsersyntax.html+ Solr extensions

• Kitchen sink parser, includes advanced user-unfriendly syntax

• Syntax errors throw parse exceptions back to client

• Example: title:ipod* AND price:[0 TO 100]

• http://wiki.apache.org/solr/SolrQuerySyntax

37

Page 38: Solr Powered Lucene

Dismax Query Parser

• Simplified syntax: loose text “quote phrases” -prohibited +required

• Spreads query terms across query fields (qf) with dynamic boosting per field, phrase construction (pf), and boosting query and function capabilities (bq and bf)

38

Page 39: Solr Powered Lucene

SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

SolrQuery params = new SolrQuery("author:John");params.setFields("*,score");params.setRows(3);

QueryResponse response = server.query(params);

for (SolrDocument document : response.getResults()) { System.out.println("Doc: " + document);}

Searching with SolrJ

39

Page 40: Solr Powered Lucene

conn = Connection.new( 'http://localhost:8983/solr')

conn.query('my query') do |hit| puts hit.inspectend

Searching with Ruby

40

Page 41: Solr Powered Lucene

Built-in search components

• Standard: query, facet, mlt, highlight, stats, debug

• Others: elevation, clustering, term, term vector

41

Page 42: Solr Powered Lucene

• Counts per subset within results

• Facet on: field terms, queries, date ranges

• &facet=on&facet.field=cat&facet.query=price:[0 TO 100]

• http://wiki.apache.org/solr/SimpleFacetParameters

Faceting

42

Page 43: Solr Powered Lucene

• http://localhost:8983/solr/spell?q=epod&spellcheck=on&spellcheck.build=true

• File or index-based dictionaries

• Supports pluggable distance algorithms: Levenstein and JaroWinkler

• http://wiki.apache.org/solr/

Spell checking

43

Page 44: Solr Powered Lucene

• http://localhost:8983/solr/select?q=apple&hl=on&hl.fl=*

• http://wiki.apache.org/solr/HighlightingParameters

Highlighting

44

Page 45: Solr Powered Lucene

More Like This

• http://localhost:8983/solr/select?q=ipod&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score,name

• http://wiki.apache.org/solr/MoreLikeThis

45

Page 46: Solr Powered Lucene

Query Elevation

• http://localhost:8983/solr/elevate?q=ipod&debugQuery=true&enableElevation=true

• Configure an “elevate.xml” to boost/exclude specific documents

• http://wiki.apache.org/solr/QueryElevationComponent

46

Page 47: Solr Powered Lucene

Clustering

• Dynamic grouping of documents into labeled sets

• http://localhost:8983/solr/clustering?q=*:*&rows=10

• http://wiki.apache.org/solr/ClusteringComponent

• Requires additional steps to install (see documentation) with Apache Solr distro

47

Page 48: Solr Powered Lucene

Terms

• Enumerates terms from specified fields

• http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=vi

48

Page 49: Solr Powered Lucene

Term Vectors

• Details term vector information: term frequency, document frequency, position and offset information

• http://localhost:8983/solr/select/?q=*%3A*&qt=tvrh&tv=true&tv.all=true

49

Page 50: Solr Powered Lucene

stats.jsp

• Not technically a “request handler”, outputs only XML

• http://localhost:8983/solr/admin/stats.jsp

• Index stats such as number of documents, searcher open time

• Request handler details, number of requests and errors, average request time,

50

Page 51: Solr Powered Lucene

Replication

• Master is polled

• Replicant pulls Lucene index and optionally also Solr configuration files

• Query throughput scaling: replicate and load balance

• http://wiki.apache.org/solr/SolrReplication

51

Page 52: Solr Powered Lucene

Distributed Search

• Distribute documents to same-schema shards

• Scaling for when single index becomes too large, or a single query becomes too slow

• http://wiki.apache.org/solr/DistributedSearch

52

Page 53: Solr Powered Lucene

What’s new in Solr 1.4?• Java-based replication

• VelocityResponseWriter (Solritas)

• AJAX-Solr

• Logging switched to SLF4J

• Rollback, since last commit

• StatsComponent

• TermVectorComponent

• Configurable Directory provider

53

Page 54: Solr Powered Lucene

Lucene 2.9

• IndexReader#reopen()

• Faster filter performance, by 300% in some cases

• Per-segment FieldCache

• Reusable token streams

• Faster numeric/date range queries, thanks to trie

• and tons more, see Lucene 2.9's CHANGES.txt

54

Page 55: Solr Powered Lucene

Performance Improvements

• Caching

• Concurrent file access

• Per-segment index updates

• Faceting

• DocSet generation, avoids scoring

• Streaming updates for SolrJ

55

Page 56: Solr Powered Lucene

Feature Improvements• Rich document

indexing

• DataImportHandler enhancements

• Smoother replication

• More choices for logging

• Multi-select faceting

• Speedier range queries

• Duplicate detection

• New request handler components

56

Page 57: Solr Powered Lucene

• http://wiki.apache.org/solr

[email protected]

• Lucid Imagination

• http://www.lucidimagination.com

• Articles, webinars, blogs, and...

• Search the Lucene ecosystem at:http://search.lucidimagination.com

[email protected]

Resources

57

Page 58: Solr Powered Lucene

Shout Out

58

Page 59: Solr Powered Lucene

e-book now available!print coming soon

http://www.manning.com/lucene

59

Page 60: Solr Powered Lucene

LucidWorks for Solr• Certified Distribution

• Value-added integration

• KStemmer

• Carrot2 clustering

• LucidGaze for Solr

• installer

• Reference Manual

• Solr 1.4 certified distro coming soon!

60

Page 61: Solr Powered Lucene

LucidGaze for Solr

• Monitoring tool, captures, stores, and interactively views Solr performance metrics

• requests/second

• time/request

61

Page 62: Solr Powered Lucene

62

Page 63: Solr Powered Lucene

LucidFind

http://search.lucidimagination.com/?q=novajug

63

Page 64: Solr Powered Lucene

64