Search Engine-Building with Lucene and Solr Kai Chan SoCal Code Camp, July 2013


DESCRIPTION

These are the slides for the session I presented at SoCal Code Camp San Diego on July 27, 2013. http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6b28337d-6eae-4003-a664-5ed719f43533

Page 1: Search Engine-Building with Lucene and Solr

Search Engine-Building with Lucene and Solr

Kai Chan
SoCal Code Camp, July 2013

Page 2: Search Engine-Building with Lucene and Solr

How to Search - One Approach

for each document d {
    if (query is a substring of d's content) {
        add d to the list of results
    }
}
sort the results (or not)
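The approach above can be sketched in Python as a linear scan. The function name `naive_search` and the sample documents are illustrative assumptions, not part of the original slides.

```python
def naive_search(documents, query):
    """Return the ids of documents whose content contains the query as a substring."""
    results = []
    for doc_id, content in enumerate(documents):
        # Every search reads every document in full.
        if query in content:
            results.append(doc_id)
    return results


documents = ["it is what it is", "what is it", "it is a banana"]
print(naive_search(documents, "what"))    # [0, 1]
print(naive_search(documents, "banana"))  # [2]
```

The cost of each call grows with the total size of the dataset, which is the problem the following slides address.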

Page 3: Search Engine-Building with Lucene and Solr

How to Search - Problems

● slow
    ○ reads the whole dataset for each search
● not scalable
    ○ if your dataset grows by 10x, your search slows down by 10x
● how to show the most relevant documents first?
    ○ list of results can be quite long
    ○ users have limited time and patience

Page 4: Search Engine-Building with Lucene and Solr

Inverted Index - Introduction

● like the "index" at the end of books
● a map of one of the following types
    ○ term → document list
    ○ term → <document, position> list

Page 5: Search Engine-Building with Lucene and Solr

documents:
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"

inverted index (without positions):
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

inverted index (with positions):
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}

Credit: Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)
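As a minimal sketch, the two index forms above can be built in a few lines of Python. Splitting on whitespace is an assumption made for brevity (real analyzers do much more), and `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to a list of (document id, position) pairs."""
    index = defaultdict(list)
    for doc_id, content in enumerate(documents):
        # Assumption: terms are separated by whitespace.
        for position, term in enumerate(content.split()):
            index[term].append((doc_id, position))
    return dict(index)


documents = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(documents)

# Dropping the positions gives the simpler term -> document list form.
doc_lists = {term: sorted({doc_id for doc_id, _ in postings})
             for term, postings in index.items()}

print(index["what"])    # [(0, 2), (1, 0)]
print(doc_lists["is"])  # [0, 1, 2]
```

Looking up a term is now a single dictionary access, independent of how much text the documents contain.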

Page 6: Search Engine-Building with Lucene and Solr

Inverted Index - Speed

● term list
    ○ typically very small
    ○ grows slowly
● term lookup
    ○ O(1) to O(log(number of terms))
● for a particular term
    ○ document lists: very small
    ○ document + position lists: still small
● few terms per query

Page 7: Search Engine-Building with Lucene and Solr

Inverted Index - Relevance

● information in the index enables:
    ○ determination (scoring) of relevance of each document to the query
    ○ comparison of relevance among documents
    ○ sorting by (decreasing) relevance
        ■ i.e. the most relevant document first
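To make the idea of scoring concrete, here is a small sketch that ranks documents with a basic TF-IDF weight. This is an assumed, simplified scheme chosen for illustration; it is not Lucene's actual scoring formula. For brevity it computes the counts from the tokenized documents directly, whereas an inverted index would store the same statistics.

```python
import math
from collections import Counter


def rank(documents, query):
    """Return document ids ordered by a basic TF-IDF score, most relevant first."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    scores = {}
    for doc_id, terms in enumerate(tokenized):
        counts = Counter(terms)
        score = 0.0
        for term in query.split():
            if counts[term] == 0:
                continue
            # Number of documents containing the term; rarer terms weigh more.
            doc_freq = sum(1 for other in tokenized if term in other)
            score += counts[term] * math.log(1 + n_docs / doc_freq)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


documents = ["it is what it is", "what is it", "it is a banana"]
print(rank(documents, "what it"))  # [0, 1, 2]
print(rank(documents, "banana"))   # [2]
```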

Page 8: Search Engine-Building with Lucene and Solr

Lucene vs. Solr - Lucene

● full-text search library
● creates, updates and reads from the index
● takes queries and produces search results
● your application creates objects and calls methods in the Lucene API
● provides building blocks for custom features

Page 9: Search Engine-Building with Lucene and Solr

Lucene vs. Solr - Solr

● full-text search server
● uses Lucene for indexing and search
● REST-like API over HTTP
● different output formats (e.g. XML, JSON)
● provides some features not built into Lucene

Page 10: Search Engine-Building with Lucene and Solr

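[Diagram: deployment of Lucene vs. Solr]

Lucene: a machine running a Java VM hosts your application, which embeds Lucene (Lucene code, libraries, index).

Solr: a client talks over HTTP to a machine running a Java VM, which hosts a servlet container (e.g. Tomcat, Jetty) running Solr (Solr code, Lucene code, libraries, index).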

Page 11: Search Engine-Building with Lucene and Solr

Workflow

Setup → Indexing → Search

Page 13: Search Engine-Building with Lucene and Solr

Workflow - Setup

● servlet configuration
    ○ e.g. port number, max POST size
    ○ you can usually use the default settings
● Solr configuration
    ○ e.g. data directory, deduplication, language identification, highlighting
    ○ you can usually use the default settings
● schema definition
    ○ defines fields in your documents
    ○ you can use the default settings if you name your fields in a certain way

Page 14: Search Engine-Building with Lucene and Solr

How Data Are Organized

[Diagram: a collection contains documents; each document contains fields]

Page 15: Search Engine-Building with Lucene and Solr

[Diagram: a field has a name (e.g. "title" or "price"), a type, options, and content (e.g. "please read" or 30)]

Page 16: Search Engine-Building with Lucene and Solr

[Diagram: an index of three documents with the fields subject, date, from, text and reply-to; not every document has every field]

Page 17: Search Engine-Building with Lucene and Solr

[Diagram: an index of three documents with different field sets: (subject, date, from, text), (title, SKU, price, description), (first name, last name, phone, address)]

Page 18: Search Engine-Building with Lucene and Solr

Solr Field Definition

● field
    ○ name (e.g. "subject")
    ○ type (e.g. "text_general")
    ○ options (e.g. indexed="true" stored="true")
● field type
    ○ text: "string", "text_general"
    ○ numeric: "int", "long", "float", "double"
● options
    ○ indexed: content can be searched
    ○ stored: content can be returned at search-time
    ○ multivalued: multiple values per field & document

Page 19: Search Engine-Building with Lucene and Solr

Solr Dynamic Field

● define field by naming convention
● "amount_i": int, indexed, stored
● "tag_ss": string, indexed, stored, multivalued

name    type          indexed  stored  multiValued
*_i     int           true     true    false
*_l     long          true     true    false
*_f     float         true     true    false
*_d     double        true     true    false
*_s     string        true     true    false
*_ss    string        true     true    true
*_t     text_general  true     true    false
*_txt   text_general  true     true    true
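The naming convention can be illustrated with a small Python lookup that mirrors the table above. This is only a model of the idea: `resolve_dynamic_field` is a hypothetical helper, and Solr's own pattern-matching rules (for example, how it chooses among several matching patterns) may differ from this first-match loop.

```python
from fnmatch import fnmatchcase

# (pattern, type, multivalued) rows copied from the table above.
DYNAMIC_FIELDS = [
    ("*_i", "int", False),
    ("*_l", "long", False),
    ("*_f", "float", False),
    ("*_d", "double", False),
    ("*_s", "string", False),
    ("*_ss", "string", True),
    ("*_t", "text_general", False),
    ("*_txt", "text_general", True),
]


def resolve_dynamic_field(name):
    """Return (type, multivalued) for a field name, or None if no pattern matches."""
    for pattern, field_type, multivalued in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type, multivalued
    return None


print(resolve_dynamic_field("amount_i"))  # ('int', False)
print(resolve_dynamic_field("tag_ss"))    # ('string', True)
print(resolve_dynamic_field("title"))     # None
```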

Page 20: Search Engine-Building with Lucene and Solr

Solr Copy Field

● copy one or more fields into another field
● can be used to define a catch-all field
    ○ source: "title", "author", "description"
    ○ destination: "text"
    ○ searching the "text" field has the effect of searching all three source fields

Page 21: Search Engine-Building with Lucene and Solr

Workflow

Setup → Indexing → Search

Page 22: Search Engine-Building with Lucene and Solr

Indexing - UpdateRequestHandler

● upload content or file to http://host:port/solr/update

● formats: XML, JSON, CSV

Page 23: Search Engine-Building with Lucene and Solr

XML:
<add>
  <doc>
    <field name="id">apple</field>
    <field name="compName">Apple</field>
    <field name="address">1 Infinite Way, Cupertino CA</field>
  </doc>
  <doc>
    <field name="id">asus</field>
    <field name="compName">ASUS Computer</field>
    <field name="address">800 Corporate Way Fremont, CA 94539</field>
  </doc>
</add>

CSV:
id,compName_s,address_s
apple,Apple,"1 Infinite Way, Cupertino CA"
asus,Asus Computer,"800 Corporate Way Fremont, CA 94539"

JSON:
[
  {"id":"apple","compName_s":"Apple","address_s":"1 Infinite Way, Cupertino CA"},
  {"id":"asus","compName_s":"Asus Computer","address_s":"800 Corporate Way Fremont, CA 94539"}
]
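As a sketch of how a client might upload the JSON documents, the Python below builds an HTTP POST for the update handler using only the standard library. The host, port and the `commit=true` parameter are assumptions about a local setup, and `build_update_request` is a hypothetical helper; the request is built but not sent, so the snippet runs without a Solr server.

```python
import json
import urllib.request

# Assumed local Solr instance; adjust host, port and path to your setup.
UPDATE_URL = "http://localhost:8983/solr/update?commit=true"

docs = [
    {"id": "apple", "compName_s": "Apple",
     "address_s": "1 Infinite Way, Cupertino CA"},
    {"id": "asus", "compName_s": "Asus Computer",
     "address_s": "800 Corporate Way Fremont, CA 94539"},
]


def build_update_request(documents, url=UPDATE_URL):
    """Build (but do not send) an HTTP POST carrying the documents as JSON."""
    body = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request(docs)
# To send it to a running Solr server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```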

Page 24: Search Engine-Building with Lucene and Solr

Indexing - DataImportHandler

● has its own config file (data-config.xml)
● import data from various sources
    ○ RDBMS (JDBC)
    ○ e-mail (IMAP)
    ○ XML data locally (file) or remotely (HTTP)
● transformers
    ○ extract data (RegEx, XPath)
    ○ manipulate data (strip HTML tags)

Page 25: Search Engine-Building with Lucene and Solr

Workflow

Setup → Indexing → Search

Page 26: Search Engine-Building with Lucene and Solr

Searching - Basics

● send request to http://host:port/solr/select
● parameters
    ○ q - main query
    ○ fq - filter query
    ○ defType - query parser (e.g. lucene, edismax)
    ○ fl - fields to return
    ○ sort - sort criteria
    ○ wt - response writer (e.g. xml, json)
    ○ indent - set to true for pretty-printing

Page 27: Search Engine-Building with Lucene and Solr

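http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json

    ○ http://localhost:8983/solr/select - search handler's URL
    ○ q=title:tablet - main query
    ○ fl=title,price,inStock - fields to return
    ○ sort=price+asc - sort criteria (field name and direction)
    ○ wt=json - response writer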
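A request URL like the one above can be assembled with Python's standard library, which also takes care of URL encoding. `build_select_url` is a hypothetical helper; the sort direction `asc` is spelled out here on the assumption that Solr expects a direction after the field name.

```python
from urllib.parse import urlencode


def build_select_url(base_url, **params):
    """Append URL-encoded query parameters to the search handler's URL."""
    return base_url + "?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr/select",
    q="title:tablet",             # main query
    fl="title,price,inStock",     # fields to return
    sort="price asc",             # sort criteria
    wt="json",                    # response writer
)
print(url)
```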

Page 28: Search Engine-Building with Lucene and Solr

Searching - Query Syntax - Field

● search a specific field
    ○ field_name:value
● if field omitted, Solr uses default field:
    ○ df parameter in URL
    ○ defaultSearchField setting in schema.xml
    ○ "text"

Page 29: Search Engine-Building with Lucene and Solr

Searching - Query Syntax - Term

● a term by itself: matches documents that contain that term
    ○ e.g. tablet

Page 30: Search Engine-Building with Lucene and Solr

Searching - Query Syntax - Boolean

● boolean operators are supported
    ○ AND &&
    ○ OR ||
    ○ NOT !
● e.g. a AND b
    ○ all of a, b must occur
● e.g. a OR b
    ○ at least one of a, b must occur
● e.g. a AND NOT b
    ○ a must occur and b must not occur
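In terms of the inverted index, these three examples correspond to set operations on the document lists of the terms. The sketch below uses made-up document lists for the terms a and b; it models only which documents match, not how they are scored.

```python
# Hypothetical document lists (term -> set of document ids).
doc_lists = {
    "a": {0, 1, 2},
    "b": {1, 2, 3},
}

a, b = doc_lists["a"], doc_lists["b"]

a_and_b = sorted(a & b)      # a AND b: documents containing both terms
a_or_b = sorted(a | b)       # a OR b: documents containing at least one term
a_and_not_b = sorted(a - b)  # a AND NOT b: documents with a but without b

print(a_and_b)      # [1, 2]
print(a_or_b)       # [0, 1, 2, 3]
print(a_and_not_b)  # [0]
```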

Page 31: Search Engine-Building with Lucene and Solr

Searching - Query Syntax - Boolean

● Lucene/Solr's boolean operators are not true boolean operators

● e.g. a OR b OR c does not behave like (a OR b) OR c
    ○ instead, a OR b OR c means at least one of a, b, c must occur
● parentheses are supported

Page 32: Search Engine-Building with Lucene and Solr

Searching - Query Syntax - Boolean

● "+" prefix means "must"
● "-" prefix means "must not"
● no prefix means "should": optional, but at least one must occur when the query has no "+" clause (by default)
    ○ e.g. a b c
        ■ at least one of a, b, c must occur
● operators can mix
    ○ e.g. +a b c d -e
        ■ a must occur
        ■ b, c, d are optional; documents that contain them score higher
        ■ e must not occur

Page 33: Search Engine-Building with Lucene and Solr

Searching - Query Syntax - Phrase

● phrases are enclosed by double-quotes
● e.g. +"the phrase"
    ○ the phrase must occur
● e.g. -"the phrase"
    ○ the phrase must not occur

Page 34: Search Engine-Building with Lucene and Solr

Searching - Query Syntax - Boost

● manually assign different weights to clauses
● gives more weight to a field
    ○ e.g. title:a^10 body:a
● gives more weight to a word
    ○ e.g. title:a title:b^10
● gives phrases more weight than words
    ○ e.g. title:(+a +b) title:"a b"^10

Page 35: Search Engine-Building with Lucene and Solr

Searching - Query Syntax - Range

● matches field values within a range
    ○ inclusive range - denoted by square brackets
    ○ exclusive range - denoted by curly brackets
● e.g. age:[10 TO 20]
    ○ matches the field "age" with the value in 10..20
● string or numeric comparison, depending on the field's type
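The difference between inclusive and exclusive ranges, and between string and numeric comparison, can be shown with a short Python sketch. `in_range` is a hypothetical helper that models the matching rule only; it is not how Solr evaluates range queries internally.

```python
def in_range(value, low, high, inclusive=True):
    """Return True if value lies within [low, high] (or (low, high) if not inclusive)."""
    if inclusive:
        return low <= value <= high
    return low < value < high


# Inclusive vs. exclusive bounds.
print(in_range(10, 10, 20))                   # True
print(in_range(10, 10, 20, inclusive=False))  # False

# Numeric vs. string comparison give different answers for the same digits.
print(in_range(100, 10, 20))        # False: 100 is greater than 20
print(in_range("100", "10", "20"))  # True: "100" sorts between "10" and "20"
```

The last two lines show why the field's type matters: a number stored in a string field is compared character by character.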

Page 36: Search Engine-Building with Lucene and Solr

Searching - Query Syntax - EDisMax

● suitable for user-generated queries
    ○ supports a subset of Lucene QP's syntax
    ○ does not complain about the syntax
    ○ searches for individual words across several fields ("disjunction")
    ○ uses max score of a word in all fields for scoring ("max")
● configurable (in solrconfig.xml)
    ○ what fields to search the words in
    ○ weighting of these fields
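The "disjunction max" idea can be sketched as follows: a word is scored in each field, each score is multiplied by that field's weight, and the best result is kept. The function name, the input scores and the optional `tie` factor (which adds a fraction of the other fields' scores) are assumptions for illustration; the real parser has more parameters than this model covers.

```python
def dismax_score(field_scores, boosts, tie=0.0):
    """Combine per-field scores of one word: the best field wins, others add tie * score."""
    weighted = [score * boosts.get(field, 1.0)
                for field, score in field_scores.items()]
    if not weighted:
        return 0.0
    best = max(weighted)
    return best + tie * (sum(weighted) - best)


# Hypothetical per-field scores for one word, with the title field weighted 4x.
field_scores = {"title": 0.5, "body": 1.0}
boosts = {"title": 4.0}

print(dismax_score(field_scores, boosts))           # 2.0
print(dismax_score(field_scores, boosts, tie=0.5))  # 2.5
```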

Page 37: Search Engine-Building with Lucene and Solr

Sorting

● default: sorting by decreasing score
● sorting by field: using the sort parameter
    ○ specify field name and order
        ■ price asc - sort by "price" field, ascending
        ■ price desc - sort by "price" field, descending
    ○ multiple fields and orders, separated by commas
        ■ starRating desc, price asc - sort by "starRating" field, descending, and then by "price" field, ascending
    ○ cannot use multivalued fields
    ○ overrides sorting by decreasing relevance
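The effect of a multi-field sort such as starRating desc, price asc can be mimicked in Python with a compound sort key. The product records below are made up for illustration.

```python
# Hypothetical search results.
products = [
    {"name": "A", "starRating": 4, "price": 300},
    {"name": "B", "starRating": 5, "price": 500},
    {"name": "C", "starRating": 5, "price": 400},
]

# starRating descending (negated), then price ascending as the tie-breaker.
ordered = sorted(products, key=lambda p: (-p["starRating"], p["price"]))
names = [p["name"] for p in ordered]
print(names)  # ['C', 'B', 'A']
```

The second field only matters among documents that tie on the first, which is why C (cheaper) comes before B although both have five stars.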

Page 38: Search Engine-Building with Lucene and Solr

Faceted Search

● facet values: (distinct) values or (generally non-overlapping) ranges of a field
● displaying facets
    ○ show possible values
    ○ let users narrow down their searches easily

Page 39: Search Engine-Building with Lucene and Solr

[Screenshot: a facet and its facet values (5 of them)]

Page 40: Search Engine-Building with Lucene and Solr

Faceted Search

● set facet parameter to true - enables faceting
● other parameters
    ○ facet.field - use the field's values as facets
        ■ return <value, count> pairs
    ○ facet.query - use the given queries as facets
        ■ return <query, count> pairs
    ○ facet.sort - set the ordering of the facets
        ■ can be "count" or "index"
    ○ facet.offset and facet.limit - used for pagination of facets
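What facet.field returns can be modeled in Python by counting the distinct values of a field across the matching documents. `facet_field` and the sample documents are hypothetical; the `sort`, `offset` and `limit` arguments imitate the parameters listed above but are not Solr's implementation.

```python
from collections import Counter


def facet_field(documents, field, sort="count", offset=0, limit=100):
    """Return (value, count) pairs for a field, ordered by count or by value."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    if sort == "count":
        # Highest count first; ties broken by value.
        pairs = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    else:
        # "index" order: sorted by the value itself.
        pairs = sorted(counts.items())
    return pairs[offset:offset + limit]


docs = [{"brand": "asus"}, {"brand": "apple"}, {"brand": "asus"}, {"brand": "dell"}]
print(facet_field(docs, "brand"))                     # [('asus', 2), ('apple', 1), ('dell', 1)]
print(facet_field(docs, "brand", sort="index"))       # [('apple', 1), ('asus', 2), ('dell', 1)]
print(facet_field(docs, "brand", offset=1, limit=1))  # [('apple', 1)]
```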

Page 41: Search Engine-Building with Lucene and Solr

Resources - Books

● Lucene in Action
    ○ written by three Lucene committers and PMC members
    ○ somewhat outdated (2010; covers Lucene 3.0)
    ○ http://www.manning.com/hatcher3/
● Solr in Action
    ○ early access; coming out later this year
    ○ http://www.manning.com/grainger/
● Apache Solr 4 Cookbook
    ○ common problems and useful tips
    ○ http://www.packtpub.com/apache-solr-4-cookbook/book

Page 42: Search Engine-Building with Lucene and Solr

Resources - Books

● Introduction to Information Retrieval
    ○ not specific to Lucene/Solr, but about IR concepts
    ○ free e-book
    ○ http://nlp.stanford.edu/IR-book/
● Managing Gigabytes
    ○ indexing, compression and other topics
    ○ accompanied by MG4J - a full-text search engine
    ○ http://mg4j.di.unimi.it/

Page 43: Search Engine-Building with Lucene and Solr

Resources - Web

● official websites
    ○ Lucene Core - http://lucene.apache.org/core/
    ○ Solr - http://lucene.apache.org/solr/
● mailing lists
● Wiki sites
    ○ Lucene Core - http://wiki.apache.org/lucene-java/
    ○ Solr - http://wiki.apache.org/solr/
● reference guides
    ○ API Documentation for Lucene and Solr
    ○ Apache Solr Reference Guide (LucidWorks) - http://lucene.apache.org/solr/tutorial.html

Page 44: Search Engine-Building with Lucene and Solr

Getting Started

● download Solr
    ○ requires Java 6 or newer to run
● Solr comes bundled and configured with Jetty
    ○ <Solr directory>/example/start.jar
● "exampledocs" directory contains sample documents
    ○ <Solr directory>/example/exampledocs/post.jar
● use the Solr admin interface
    ○ http://localhost:8983/solr/