Search Engine Building with Lucene and Solr (SoCal Code Camp San Diego 2014)


Slides for my presentation at SoCal Code Camp, June 29, 2014 (http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6337660f-37de-4d6e-a5bc-46ba54478e5e)


Search Engine-Building with Lucene and Solr

Kai Chan
SoCal Code Camp, June 2014

http://bit.ly/sdcodecamp2014solr

[Figure: nested regions: all data, matched data, data that a user actually sees]

Lucene

● full-text search library
● creates, updates and reads from the index
● takes queries and produces search results
● your application creates objects and calls methods in the Lucene API
● provides building blocks for custom features

Solr

● full-text search platform
● uses Lucene for indexing and search
● REST-like API over HTTP
● different output formats (e.g. XML, JSON)
● provides some features not built into Lucene

[Figure: deployment comparison]

Lucene: machine running Java VM > your application > Lucene code (libraries) + index

Solr: machine running Java VM > servlet container (e.g. Tomcat, Jetty) > Solr > Solr code + Lucene code (libraries) + index; the client talks to Solr over HTTP

How Data Are Organized

[Figure: a collection contains documents; each document contains fields; each field has a name (e.g. "title" or "price"), content (e.g. "please read" or 30), a type, and options]

[Figure: a collection of three e-mail documents with fields drawn from subject, date, from, text, and reply-to; not every document has every field]

[Figure: a collection of three documents with different sets of fields: (subject, date, from, text), (title, SKU, price, description), (first name, last name, phone, address)]

Solr Field Definition

● field
    o name (e.g. "subject")
    o type (e.g. "text_general")
    o options (e.g. indexed="true" stored="true")
● field type
    o text: "string", "text_general"
    o numeric: "int", "long", "float", "double"
● options
    o indexed: content can be searched
    o stored: content can be returned at search-time
    o multivalued: multiple values per field & document
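As a rough illustration of how a name, a type, and options combine into one field definition, the sketch below renders <field> elements in the style of Solr's schema.xml. The helper name field_xml and the "tags" field are hypothetical; the attribute names (name, type, indexed, stored, multiValued) are the ones Solr's schema uses.

```python
import xml.etree.ElementTree as ET

def field_xml(name, field_type, indexed=True, stored=True, multi_valued=False):
    """Render one <field> element (hypothetical helper, not part of Solr)."""
    attrs = {
        "name": name,
        "type": field_type,
        "indexed": str(indexed).lower(),    # content can be searched
        "stored": str(stored).lower(),      # content can be returned at search-time
        "multiValued": str(multi_valued).lower(),
    }
    return ET.tostring(ET.Element("field", attrs), encoding="unicode")

subject_field = field_xml("subject", "text_general")
price_field = field_xml("price", "float")
tags_field = field_xml("tags", "string", multi_valued=True)  # "tags" is an example name

for line in (subject_field, price_field, tags_field):
    print(line)
```

The printed elements are meant to be pasted into a schema by hand; a real deployment would normally edit schema.xml directly.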

Solr Dynamic Field

● define field by naming convention
● "amount_i": int, indexed, stored
● "tag_ss": string, indexed, stored, multivalued
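The naming-convention idea can be sketched as a lookup from a name pattern to a field definition. This is only a toy model of the matching, not Solr's implementation; the rule table and the function resolve_dynamic_field are hypothetical, and the two patterns are assumed from the "amount_i" and "tag_ss" examples.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table mirroring the two naming conventions above.
DYNAMIC_FIELDS = [
    ("*_i", {"type": "int", "indexed": True, "stored": True, "multiValued": False}),
    ("*_ss", {"type": "string", "indexed": True, "stored": True, "multiValued": True}),
]

def resolve_dynamic_field(name):
    """Return the definition of the first pattern that matches, or None."""
    for pattern, definition in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return definition
    return None

print(resolve_dynamic_field("amount_i"))
print(resolve_dynamic_field("tag_ss"))
print(resolve_dynamic_field("title"))  # no convention applies
```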

Solr Copy Field

● copy one or more fields into another field
● can be used to define a catch-all field
    o source: "title", "author", "content"
    o destination: "text"
    o searching the "text" field has the effect of searching all the other three fields
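The effect of a catch-all field can be mimicked in a few lines. This is a toy model of what ends up indexed, assuming the source and destination names from the example above; apply_copy_fields and the sample document are hypothetical, and Solr performs the copy itself at index time.

```python
COPY_SOURCES = ["title", "author", "content"]
DESTINATION = "text"

def apply_copy_fields(document):
    """Return a copy of the document with the source values gathered into "text"."""
    indexed = dict(document)
    # The destination receives one value per source field, so it must be multivalued.
    indexed[DESTINATION] = [document[f] for f in COPY_SOURCES if f in document]
    return indexed

doc = {"title": "Solr basics", "author": "A. Writer", "content": "full-text search"}
indexed_doc = apply_copy_fields(doc)
print(indexed_doc["text"])
```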

Indexing - UpdateRequestHandler

● upload (POST) content or file to http://host:port/solr/update

● formats: XML, JSON, CSV
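A minimal sketch of such an upload, using JSON and only the Python standard library. The base URL, the sample document, and the helper build_update_request are assumptions for illustration; commit=true asks Solr to make the new documents searchable right away. The request is built but not sent, so the snippet runs without a Solr instance.

```python
import json
import urllib.request

def build_update_request(docs, base_url="http://localhost:8983/solr"):
    """Build (but do not send) a POST of JSON documents to the update handler."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sample document; the field names are examples only.
request = build_update_request([{"id": "1", "title": "tablet", "price": 299.0}])
print(request.get_method(), request.full_url)

# With a running Solr instance, this would send it:
# urllib.request.urlopen(request)
```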

Indexing - DataImportHandler

● has its own config file (data-config.xml)
● import data from various sources
    o RDBMS (JDBC)
    o e-mail (IMAP)
    o XML data locally (file) or remotely (HTTP)
● transformers
    o extract data (RegEx, XPath)
    o manipulate data (strip HTML tags)

Indexing - ExtractingRequestHandler

● allows indexing of different formats
    o e.g. PDF, MS Word, XML
● extract text and metadata
● maps extracted text to the "content" field
● maps metadata to different fields

Searching - Basics

● send request to http://host:port/solr/select
● parameters
    o q - main query
    o fq - filter query
    o defType - query parser (e.g. lucene, edismax)
    o fl - fields to return
    o sort - sort criteria
    o wt - response writer (e.g. xml, json)
    o indent - set to true for pretty-printing

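http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json

    o search handler's URL: http://localhost:8983/solr/select
    o main query: q=title:tablet
    o fields to return: fl=title,price,inStock
    o sort criteria: sort=price+asc
    o response writer: wt=json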
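One way to assemble such a request is to let a URL library handle the escaping. The sketch below uses only the Python standard library; build_search_url is a hypothetical helper, the explicit "asc" direction and indent=true are choices made for this example, and the host and port are those of the example URL.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_search_url(params, base_url="http://localhost:8983/solr/select"):
    """Append URL-encoded query parameters to the search handler's URL."""
    return base_url + "?" + urlencode(params)

url = build_search_url({
    "q": "title:tablet",           # main query
    "fl": "title,price,inStock",   # fields to return
    "sort": "price asc",           # sort criteria (field plus direction)
    "wt": "json",                  # response writer
    "indent": "true",              # pretty-print the response
})
print(url)

# Decoding the query string again shows the parameters Solr would receive.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```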

Searching - Query Syntax

name:tablet

name:"galaxy tab"

name:tablet category:tablet

+name:tablet +category:tablet

Searching - Query Syntax (cont.)

+name:tablet +(manu:apple manu:samsung)

+name:tablet -manu:apple

+name:tablet +range:[300 TO 500]

+name:tablet manu:apple^5
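The example queries can be read as combinations of a few building blocks: field:value clauses, quoted phrases, the + (required) and - (prohibited) prefixes, parenthesized groups, [a TO b] ranges, and ^ boosts. The sketch below rebuilds some of them from those pieces; clause and any_of are hypothetical helpers, not part of any Solr client library.

```python
def clause(field, value, occur="", boost=None):
    """Build one field:value clause.

    occur is "" (optional), "+" (required) or "-" (prohibited).
    Values containing spaces are quoted as phrases, except [a TO b] ranges.
    """
    if " " in value and not value.startswith("["):
        value = '"' + value + '"'
    text = f"{occur}{field}:{value}"
    if boost is not None:
        text += f"^{boost}"   # raise this clause's weight in the score
    return text

def any_of(*clauses, occur="+"):
    """Group clauses in parentheses; by default the group as a whole is required."""
    return f"{occur}(" + " ".join(clauses) + ")"

phrase = clause("name", "galaxy tab")
either_brand = " ".join([
    clause("name", "tablet", "+"),
    any_of(clause("manu", "apple"), clause("manu", "samsung")),
])
not_apple = " ".join([clause("name", "tablet", "+"), clause("manu", "apple", "-")])
in_range = " ".join([clause("name", "tablet", "+"), clause("range", "[300 TO 500]", "+")])
boosted = " ".join([clause("name", "tablet", "+"), clause("manu", "apple", boost=5)])

for query in (phrase, either_brand, not_apple, in_range, boosted):
    print(query)
```

With the standard Lucene parser and its default OR operator, clauses without a prefix are optional: they are not needed for a match but raise the score of documents that contain them.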

EDisMax Parser

● suitable for user-generated queries
    o does not complain about the syntax
    o does not require field name in query
    o searches across several fields
● configurable

Sorting

● default: sorting by decreasing score
● custom sorting rules: use the sort parameter
    o syntax: fieldName (asc|desc)
    o e.g. sort by ascending price (i.e. lowest price first): price asc
    o e.g. sort by descending date (i.e. newest date first): date desc

Sorting

● multiple fields and orders: separate by commas
    o e.g. sort by descending starRating and ascending price: starRating desc, price asc

Sorting

● cannot use multivalued fields
● overrides the default sorting behavior
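The sort syntax is simple enough to generate from (field, direction) pairs. A minimal sketch, with sort_param as a hypothetical helper that also rejects anything other than asc or desc:

```python
def sort_param(*criteria):
    """Turn (field, direction) pairs into a value for the sort parameter."""
    parts = []
    for field, direction in criteria:
        if direction not in ("asc", "desc"):
            raise ValueError(f"direction must be 'asc' or 'desc', got {direction!r}")
        parts.append(f"{field} {direction}")
    return ", ".join(parts)

by_price = sort_param(("price", "asc"))
by_rating_then_price = sort_param(("starRating", "desc"), ("price", "asc"))
print(by_price)
print(by_rating_then_price)
```

The resulting string still needs URL encoding when it is placed in a request.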

Faceted Search

● facet values: (distinct) values or (generally non-overlapping) ranges of a field
● displaying facets
    o show possible values
    o let users narrow down their searches easily

[Figure: a facet and its facet values (5 of them)]

Faceted Search

● set facet parameter to true - enables faceting
● other parameters
    o facet.field - use the field's values as facets; returns <value, count> pairs
    o facet.query - use the given queries as facets; returns <query, count> pairs
    o facet.sort - set the ordering of the facets; can be "count" or "index"
    o facet.offset and facet.limit - used for pagination of facets
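The facet parameters combine with an ordinary search request. The sketch below builds such a query string with the Python standard library; facet_params is a hypothetical helper, and the field "manu", the two price ranges, and the match-all query are example values only.

```python
from urllib.parse import urlencode, parse_qsl

def facet_params(fields=(), queries=(), sort="count", offset=0, limit=10):
    """Return faceting parameters as (name, value) pairs; names may repeat."""
    params = [("facet", "true")]                          # enables faceting
    params += [("facet.field", f) for f in fields]        # <value, count> pairs
    params += [("facet.query", q) for q in queries]       # <query, count> pairs
    params += [
        ("facet.sort", sort),                             # "count" or "index"
        ("facet.offset", str(offset)),                    # pagination of facets
        ("facet.limit", str(limit)),
    ]
    return params

query_string = urlencode(
    [("q", "*:*")]
    + facet_params(fields=["manu"], queries=["price:[0 TO 299]", "price:[300 TO 500]"])
)
print(query_string)

decoded_pairs = parse_qsl(query_string)
```

A list of pairs is used instead of a dict because facet.field and facet.query can each appear more than once in one request.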

Spatial Search

● data: locations (longitudes, latitudes)
● search: filter and/or sort by location

Filter by Location

● geofilt
    o circle centered at a given point
    o distance from a given point
    o fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
● bbox
    o square (“bounding box”) centered at a given point
    o distance from a given point + corners
    o fq={!bbox sfield=store}&pt=45.15,-93.85&d=5

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>

[Figure: geofilt and bbox side by side, each drawn with a distance of 5 km around (45.15, -93.85), with sample points marked o or x]
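Both location filters take the same three inputs (a location field, a center point, and a distance in kilometers) and differ only in the filter name. A minimal sketch of building the parameters; spatial_filter_params is a hypothetical helper and "store" is the field name used in the examples above.

```python
from urllib.parse import urlencode

def spatial_filter_params(kind, sfield, lat, lon, distance_km):
    """Build the fq, pt and d parameters for a geofilt or bbox filter."""
    if kind not in ("geofilt", "bbox"):
        raise ValueError("kind must be 'geofilt' or 'bbox'")
    return {
        "fq": "{!%s sfield=%s}" % (kind, sfield),  # filter query with local params
        "pt": f"{lat},{lon}",                      # center point: latitude,longitude
        "d": str(distance_km),                     # distance in kilometers
    }

circle = spatial_filter_params("geofilt", "store", 45.15, -93.85, 5)
square = spatial_filter_params("bbox", "store", 45.15, -93.85, 5)
print(urlencode(circle))
print(urlencode(square))
```

Since the square contains the circle, bbox can match points near its corners that geofilt would exclude.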

Sort by Location

● geodist
    o returns the distance between the location given in a field and a certain coordinate
    o e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score: q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
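The distance-sorting request above can be assembled the same way as any other search request. In this sketch geodist_sort_params is a hypothetical helper; the parameter values are the ones from the example.

```python
from urllib.parse import urlencode, parse_qs

def geodist_sort_params(sfield, lat, lon):
    """Use geodist() as the query so the score is the distance, then sort by it."""
    return {
        "q": "{!func}geodist()",   # function query: score = distance
        "sfield": sfield,          # field holding each document's location
        "pt": f"{lat},{lon}",      # reference point: latitude,longitude
        "sort": "score asc",       # nearest first
    }

nearest_first = geodist_sort_params("store", 45.15, -93.85)
encoded = urlencode(nearest_first)
print(encoded)
```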

Scaling/Redundancy

problem                                      solution
collection too large for a single machine    distribution
too many requests for a single machine       distribution
a machine can go down                        replication

SolrCloud

● Solr instances
    o collection (logical index) divided into one or more partial collections (“shards”)
    o for each shard, one or more Solr instances keep copies of the data: one as leader (handles reads and writes), others as replicas (handle reads)
● ZooKeeper instances

SolrCloud

● Solr instances
● ZooKeeper instances
    o management of Solr instances
    o leader election
    o node discovery

[Figure sequence: a collection (i.e. logical index) split into three shards, each holding ⅓ of the collection; every shard has one leader and one or more replicas. In the later frames, the leader of one shard goes offline and a replica of that shard becomes its leader; the former leader then appears again as a replica.]
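The leader and replica roles within one shard can be modeled with a toy class. This is a simplified sketch only: in SolrCloud the election is coordinated through ZooKeeper, whereas here the first remaining replica is simply promoted. The class, method, and node names are all hypothetical.

```python
class Shard:
    """Toy model of one shard: a leader plus zero or more replicas."""

    def __init__(self, name, nodes):
        self.name = name
        self.leader = nodes[0]
        self.replicas = list(nodes[1:])

    def write_targets(self):
        # Writes are handled by the leader.
        return [self.leader]

    def read_targets(self):
        # Reads can be served by the leader or any replica.
        return [self.leader] + self.replicas

    def leader_failed(self):
        """Promote a replica when the leader goes offline; return the failed node."""
        if not self.replicas:
            raise RuntimeError(f"{self.name} has no replica to promote")
        failed = self.leader
        self.leader = self.replicas.pop(0)  # stand-in for a real election
        return failed

    def rejoin(self, node):
        # A node that comes back online rejoins as a replica.
        self.replicas.append(node)

shard = Shard("shard2", ["node-a", "node-b", "node-c"])
offline = shard.leader_failed()
shard.rejoin(offline)
print(shard.leader, shard.replicas)
```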

Resources - Books

● Solr in Action
    o just released, up-to-date
    o http://www.manning.com/grainger/
● Apache Solr 4 Cookbook
    o common problems and useful tips
    o http://www.packtpub.com/apache-solr-4-cookbook/book
● Lucene in Action
    o written by 3 committers and PMC members
    o somewhat outdated (2010; covers Lucene 3.0)
    o http://www.manning.com/hatcher3/

Resources - Books

● Introduction to Information Retrieval
    o not specific to Lucene/Solr, but about IR concepts
    o free e-book
    o http://nlp.stanford.edu/IR-book/
● Managing Gigabytes
    o indexing, compression and other topics
    o accompanied by MG4J - a full-text search software
    o http://mg4j.di.unimi.it/

Resources - Web

● official website
    o http://lucene.apache.org/
    o Wiki
    o reference guide
    o mailing list
● StackOverflow
    o http://stackoverflow.com/
    o “Lucene” and “Solr” tags

Getting Started

● download Solr
    o requires Java 7 or newer to run
● Solr comes bundled/configured with Jetty
    o <Solr directory>/example/start.jar
● "exampledocs" directory contains sample documents
    o <Solr directory>/example/exampledocs/post.jar
    o java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml
● use the Solr admin interface
    o http://localhost:8983/solr/

Thanks for Coming!

● Java Performance Tips @ 10:15, same room
● slides available
    o http://bit.ly/sdcodecamp2014solr
● please vote for my conference session
    o http://bit.ly/tvnews2014
● questions/feedback
    o kai@ssc.ucla.edu
● questions?