Search Engine- Building with Lucene and Solr Kai Chan SoCal Code Camp, June 2014 http://bit.ly/sdcodecamp2014solr

Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)


DESCRIPTION

Slides for my presentation at SoCal Code Camp, June 29, 2014 (http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6337660f-37de-4d6e-a5bc-46ba54478e5e)



Search Engine-Building with Lucene and Solr

Kai Chan
SoCal Code Camp, June 2014

http://bit.ly/sdcodecamp2014solr


all data

matched data

data that a user actually sees


Lucene

● full-text search library
● creates, updates and reads from the index
● takes queries and produces search results
● your application creates objects and calls methods in the Lucene API
● provides building blocks for custom features


Solr

● full-text search platform
● uses Lucene for indexing and search
● REST-like API over HTTP
● different output formats (e.g. XML, JSON)
● provides some features not built into Lucene


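[Diagram: how Lucene and Solr are deployed]

Lucene: your application runs on a machine running a Java VM and calls the Lucene code libraries, which work with the index.

Solr: a client talks over HTTP to Solr, which runs inside a servlet container (e.g. Tomcat, Jetty) on a machine running a Java VM; the Solr code uses the Lucene code libraries, which work with the index.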


How Data Are Organized

[Diagram: a collection contains documents; each document contains fields]


field

content (e.g. "please read" or 30)

name (e.g. "title" or "price")

type

options


[Diagram: a collection of three e-mail-like documents with fields such as subject, date, from, text and reply-to; not every document has every field]


[Diagram: a collection of three documents with entirely different fields]

● document 1: subject, date, from, text
● document 2: title, SKU, price, description
● document 3: first name, last name, phone, address
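As a rough illustration of the diagrams above, a collection can be pictured as a list of documents, each mapping field names to content. Only the field names come from the slides; every value below is invented for the example.

```python
# A collection pictured as a list of documents; each document maps
# field names to content. All values are invented for illustration.
collection = [
    {"subject": "please read", "date": "2014-06-29",
     "from": "someone@example.com", "text": "..."},
    {"title": "tablet", "SKU": "T-100", "price": 30, "description": "..."},
    {"first name": "Pat", "last name": "Lee",
     "phone": "555-0100", "address": "..."},
]


def field_names(doc):
    """Return the sorted field names of one document."""
    return sorted(doc)


# Documents in the same collection need not share any fields.
shared = set(collection[0]) & set(collection[1]) & set(collection[2])
print(shared)  # set()
```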


Solr Field Definition

● field
  o name (e.g. "subject")
  o type (e.g. "text_general")
  o options (e.g. indexed="true" stored="true")
● field type
  o text: "string", "text_general"
  o numeric: "int", "long", "float", "double"
● options
  o indexed: content can be searched
  o stored: content can be returned at search-time
  o multivalued: multiple values per field & document


Solr Dynamic Field

● define field by naming convention
● "amount_i": int, indexed, stored
● "tag_ss": string, indexed, stored, multivalued
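The naming convention can be sketched as a suffix lookup. The table below is hypothetical and covers only the two suffixes named on the slide; in a real Solr setup the dynamic-field rules in the schema decide the type and options.

```python
# Hypothetical lookup table for the two suffixes named on the slide.
# In a real deployment the schema's dynamic-field rules decide this.
SUFFIX_RULES = {
    "_i": {"type": "int", "indexed": True, "stored": True,
           "multivalued": False},
    "_ss": {"type": "string", "indexed": True, "stored": True,
            "multivalued": True},
}


def resolve_dynamic_field(name):
    """Return the options implied by a field name's suffix, or None."""
    # Check longer suffixes first so the most specific rule wins.
    for suffix in sorted(SUFFIX_RULES, key=len, reverse=True):
        if name.endswith(suffix):
            return SUFFIX_RULES[suffix]
    return None
```

With this table, `resolve_dynamic_field("tag_ss")` reports a multivalued string field, while a name with no known suffix, such as `"title"`, returns `None` and would need an explicit field definition.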


Solr Copy Field

● copy one or more fields into another field
● can be used to define a catch-all field
  o source: "title", "author", "content"
  o destination: "text"
  o searching the "text" field has the effect of searching all the other three fields
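The effect of a catch-all field can be imitated in a few lines. This is only a toy model: Solr copies the values at index time and analyzes the text, whereas the substring match below is a stand-in, and the sample document is invented.

```python
def apply_copy_field(doc, sources, destination):
    """Return a copy of doc whose destination field holds all source values."""
    result = dict(doc)
    result[destination] = [doc[name] for name in sources if name in doc]
    return result


def matches(doc, field, term):
    """Naive case-insensitive substring match against one field."""
    value = doc[field]
    values = value if isinstance(value, list) else [value]
    return any(term.lower() in item.lower() for item in values)


# Invented sample document using the field names from the slide.
doc = {"title": "Galaxy Tab review", "author": "Pat Lee",
       "content": "a fine tablet"}
indexed = apply_copy_field(doc, ["title", "author", "content"], "text")

print(matches(indexed, "text", "pat lee"))   # True
print(matches(indexed, "title", "pat lee"))  # False
```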


Indexing - UpdateRequestHandler

● upload (POST) content or file to http://host:port/solr/update

● formats: XML, JSON, CSV
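A minimal sketch of a JSON upload, built with Python's standard library. It assumes a Solr instance at localhost:8983, the `commit=true` parameter to make the documents searchable right away, and made-up field names; the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_update_request(docs, base_url="http://localhost:8983/solr"):
    """Build (but do not send) a POST that uploads docs as JSON."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Field names and values are assumptions for the example.
request = build_update_request([{"id": "1", "title": "tablet", "price": 30}])
# To send it to a running Solr: urllib.request.urlopen(request)
```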


Indexing - DataImportHandler

● has its own config file (data-config.xml)
● import data from various sources
  o RDBMS (JDBC)
  o e-mail (IMAP)
  o XML data locally (file) or remotely (HTTP)
● transformers
  o extract data (RegEx, XPath)
  o manipulate data (strip HTML tags)


Indexing - ExtractingRequestHandler

● allows indexing of different formats
  o e.g. PDF, MS Word, XML
● extracts text and metadata
● maps extracted text to the “content” field
● maps metadata to different fields


Searching - Basics

● send request to http://host:port/solr/select
● parameters
  o q - main query
  o fq - filter query
  o defType - query parser (e.g. lucene, edismax)
  o fl - fields to return
  o sort - sort criteria
  o wt - response writer (e.g. xml, json)
  o indent - set to true for pretty-printing


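http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json

● search handler's URL: http://localhost:8983/solr/select
● main query: q=title:tablet
● fields to return: fl=title,price,inStock
● sort criteria: sort=price+asc
● response writer: wt=json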
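The same request can be assembled programmatically so that special characters are escaped correctly. This sketch assumes the localhost URL from the slide and an explicit `asc` sort direction; it only builds the URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(params, base_url="http://localhost:8983/solr/select"):
    """Return a search URL with every parameter percent-encoded."""
    return base_url + "?" + urlencode(params)


url = build_select_url({
    "q": "title:tablet",            # main query
    "fl": "title,price,inStock",    # fields to return
    "sort": "price asc",            # sort criteria
    "wt": "json",                   # response writer
})

# Decoding the query string gives back the original parameter values.
decoded = parse_qs(urlsplit(url).query)
print(decoded["sort"])  # ['price asc']
```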


Searching - Query Syntax

name:tablet

name:"galaxy tab"

name:tablet category:tablet

+name:tablet +category:tablet


Searching - Query Syntax (cont.)

+name:tablet +(manu:apple manu:samsung)

+name:tablet -manu:apple

+name:tablet +range:[300 TO 500]

+name:tablet manu:apple^5
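In this syntax a `+` prefix marks a clause that must match, a `-` prefix marks one that must not match, an unprefixed clause is optional, parentheses group alternatives, and `^` boosts a clause. A small helper, hypothetical and not part of Lucene or Solr, shows how such strings fit together:

```python
def build_query(required=(), optional=(), prohibited=()):
    """Join clauses using + (must match), - (must not match) or no prefix."""
    parts = ["+" + clause for clause in required]
    parts += list(optional)
    parts += ["-" + clause for clause in prohibited]
    return " ".join(parts)


def any_of(*clauses):
    """Group alternatives in parentheses so that any one of them may match."""
    return "(" + " ".join(clauses) + ")"


print(build_query(required=["name:tablet",
                            any_of("manu:apple", "manu:samsung")]))
# +name:tablet +(manu:apple manu:samsung)
print(build_query(required=["name:tablet"], prohibited=["manu:apple"]))
# +name:tablet -manu:apple
print(build_query(required=["name:tablet"], optional=["manu:apple^5"]))
# +name:tablet manu:apple^5
```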


EDisMax Parser

● suitable for user-generated queries
  o does not complain about the syntax
  o does not require field name in query
  o searches across several fields
● configurable
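A sketch of the parameters for an EDisMax search over several fields. The `qf` parameter lists the fields to search; the field names and the boost on `name` are assumptions for the example.

```python
from urllib.parse import parse_qs, urlencode

params = {
    "q": "galaxy tab",              # raw user input, no field names needed
    "defType": "edismax",           # select the EDisMax query parser
    "qf": "name^2 category manu",   # fields to search; name counts double
}
query_string = urlencode(params)

# The encoded string decodes back to the same parameter values.
decoded = parse_qs(query_string)
```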


Sorting

● default: sorting by decreasing score
● custom sorting rules: use the sort parameter
  o syntax: fieldName (asc|desc)
  o e.g. sort by ascending price (i.e. lowest price first): price asc
  o e.g. sort by descending date (i.e. newest date first): date desc


Sorting

● multiple fields and orders: separate by commas
  o e.g. sort by descending starRating and ascending price: starRating desc, price asc


Sorting

● cannot use multivalued fields
● overrides the default sorting behavior


Faceted Search

● facet values: (distinct) values or (generally non-overlapping) ranges of a field
● displaying facets
  o show possible values
  o let users narrow down their searches easily


[Screenshot: a facet and its facet values (5 of them)]


Faceted Search

● set facet parameter to true - enables faceting
● other parameters
  o facet.field - use the field's values as facets; return <value, count> pairs
  o facet.query - use the given queries as facets; return <query, count> pairs
  o facet.sort - set the ordering of the facets; can be "count" or "index"
  o facet.offset and facet.limit - used for pagination of facets
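What field faceting reports can be imitated with a counter. This is a simulation of the <value, count> pairs only, not Solr's implementation; the documents are invented, and breaking ties by value is an assumption of this sketch.

```python
from collections import Counter


def facet_field(docs, field, sort="count", offset=0, limit=100):
    """Return (value, count) pairs for one field, sorted and paginated."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    if sort == "count":
        # Highest count first; ties are broken by value (an assumption).
        pairs = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    else:
        # "index": order by the facet value itself.
        pairs = sorted(counts.items())
    return pairs[offset:offset + limit]


docs = [{"manu": "samsung"}, {"manu": "apple"},
        {"manu": "samsung"}, {"manu": "asus"}]

print(facet_field(docs, "manu"))
# [('samsung', 2), ('apple', 1), ('asus', 1)]
print(facet_field(docs, "manu", sort="index"))
# [('apple', 1), ('asus', 1), ('samsung', 2)]
print(facet_field(docs, "manu", offset=1, limit=1))
# [('apple', 1)]
```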


Spatial Search

● data: locations (longitudes, latitudes)
● search: filter and/or sort by location


Filter by Location

● geofilt
  o circle centered at a given point
  o distance from a given point
  o fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
● bbox
  o square (“bounding box”) centered at a given point
  o distance from a given point + corners
  o fq={!bbox sfield=store}&pt=45.15,-93.85&d=5

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
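The filter parameters above can be generated by a small helper. The function name is hypothetical; the parameter names (`fq`, `sfield`, `pt`, `d`) and the sample values come from the slide. A complete request would also carry a main query `q`.

```python
from urllib.parse import urlencode


def spatial_filter_params(filter_name, field, lat, lon, distance_km):
    """Return the fq, pt and d parameters for a geofilt or bbox filter."""
    if filter_name not in ("geofilt", "bbox"):
        raise ValueError("filter_name must be 'geofilt' or 'bbox'")
    return {
        "fq": "{!%s sfield=%s}" % (filter_name, field),
        "pt": "%s,%s" % (lat, lon),
        "d": str(distance_km),
    }


params = spatial_filter_params("geofilt", "store", 45.15, -93.85, 5)
print(params["fq"])  # {!geofilt sfield=store}

# urlencode escapes the braces, "!" and "=" for use in a URL.
query_string = urlencode(params)
```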


[Diagram: geofilt shown as a circle and bbox as a square, each centered at (45.15, -93.85) with a distance of 5 km]

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>


[Diagram: the same geofilt circle and bbox square (5 km, centered at (45.15, -93.85)), with locations marked o or x to show which ones each filter matches]

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
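The difference between the two filters can be illustrated with plain geometry. This is an approximation written for this example, not Solr's implementation: it uses the haversine formula with a mean Earth radius of 6371 km, and it models the box as latitude and longitude spans around the center.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_geofilt(point, center, d_km):
    """Circle test: the point lies within d_km of the center."""
    return haversine_km(point[0], point[1], center[0], center[1]) <= d_km


def in_bbox(point, center, d_km):
    """Box test: the point lies within the box that encloses the circle."""
    dlat = math.degrees(d_km / EARTH_RADIUS_KM)
    dlon = math.degrees(
        d_km / (EARTH_RADIUS_KM * math.cos(math.radians(center[0]))))
    return (abs(point[0] - center[0]) <= dlat
            and abs(point[1] - center[1]) <= dlon)


center = (45.15, -93.85)
corner = (45.15 + 0.04, -93.85 + 0.057)  # near a corner of the box

print(in_bbox(corner, center, 5))     # True
print(in_geofilt(corner, center, 5))  # False
```

A location near a corner of the box passes the bbox test but lies more than 5 km from the center, so it fails the geofilt test.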


Sort by Location

● geodist
  o returns the distance between the location given in a field and a certain coordinate
  o e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score: q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>


Scaling/Redundancy

problem                                      solution
collection too large for a single machine    distribution
too many requests for a single machine       distribution
a machine can go down                        replication


SolrCloud

● Solr instances
  o collection (logical index) divided into one or more partial collections (“shards”)
  o for each shard, one or more Solr instances keep copies of the data
    ■ one as leader - handles reads and writes
    ■ others as replicas - handle reads
● ZooKeeper instances


SolrCloud

● Solr instances
● ZooKeeper instances
  o management of Solr instances
  o leader election
  o node discovery
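A toy model of shards, leaders and replicas. It is a sketch under loud assumptions: in SolrCloud, leader election is coordinated through ZooKeeper, and documents are routed by Solr's own scheme; here a CRC32 hash modulo the shard count and a "first online node wins" rule stand in for both. Node names are invented.

```python
import zlib


class Shard:
    """One shard: a leader plus replicas (toy model, not Solr's logic)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.online = set(nodes)
        self.leader = self.nodes[0]  # the first node starts as leader

    def go_offline(self, node):
        self.online.discard(node)
        if node == self.leader:
            # Promote the first replica that is still online, if any.
            candidates = [n for n in self.nodes if n in self.online]
            self.leader = candidates[0] if candidates else None

    def come_online(self, node):
        # A returning node rejoins as a replica unless no leader exists.
        self.online.add(node)
        if self.leader is None:
            self.leader = node


def shard_for(doc_id, shard_count):
    """Pick a shard for a document (stand-in for Solr's routing)."""
    return zlib.crc32(doc_id.encode("utf-8")) % shard_count


shard = Shard(["node-a", "node-b", "node-c"])
shard.go_offline("node-a")   # the leader goes down
print(shard.leader)          # node-b
shard.come_online("node-a")  # the old leader returns as a replica
print(shard.leader)          # node-b
```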


[Diagram: a collection (i.e. logical index) divided into three shards, each holding ⅓ of the collection; every shard has one leader and several replicas]


[Diagram: the same three-shard collection (i.e. logical index), each shard holding ⅓ of the collection, with one more replica than before]


[Diagram: the same three-shard collection (i.e. logical index) after one shard's leader has gone offline; another instance in that shard is now the leader]


[Diagram: the same three-shard collection (i.e. logical index) after the offline instance has come back; it is now a replica, and the newer leader remains the leader]


Resources - Books

● Solr in Action
  o just released, up-to-date
  o http://www.manning.com/grainger/
● Apache Solr 4 Cookbook
  o common problems and useful tips
  o http://www.packtpub.com/apache-solr-4-cookbook/book
● Lucene in Action
  o written by 3 committers and PMC members
  o somewhat outdated (2010; covers Lucene 3.0)
  o http://www.manning.com/hatcher3/


Resources - Books

● Introduction to Information Retrieval
  o not specific to Lucene/Solr, but about IR concepts
  o free e-book
  o http://nlp.stanford.edu/IR-book/
● Managing Gigabytes
  o indexing, compression and other topics
  o accompanied by MG4J - a full-text search software
  o http://mg4j.di.unimi.it/


Resources - Web

● official website
  o http://lucene.apache.org/
  o Wiki
  o reference guide
  o mailing list
● StackOverflow
  o http://stackoverflow.com/
  o “Lucene” and “Solr” tags


Getting Started

● download Solr
  o requires Java 7 or newer to run
● Solr comes bundled/configured with Jetty
  o <Solr directory>/example/start.jar
● "exampledocs" directory contains sample documents
  o <Solr directory>/example/exampledocs/post.jar
  o java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml
● use the Solr admin interface
  o http://localhost:8983/solr/


Thanks for Coming!

● Java Performance Tips @ 10:15, same room
● slides available
  o http://bit.ly/sdcodecamp2014solr
● please vote for my conference session
  o http://bit.ly/tvnews2014
● questions/feedback
  o [email protected]
● questions?