Slides for my presentation at SoCal Code Camp, June 29, 2014 (http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6337660f-37de-4d6e-a5bc-46ba54478e5e)
Search Engine-Building with Lucene and Solr
Kai Chan
SoCal Code Camp, June 2014
http://bit.ly/sdcodecamp2014solr
all data
matched data
data that a user actually sees
Lucene
● full-text search library
● creates, updates and reads from the index
● takes queries and produces search results
● your application creates objects and calls methods in the Lucene API
● provides building blocks for custom features
Solr
● full-text search platform
● uses Lucene for indexing and search
● REST-like API over HTTP
● different output formats (e.g. XML, JSON)
● provides some features not built into Lucene
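[Diagram - Lucene: your application and the Lucene code libraries run together in one Java VM on one machine; the Lucene code reads and writes the index.]
[Diagram - Solr: a client talks over HTTP to Solr, which runs in a servlet container (e.g. Tomcat, Jetty) inside a Java VM; the Solr code uses the Lucene code libraries, which read and write the index.]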
How Data Are Organized
[Diagram: a collection contains documents; a document contains fields.]
Each field has:
● name (e.g. "title" or "price")
● content (e.g. "please read" or 30)
● type
● options
[Diagram: a collection of e-mail documents with fields such as subject, date, from, text and reply-to; not every document has every field.]
[Diagram: one collection holding documents with different sets of fields, e.g. (subject, date, from, text), (title, SKU, price, description) and (first name, last name, phone, address).]
Solr Field Definition
● field
  o name (e.g. "subject")
  o type (e.g. "text_general")
  o options (e.g. indexed="true" stored="true")
● field type
  o text: "string", "text_general"
  o numeric: "int", "long", "float", "double"
● options
  o indexed: content can be searched
  o stored: content can be returned at search-time
  o multivalued: a document can hold multiple values for the field
Solr Dynamic Field
● define field by naming convention
● "amount_i": int, indexed, stored
● "tag_ss": string, indexed, stored, multivalued
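The naming convention can be pictured as pattern matching on the field name. The sketch below is only a toy illustration of that idea, not how Solr implements it. The rule table `DYNAMIC_FIELDS` and the helper `resolve_field` are hypothetical names, and the two patterns are modeled on the "amount_i" and "tag_ss" examples above.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table modeled on the two examples above:
# (name pattern, properties a matching field would get)
DYNAMIC_FIELDS = [
    ("*_i", {"type": "int", "indexed": True, "stored": True, "multivalued": False}),
    ("*_ss", {"type": "string", "indexed": True, "stored": True, "multivalued": True}),
]


def resolve_field(name):
    """Return the properties of the first pattern that matches, or None."""
    for pattern, props in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return props
    return None


amount = resolve_field("amount_i")  # matches "*_i"
tags = resolve_field("tag_ss")      # matches "*_ss"
unknown = resolve_field("title")    # matches no pattern
```

A field such as "title" that matches no pattern would have to be defined explicitly.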
Solr Copy Field
● copy one or more fields into another field
● can be used to define a catch-all field
  o source: "title", "author", "content"
  o destination: "text"
  o searching the "text" field has the effect of searching all three source fields
Indexing - UpdateRequestHandler
● upload (POST) content or file to http://host:port/solr/update
● formats: XML, JSON, CSV
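A minimal sketch of such an upload, using only the Python standard library. It builds the POST request without sending it, so it runs without a Solr server. The helper name `build_update_request` is hypothetical, the field names "id", "title" and "price" are assumed to exist in the schema, and `commit=true` is added so the documents become searchable right away.

```python
import json
import urllib.request


def build_update_request(base_url, docs):
    """Build (but do not send) a POST request that uploads docs as JSON."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed example document; the field names must match your schema.
docs = [{"id": "doc1", "title": "please read", "price": 30}]
req = build_update_request("http://localhost:8983/solr", docs)

# To actually send it to a running Solr instance:
# urllib.request.urlopen(req)
```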
Indexing - DataImportHandler
● has its own config file (data-config.xml)
● imports data from various sources
  o RDBMS (JDBC)
  o e-mail (IMAP)
  o XML data locally (file) or remotely (HTTP)
● transformers
  o extract data (RegEx, XPath)
  o manipulate data (strip HTML tags)
Indexing - ExtractingRequestHandler
● allows indexing of different formats
  o e.g. PDF, MS Word, XML
● extracts text and metadata
● maps extracted text to the "content" field
● maps metadata to different fields
Searching - Basics
● send request to http://host:port/solr/select
● parameters
  o q - main query
  o fq - filter query
  o defType - query parser (e.g. lucene, edismax)
  o fl - fields to return
  o sort - sort criteria
  o wt - response writer (e.g. xml, json)
  o indent - set to true for pretty-printing
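http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json
● http://localhost:8983/solr/select - search handler's URL
● q=title:tablet - main query
● fl=title,price,inStock - fields to return
● sort=price+asc - sort criteria
● wt=json - response writer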
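A small sketch of assembling such a search URL in code, so that special characters get escaped for you. The helper name `build_select_url` is hypothetical; the parameter names are the ones listed above, and the host and port are the defaults of the bundled example server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, **params):
    """Append URL-encoded search parameters to the search handler's URL."""
    return base_url + "/select?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr",
    q="title:tablet",          # main query
    fl="title,price,inStock",  # fields to return
    sort="price asc",          # sort criteria
    wt="json",                 # response writer
    indent="true",             # pretty-print the response
)

# Decode the query string again to check what Solr would receive.
decoded = parse_qs(urlsplit(url).query)
```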
Searching - Query Syntax
name:tablet
name:"galaxy tab"
name:tablet category:tablet
+name:tablet +category:tablet
Searching - Query Syntax (cont.)
+name:tablet +(manu:apple manu:samsung)
+name:tablet -manu:apple
+name:tablet +range:[300 TO 500]
+name:tablet manu:apple^5
EDisMax Parser
● suitable for user-generated queries
  o does not complain about the syntax
  o does not require field name in query
  o searches across several fields
● configurable
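A minimal sketch of the request parameters for an EDisMax search, assuming the fields "name", "category" and "manu" exist in the schema. `qf` (query fields) is the EDisMax parameter that lists the fields to search, optionally with a boost; it is not shown on the slides. The helper name `edismax_params` is hypothetical.

```python
from urllib.parse import urlencode


def edismax_params(user_query, fields):
    """Parameters for passing a raw user query to the EDisMax parser."""
    return {
        "defType": "edismax",    # select the query parser
        "q": user_query,         # raw user input, no field names needed
        "qf": " ".join(fields),  # fields to search across
    }


# "name^5" boosts matches in the name field (assumed field names).
params = edismax_params("galaxy tab", ["name^5", "category", "manu"])
query_string = urlencode(params)
```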
Sorting
● default: sorting by decreasing score
● custom sorting rules: use the sort parameter
  o syntax: fieldName (asc|desc)
  o e.g. sort by ascending price (i.e. lowest price first): price asc
  o e.g. sort by descending date (i.e. newest date first): date desc
● multiple fields and orders: separate by commas
  o e.g. sort by descending starRating and ascending price: starRating desc, price asc
● cannot use multivalued fields
● overrides the default sorting behavior
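The sort syntax is easy to generate from a list of (field, order) pairs. The sketch below does that and rejects anything other than asc or desc. The helper name `build_sort` is hypothetical, and the field names are assumed to be single-valued fields in the schema.

```python
def build_sort(criteria):
    """Turn [(field, order), ...] into a value for the sort parameter."""
    parts = []
    for field, order in criteria:
        if order not in ("asc", "desc"):
            raise ValueError("order must be 'asc' or 'desc', got %r" % order)
        parts.append("%s %s" % (field, order))
    return ", ".join(parts)


# Best-rated first; among equal ratings, cheapest first.
sort_value = build_sort([("starRating", "desc"), ("price", "asc")])
```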
Faceted Search
● facet values: (distinct) values or (generally non-overlapping) ranges of a field
● displaying facets
  o show possible values
  o let users narrow down their searches easily
[Figure: a facet and its facet values (5 of them)]
Faceted Search
● set facet parameter to true - enables faceting
● other parameters
  o facet.field - use the field's values as facets; returns <value, count> pairs
  o facet.query - use the given queries as facets; returns <query, count> pairs
  o facet.sort - set the ordering of the facets; can be "count" or "index"
  o facet.offset and facet.limit - used for pagination of facets
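A sketch of a faceting request and of reading the counts back. It assumes a field named "manu" and the JSON response writer, which by default returns a field's facets as one flat list of alternating values and counts. The helper names `facet_params` and `facet_pairs` are hypothetical, and `sample_response` is made-up data shaped like that part of a response.

```python
from urllib.parse import urlencode


def facet_params(query, field, limit=10, offset=0):
    """Parameters for a search that also returns facet counts for one field."""
    return [
        ("q", query),
        ("wt", "json"),
        ("facet", "true"),            # enable faceting
        ("facet.field", field),       # use this field's values as facets
        ("facet.sort", "count"),      # most frequent values first
        ("facet.limit", str(limit)),  # page size
        ("facet.offset", str(offset)),
    ]


def facet_pairs(flat):
    """Turn [value, count, value, count, ...] into [(value, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


query_string = urlencode(facet_params("name:tablet", "manu"))

# Made-up excerpt of a response, for illustration only.
sample_response = {
    "facet_counts": {
        "facet_fields": {"manu": ["samsung", 4, "apple", 3, "asus", 1]}
    }
}
pairs = facet_pairs(sample_response["facet_counts"]["facet_fields"]["manu"])
```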
Spatial Search
● data: locations (longitudes, latitudes)
● search: filter and/or sort by location
Filter by Location
● geofilt
  o circle centered at a given point
  o distance from a given point
  o fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
● bbox
  o square ("bounding box") centered at a given point
  o distance from a given point + corners
  o fq={!bbox sfield=store}&pt=45.15,-93.85&d=5
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
[Figure: geofilt vs. bbox, both with a distance of 5 km around (45.15, -93.85). geofilt matches only the points inside the circle; bbox matches the points inside the square around that circle, including points near the corners that lie outside the circle.]
Sort by Location
● geodist
  o returns the distance between the location given in a field and a certain coordinate
  o e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score: q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
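The spatial parameters above can be put together in code as follows. This is a sketch that only builds parameter dictionaries. The helper names `spatial_filter` and `sort_by_distance` are hypothetical, and "store" is the location field used in the examples above.

```python
from urllib.parse import urlencode


def spatial_filter(kind, field, lat, lon, km):
    """Filter parameters for a geofilt (circle) or bbox (square) search."""
    if kind not in ("geofilt", "bbox"):
        raise ValueError("kind must be 'geofilt' or 'bbox'")
    return {
        "fq": "{!%s sfield=%s}" % (kind, field),
        "pt": "%s,%s" % (lat, lon),  # center point
        "d": str(km),                # distance in km
    }


def sort_by_distance(field, lat, lon):
    """Score each document by its distance from the point, nearest first."""
    return {
        "q": "{!func}geodist()",
        "sfield": field,
        "pt": "%s,%s" % (lat, lon),
        "sort": "score asc",
    }


# "*:*" matches all documents, so only the filter decides what is returned.
circle = dict(q="*:*", **spatial_filter("geofilt", "store", 45.15, -93.85, 5))
nearest = sort_by_distance("store", 45.15, -93.85)
query_string = urlencode(nearest)
```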
Scaling/Redundancy
problem - solution
● collection too large for a single machine - distribution
● too many requests for a single machine - distribution
● a machine can go down - replication
SolrCloud
● Solr instances
  o collection (logical index) divided into one or more partial collections ("shards")
  o for each shard, one or more Solr instances keep copies of the data
    o one as leader - handles reads and writes
    o others as replicas - handle reads
● ZooKeeper instances
  o management of Solr instances
  o leader election
  o node discovery
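The leader and replica roles can be pictured with a toy model of a single shard. This is only an illustration of the roles described above. It does not reflect how SolrCloud and ZooKeeper actually elect a leader; here, as a simplifying assumption, the first live node in a list is treated as the leader. The class `Shard` and the node names are hypothetical.

```python
class Shard:
    """Toy model of one shard: the first live node acts as leader."""

    def __init__(self, name, nodes):
        self.name = name
        self.live = list(nodes)  # live nodes; index 0 is the leader
        self.offline = []

    @property
    def leader(self):
        return self.live[0] if self.live else None

    @property
    def replicas(self):
        return self.live[1:]

    def go_offline(self, node):
        """Take a node down; if it was the leader, the next live node leads."""
        self.live.remove(node)
        self.offline.append(node)

    def come_back(self, node):
        """Bring a node back; it rejoins as a replica, not as leader."""
        self.offline.remove(node)
        self.live.append(node)


shard = Shard("shard 2", ["node-a", "node-b"])
first_leader = shard.leader      # node-a leads, node-b is a replica
shard.go_offline("node-a")
second_leader = shard.leader     # node-b takes over
shard.come_back("node-a")
final_replicas = shard.replicas  # node-a is back, now as a replica
```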
[Diagram: a collection (i.e. logical index) split into three shards, each holding ⅓ of the collection; every shard has one leader and one or more replicas. When a shard's leader goes offline, one of its replicas becomes the new leader; when the former leader comes back, it rejoins as a replica.]
Resources - Books
● Solr in Action
  o just released, up-to-date
  o http://www.manning.com/grainger/
● Apache Solr 4 Cookbook
  o common problems and useful tips
  o http://www.packtpub.com/apache-solr-4-cookbook/book
● Lucene in Action
  o written by 3 committers and PMC members
  o somewhat outdated (2010; covers Lucene 3.0)
  o http://www.manning.com/hatcher3/
Resources - Books
● Introduction to Information Retrieval
  o not specific to Lucene/Solr, but about IR concepts
  o free e-book
  o http://nlp.stanford.edu/IR-book/
● Managing Gigabytes
  o indexing, compression and other topics
  o accompanied by MG4J - full-text search software
  o http://mg4j.di.unimi.it/
Resources - Web
● official website
  o http://lucene.apache.org/
  o Wiki
  o reference guide
  o mailing list
● StackOverflow
  o http://stackoverflow.com/
  o "Lucene" and "Solr" tags
Getting Started
● download Solr
  o requires Java 7 or newer to run
● Solr comes bundled/configured with Jetty
  o <Solr directory>/example/start.jar
● "exampledocs" directory contains sample documents
  o <Solr directory>/example/exampledocs/post.jar
  o java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml
● use the Solr admin interface
  o http://localhost:8983/solr/
Thanks for Coming!
● Java Performance Tips @ 10:15, same room
● slides available
  o http://bit.ly/sdcodecamp2014solr
● please vote for my conference session
  o http://bit.ly/tvnews2014
● questions/feedback
  o kai@ssc.ucla.edu
● questions?