Lucene and Solr


Page 1: Lucene and Solr

8/13/2019 Lucene and Solr

http://slidepdf.com/reader/full/lucene-and-solr 1/24

Lucene and Solr


Lucene

Doug Cutting
◦ Created in 1999
◦ Donated to Apache in 2001

Features
◦ Highly scalable
◦ Java (1.4)
◦ Ports to many other languages
◦ No crawler
◦ No document parsing
◦ No "PageRank"


Lucene

Powered by Lucene
◦ IBM Omnifind Y! Edition
◦ Technorati
◦ Wikipedia
◦ Internet Archive
◦ LinkedIn
◦ monster.com


Indexing

Logical structure
◦ Index is a collection of documents
◦ Documents are a collection of fields
◦ Fields are the content
  Stored – Stored verbatim for retrieval with results
  Indexed – Tokenized and made searchable
◦ Indexed terms stored in inverted index

Physical structure
◦ Multiple documents (with all fields) stored in segments
  mergeFactor
◦ All segments together make up the index

IndexWriter is interface object for entire index


Indexing

Documents
0  Little Red Riding Hood
1  Robin Hood
2  Little Women

Term      Document ids
aardvark
hood      0, 1
little    0, 2
red       0
riding    0
robin     1
women     2
zoo
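The inverted index in this diagram can be reproduced with a few lines of Python. This is a minimal sketch, assuming plain whitespace tokenization and lowercasing; `build_inverted_index` is a hypothetical helper for illustration, not a Lucene API.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: tokens are whitespace-separated and case-insensitive
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Keep terms in alphabetical order, as a term dictionary would
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = ["Little Red Riding Hood", "Robin Hood", "Little Women"]
index = build_inverted_index(docs)
for term, ids in index.items():
    print(term, ids)
# hood [0, 1]
# little [0, 2]
# red [0]
# riding [0]
# robin [1]
# women [2]
```

Looking up a term is then a dictionary access, for example `index["hood"]`, which returns the ids of both documents whose title contains "Hood".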


Indexing

Analysis
◦ Extract tokens from text (tokenizer)
  Whitespace
  Hyphens
◦ Manipulate or modify tokens (token filter)
  Stemming
  Removal
◦ Tokenizer / Token Filter chains are called analyzers
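The tokenizer / token filter chain can be pictured as simple function composition. The following is a rough sketch, not Lucene's analysis API: `make_analyzer`, the three stage functions and the stop-word list are all hypothetical names chosen for illustration.

```python
def whitespace_tokenizer(text):
    """Tokenizer: extract tokens by splitting on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: modify each token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens, stop_words=frozenset({"a", "an", "the", "of"})):
    """Token filter: remove tokens (assumed, tiny stop-word list)."""
    return [t for t in tokens if t not in stop_words]


def make_analyzer(tokenizer, *filters):
    """An analyzer is one tokenizer followed by zero or more token filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


analyzer = make_analyzer(whitespace_tokenizer, lowercase_filter, stop_filter)
print(analyzer("The Hound of the Baskervilles"))  # ['hound', 'baskervilles']
```

The order of the filters matters: here the stop filter only works because lowercasing runs before it.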


Indexing

LexCorp BFG-9000
  WhitespaceTokenizer
LexCorp | BFG-9000
  WordDelimiterFilter catenateWords=1
Lex | Corp | LexCorp | BFG | 9000
  LowercaseFilter
lex | corp | lexcorp | bfg | 9000


Searching

Query Creation
◦ Query parser
◦ Manual query construction from terms
◦ title:"Bell" author:"Hemingway"^3.0

Query terms are analyzed
◦ Same analyzer for indexing and searching on each field


Searching

Document
LexCorp BFG-9000
  WhitespaceTokenizer
LexCorp | BFG-9000
  WordDelimiterFilter catenateWords=1
Lex | Corp | LexCorp | BFG | 9000
  LowercaseFilter
lex | corp | lexcorp | bfg | 9000

Query
Lex corp bfg9000
  WhitespaceTokenizer
Lex | corp | bfg9000
  WordDelimiterFilter catenateWords=0
Lex | corp | bfg | 9000
  LowercaseFilter
lex | corp | bfg | 9000

A Match!
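The match on this slide can be traced with a simplified imitation of the three stages. This is only a sketch under stated assumptions: the `word_delimiter_filter` below splits on case changes, letter/digit boundaries and punctuation using a regular expression, and it joins all alphabetic parts when `catenate_words` is set. Solr's real WordDelimiterFilter has more options and finer rules.

```python
import re


def whitespace_tokenizer(text):
    return text.split()


def word_delimiter_filter(tokens, catenate_words=False):
    """Simplified stand-in for WordDelimiterFilter (not Solr's implementation)."""
    out = []
    for token in tokens:
        # Sub-words: runs of capitals, capitalized/lowercase words, digit runs
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", token)
        out.extend(parts)
        words = [p for p in parts if p.isalpha()]
        if catenate_words and len(words) > 1:
            # catenateWords=1: also emit the joined word parts
            out.append("".join(words))
    return out


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def analyze(text, catenate_words):
    tokens = whitespace_tokenizer(text)
    tokens = word_delimiter_filter(tokens, catenate_words)
    return lowercase_filter(tokens)


indexed = analyze("LexCorp BFG-9000", catenate_words=True)
query = analyze("Lex corp bfg9000", catenate_words=False)
matches = all(term in indexed for term in query)

print(indexed)  # ['lex', 'corp', 'lexcorp', 'bfg', '9000']
print(query)    # ['lex', 'corp', 'bfg', '9000']
print(matches)  # True
```

Every query token is present among the indexed tokens, which is why differently written input ("Lex corp bfg9000") still finds the document.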


Searching

Many query types
◦ Term
◦ Phrase "bad wolf"
◦ Proximity "quick fox"~4
◦ Wildcard / Prefix
  pla?e (plate or place or plane)
  practic* (practice or practical or practically)
◦ Fuzzy (edit distance)
  planting~0.75 (granting or planning)
  roam~ (default is 0.5)
◦ Range
  date:[05072007 TO 05232007] (inclusive)
  author:{king TO mason} (exclusive)
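The matching rules behind three of these query types can be sketched in plain Python. This illustrates the semantics only, under the assumption that `?` stands for one character and `*` for any run of characters. It does not use Lucene's query classes, and how Lucene turns an edit distance into the similarity value after `~` is not shown.

```python
from fnmatch import fnmatchcase


def edit_distance(a, b):
    """Levenshtein distance, the measure that fuzzy queries are based on."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def in_range(value, low, high, inclusive):
    """[low TO high] includes the end points, {low TO high} excludes them."""
    return low <= value <= high if inclusive else low < value < high


terms = ["place", "plane", "plate", "please", "practical", "practice"]
wildcard_hits = [t for t in terms if fnmatchcase(t, "pla?e")]
prefix_hits = [t for t in terms if fnmatchcase(t, "practic*")]

print(wildcard_hits)                          # ['place', 'plane', 'plate']
print(prefix_hits)                            # ['practical', 'practice']
print(edit_distance("planting", "planning"))  # 1
print(edit_distance("planting", "granting"))  # 2
print(in_range("king", "king", "mason", inclusive=True))   # True
print(in_range("king", "king", "mason", inclusive=False))  # False
```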


Searching

Multiple searchers at once
◦ Thread safe

Additions or deletions to index are not reflected in already open searchers
◦ Must be closed and reopened

Use commit or optimize on IndexWriter
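Why an open searcher does not see later changes can be pictured with a toy model in which each searcher holds the snapshot that existed when it was opened. `ToyIndex` and `ToySearcher` are hypothetical classes written for this illustration; they are not Lucene's IndexWriter or IndexSearcher.

```python
class ToySearcher:
    """Sees only the snapshot it was opened on."""

    def __init__(self, snapshot):
        self._snapshot = snapshot

    def count(self):
        return len(self._snapshot)


class ToyIndex:
    """Toy writer: additions become visible to new searchers after commit()."""

    def __init__(self):
        self._pending = []
        self._committed = ()  # immutable snapshot shared with searchers

    def add(self, doc):
        self._pending.append(doc)

    def commit(self):
        # Build a new snapshot; already open searchers keep the old one
        self._committed = self._committed + tuple(self._pending)
        self._pending = []

    def open_searcher(self):
        return ToySearcher(self._committed)


index = ToyIndex()
index.add("Little Women")
index.commit()
old_searcher = index.open_searcher()

index.add("Robin Hood")
index.commit()
new_searcher = index.open_searcher()

print(old_searcher.count())  # 1 (opened before the second commit)
print(new_searcher.count())  # 2 (reopened after the commit)
```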


Lucene Sub-projects

Nutch

◦ Web crawler with document parsing

Hadoop

◦ Distributed data processor

◦ Implements MapReduce

Solr


Solr

Yonik Seeley
◦ Developed at CNET
◦ Donated to Apache in 2006

Features
◦ Servlet
◦ Web Administration Interface
◦ XML/HTTP, JSON Interfaces
◦ Faceting
◦ Schema to define types and fields
◦ Highlighting
◦ Caching
◦ Index Replication (Master / Slaves)
◦ Pluggable
◦ Java 5


Solr

Powered by Solr
◦ Netflix
◦ CNET
◦ Smithsonian
◦ AOL: sports and music
◦ RightNow ??
◦ Drupal module
◦ GameSpot


Configuration (solrconfig.xml)

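<mainIndex>
  <!-- false: keep each segment as separate files instead of one compound file -->
  <useCompoundFile>false</useCompoundFile>
  <!-- number of segments that accumulate before they are merged -->
  <mergeFactor>10</mergeFactor>
  <!-- documents buffered in memory before a new segment is written -->
  <maxBufferedDocs>1000</maxBufferedDocs>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <!-- maximum number of terms indexed per field -->
  <maxFieldLength>10000</maxFieldLength>
</mainIndex>

<requestHandler name="standard" class="solr.StandardRequestHandler" />
<requestHandler name="custom" class="your.package.CustomRequestHandler" />

<!-- commit automatically after maxDocs pending documents or maxTime milliseconds -->
<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>1000</maxTime>
</autoCommit>

<queryResponseWriter name="xml" class="org.apache.solr.request.XMLResponseWriter" default="true"/>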


Schema (schema.xml)

Fields

<uniqueKey>id</uniqueKey>

<field name="products" type="text" indexed="true" stored="true"/>
<field name="keywords" type="text_ws" indexed="true" stored="true"/>
<field name="keywordsSorted" type="text_sorted" indexed="true" stored="false"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>

<dynamicField name="*_i" type="integer" indexed="true" stored="true"/>
<dynamicField name="desc_*" type="string" indexed="true" stored="false"/>

<copyField source="keywords" dest="keywordsSorted"/>
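How a field name is resolved against explicit and dynamic fields can be sketched as a lookup followed by glob matching. This is a simplified model, assuming explicit fields are checked first and dynamic patterns are tried in the listed order; `resolve_type` is a hypothetical helper, and Solr's own rules for choosing between overlapping patterns are not modelled.

```python
from fnmatch import fnmatchcase

# Field names and types copied from the schema fragment above
FIELDS = {
    "products": "text",
    "keywords": "text_ws",
    "keywordsSorted": "text_sorted",
    "timestamp": "date",
}
DYNAMIC_FIELDS = [("*_i", "integer"), ("desc_*", "string")]


def resolve_type(name):
    """Return the field type for a field name, or None if nothing matches."""
    if name in FIELDS:
        return FIELDS[name]
    for pattern, type_name in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return type_name
    return None


print(resolve_type("products"))    # text
print(resolve_type("count_i"))     # integer
print(resolve_type("desc_short"))  # string
print(resolve_type("unknown"))     # None
```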


Schema

Analyzers

<fieldtype name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>

<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>

<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
  </analyzer>
</fieldtype>


Insertion

◦ HTTP POST to http://localhost:8983/solr/update/

<add>
  <doc>
    <field name="employeeId">05991</field>
    <field name="office">Bridgewater</field>
    <field name="skills">Perl</field>
    <field name="skills">Java</field>
  </doc>
  [<doc> ... </doc>[<doc> ... </doc>]]
</add>

Documents or fields can have boosts attached
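An `<add>` message like the one above could be built and posted from a small client script. The sketch below uses only the Python standard library; `build_add_message` and `build_update_request` are hypothetical helpers, the Content-Type header is an assumption, and the request is prepared but not sent, since that needs a running Solr instance at the URL from the slide.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """docs: list of documents, each a list of (field name, value) pairs."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields:
            # Repeating a field name yields a multi-valued field (skills)
            ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


def build_update_request(xml_body, url="http://localhost:8983/solr/update/"):
    """Prepare (but do not send) the HTTP POST carrying the XML message."""
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},  # assumed header
        method="POST",
    )


message = build_add_message([[
    ("employeeId", "05991"),
    ("office", "Bridgewater"),
    ("skills", "Perl"),
    ("skills", "Java"),
]])
request = build_update_request(message)
print(message)
# To send it: urllib.request.urlopen(request)
```

Building the XML with ElementTree rather than string concatenation means field values containing `<` or `&` are escaped automatically.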


Update / Delete

Inserting a document with an already present uniqueKey will erase the original

Deleting
◦ By uniqueKey field
  <delete><id>05991</id></delete>
◦ By query
  <delete><query>name:Anthony</query></delete>

<commit/>
<optimize/>
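The delete and commit messages are small enough to generate on the fly. A minimal sketch with the standard library follows; the helper names are hypothetical, and each resulting string would be posted to the update URL in the same way as an `<add>` message.

```python
import xml.etree.ElementTree as ET


def delete_by_id(unique_key):
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = unique_key
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


def command(name):
    """Empty-element commands such as commit or optimize."""
    return ET.tostring(ET.Element(name), encoding="unicode")


print(delete_by_id("05991"))            # <delete><id>05991</id></delete>
print(delete_by_query("name:Anthony"))  # <delete><query>name:Anthony</query></delete>
print(command("commit"))                # <commit />
print(command("optimize"))              # <optimize />
```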


Search

Faceting
◦ Available in StandardRequestHandler and DisMaxRequestHandler


Search

http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.field=cat&facet.mincount=1&facet.field=inStock

<response>
  <responseHeader>
    <status>0</status>
    <QTime>3</QTime>
  </responseHeader>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="cat">
        <int name="music">1</int>
        <int name="connector">2</int>
        <int name="electronics">3</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>
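A client can assemble the facet request and read the counts out of the XML response with the standard library alone. The sketch below is one possible way to do it, not a Solr client library: `parse_facet_fields` is a hypothetical helper, and the response text is a trimmed copy of the example above instead of a live HTTP call.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# A list of pairs (not a dict) so that facet.field can appear twice
params = [
    ("q", "ipod"), ("rows", "0"), ("facet", "true"), ("facet.limit", "-1"),
    ("facet.field", "cat"), ("facet.mincount", "1"), ("facet.field", "inStock"),
]
url = "http://localhost:8983/solr/select?" + urlencode(params)

RESPONSE = """<response>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="cat">
        <int name="music">1</int>
        <int name="connector">2</int>
        <int name="electronics">3</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>"""


def parse_facet_fields(xml_text):
    """Return {field name: {facet value: count}} from a facet response."""
    root = ET.fromstring(xml_text)
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    return {
        field.get("name"): {e.get("name"): int(e.text) for e in field.findall("int")}
        for field in root.findall(path)
    }


facets = parse_facet_fields(RESPONSE)
print(url)
print(facets["cat"])      # {'music': 1, 'connector': 2, 'electronics': 3}
print(facets["inStock"])  # {'false': 3, 'true': 1}
```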


Many more features

Replication
◦ Master / Slave architecture for load balancing and backups

More-like-this

Easy to add RequestHandlers and ResponseWriters

Responses in many formats

Hit highlighting


Sources

http://lucene.apache.org/

http://lucene.apache.org/solr/

http://people.apache.org/~yonik/presentations/