Dev8d Apache Solr Tutorial

Small wins in a small time with Apache Solr

= Upayavira =

Who am I?

My (Buddhist) name is Upayavira

Consultant with Sourcesense, specialising in search and operational technologies

A member of the Apache Software Foundation

Who are Sourcesense?

Open Source integrator, specialising in:

Search

Business Intelligence

Content Management

Application Lifecycle Management

Offices in London, Amsterdam, Milan and Rome

Committers and Contributors

Search:

Lucene/Solr – contributor

Hibernate Search – committer

Lucene Infinispan integration – lead developer

Apache UIMA – committer

CMS:

Apache Chemistry – contributor

Apache Jackrabbit – contributor

JBoss GateIn Portal – committer

OpenSSO-Alfresco – contributor

What is Lucene?

Lucene is a Java information retrieval library

Provides free text search facilities

Started in 2000, by Doug Cutting

A project of the Apache Software Foundation

It is designed to be embedded in Java apps

What is Solr?

Solr is an enterprise search server based on Lucene

Wraps Lucene with a RESTful web interface

Provides configurable schema

Provides replication functionality

Solr Design

Solr instance

UpdateRequestHandler

SearchHandler

User queries

Lucene index

Content application

Prerequisites

Java, preferably Java 6

Latest Apache Solr, currently 3.3

http://www.sourcesense.com/dev8d-solr.zip

Prerequisites

Extract your Solr distribution

At a command prompt:

cd into the unzipped distribution directory

cd into the example directory

Enter: java -jar start.jar

Visit http://localhost:8983/solr/ in a browser. If you see a welcome message, your Solr works

Unpack your dev8d-solr.zip file

At another command prompt, cd into your dev8d-solr directory

Checking Solr Works

Visit http://localhost:8983/solr/admin/

You should see the Solr admin page.

Click the Statistics link

You'll see NumDocs: 0

There's nothing in the index, so searches won't show much

So we need to index some sample content

Indexing Sample Content

In your dev8d-solr directory (extracted from the zip), at a command prompt:

java -jar post.jar wikipedia-basic.xml
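
The post.jar tool sends XML update messages to Solr over HTTP. As a rough sketch of the same idea from a script, the Python below builds an <add> message and posts it to the update handler. It assumes the example server is running on localhost:8983; the helper names and the "id" and "title" field names in the usage note are illustrative, not taken from the sample files.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr

# Assumption: the example Solr server started earlier with "java -jar start.jar".
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_xml(doc):
    """Render one document as a Solr <add> message.

    A list value is written as repeated <field> elements, which is how
    multiValued fields are sent in the XML update format.
    """
    parts = ["<add><doc>"]
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            parts.append("<field name=%s>%s</field>" % (quoteattr(name), escape(str(item))))
    parts.append("</doc></add>")
    return "".join(parts)


def post_xml(xml, url=SOLR_UPDATE_URL):
    """POST an XML update message to Solr and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With Solr running, post_xml(build_add_xml({"id": "1", "title": "Example"})) followed by post_xml("<commit/>") should make the document searchable, provided those fields exist in your schema.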

Searching

http://localhost:8983/solr/select?q=*:*

Searching

http://localhost:8983/solr/select?q=computers

Searching

http://localhost:8983/solr/select?q=computer systems

Searching

http://localhost:8983/solr/select?q=computers OR systems

Searching

http://localhost:8983/solr/select?q=computers AND systems

Searching

http://localhost:8983/solr/select?q="computer systems"

Searching

http://localhost:8983/solr/select?q="computer systems"~10

Searching

http://localhost:8983/solr/select?q=computers NOT data

Searching

http://localhost:8983/solr/select?q=computers -data

Searching

http://localhost:8983/solr/select/?q=computers&fl=title

Searching

http://localhost:8983/solr/select/?q=computers&fq=author:yobot

Searching

http://localhost:8983/solr/select/?q=computers&fq=author:yobot&fl=title,author

Searching

http://localhost:8983/solr/select/?q=computers&rows=10&start=10&fl=title

Searching

http://localhost:8983/solr/select/?q=title:system&fl=title

Searching

http://localhost:8983/solr/select/?q=computers&fl=title,author&sort=author+desc
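
A browser quietly encodes spaces and quotes in the URLs above; a program calling Solr has to do it explicitly. The sketch below builds select URLs with the same parameters (q, fl, fq, rows, start, sort). The select_url helper and its page argument are illustrative names, and the base URL assumes the example server.

```python
from urllib.parse import urlencode

# Assumption: the example Solr server on its default port.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def select_url(q, page=None, rows=10, **params):
    """Build a /select URL, URL-encoding every parameter value.

    page is 1-based; when given, it is translated into Solr's
    rows/start parameters (start is the zero-based offset).
    Extra keyword arguments (fl, fq, sort, ...) are passed through as-is.
    """
    query = {"q": q}
    if page is not None:
        if page < 1:
            raise ValueError("page numbers start at 1")
        query["rows"] = rows
        query["start"] = (page - 1) * rows
    query.update(params)
    return SOLR_SELECT_URL + "?" + urlencode(query)
```

For example, select_url("computers", page=2, fl="title") asks for the second page of ten results, and select_url("computers", fl="title,author", sort="author desc") sorts by author in descending order.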

Advanced Searching

http://localhost:8983/solr/select/?q=computers&facet=true&facet.field=author

Advanced Searching

http://localhost:8983/solr/select/?q=computers&facet=true&facet.field=author&rows=0&facet.sort=lex

Advanced Searching

http://localhost:8983/solr/select/?q=computers&facet=true&facet.field=author&rows=0&facet.sort=count

Advanced Searching

http://localhost:8983/solr/select/?q=computers&facet=true&facet.field=author&rows=0&facet.sort=count&facet.mincount=2

Advanced Searching

http://localhost:8983/solr/select/?q=computers&facet=true&facet.field=author&rows=0&facet.sort=count&facet.limit=3

Advanced Searching

http://localhost:8983/solr/select/?q=computers&facet=true&facet.field=author&rows=0&facet.sort=count&facet.limit=3&debugQuery=true

Advanced Searching

http://localhost:8983/solr/select?q=computer&wt=json
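
The JSON writer makes facet results easy to consume from code. With the default JSON settings, Solr returns each facet field as a flat list that alternates value and count. The sketch below pairs them up; the facet_counts helper name is illustrative, and the layout it expects is the default one (no json.nl override).

```python
import json


def facet_counts(response_text, field):
    """Return [(value, count), ...] for one facet field of a wt=json response.

    Assumes the default flat layout, e.g. ["yobot", 3, "other", 1],
    where values sit at even positions and counts at odd positions.
    """
    data = json.loads(response_text)
    flat = data["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))
```

Pass it the body of a request such as q=computers&facet=true&facet.field=author&rows=0&wt=json and the field name "author".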

Advanced Searching

http://localhost:8983/solr/select?q=computer&wt=javabin

Advanced Searching

http://localhost:8983/solr/select?q=computer&hl=true&hl.fl=text

Advanced Searching

Look for the highlighting list after the main response

Nothing there.

Edit the 'text' field in schema.xml, changing it to stored="true"

Restart Solr, then reindex (java -jar post.jar wikipedia-enhanced.xml)

Advanced Searching

http://localhost:8983/solr/select?q=computer&hl=true&hl.fl=text

You should now see highlighted content

Advanced Searching

http://localhost:8983/solr/select?q=computer&hl=true&hl.fl=text&hl.simple.pre=<b>&hl.simple.post=</b>
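
Highlighted snippets come back in a separate "highlighting" section after the main results, keyed by each document's unique key. A client usually wants them next to the matching document. The sketch below does that for a wt=json response; the attach_snippets name is illustrative, and it assumes the unique key field is called "id", which you should check against your schema.xml.

```python
import json


def attach_snippets(response_text, key_field="id", hl_field="text"):
    """Pair each returned document with its highlighted snippets, if any.

    Returns a list of (doc, snippets) tuples; snippets is an empty list
    when Solr produced no highlighting for that document.
    """
    data = json.loads(response_text)
    highlighting = data.get("highlighting", {})
    results = []
    for doc in data["response"]["docs"]:
        per_doc = highlighting.get(str(doc[key_field]), {})
        results.append((doc, per_doc.get(hl_field, [])))
    return results
```

Use it on the body of a request such as q=computer&hl=true&hl.fl=text&wt=json.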

Indexing

Load wikipedia-basic.xml into a text editor or web browser

Load wikipedia-enhanced.xml into a text editor or browser

Load example/solr/conf/schema.xml into a text editor

Indexing

schema.xml defines field types and fields used in Solr

Equivalent to your database schema in an RDBMS

Indexing

Change the "category" field in schema.xml to be of type "string" and add multiValued="true":

<field name="category" type="string" indexed="true" stored="true" multiValued="true"/>

Indexing

Now add this to the <fields> section of schema.xml:

<field name="source" type="string" indexed="true" stored="true" multiValued="false"/>

<field name="text_general" type="text_general" indexed="true" stored="true" multiValued="true"/>

Now search for the “text_general” field type definition, further up in the file.

Indexing

Near the bottom of schema.xml, inside the <schema> element, add the following:

<copyField source="text" dest="text_general"/>

Indexing

In your window where Solr is running, press CTRL+C to stop Solr, and then restart it with:

java -jar start.jar

Indexing

At your command prompt, in the dev8d-solr directory, execute:

java -jar post.jar wikipedia-enhanced.xml

More Advanced Searching

http://localhost:8983/solr/select?q=computer%20AND%20babbage&facet=true&facet.field=category&facet.mincount=1

More Advanced Searching

http://localhost:8983/solr/terms?terms.fl=text&terms=true&terms.limit=20

More Advanced Searching

http://localhost:8983/solr/terms?terms.fl=text_general&terms=true&terms.limit=20

More Advanced Searching

http://localhost:8983/solr/terms?terms.fl=text_general&terms=true&terms.limit=20&terms.prefix=at

Indexing

Index segmentation: merge factor

Index optimisation: <optimize/>
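
To get a feel for what the merge factor does, here is a deliberately simplified model, not Lucene's actual merge policy: every flush writes one new segment, and whenever merge_factor segments of the same level exist they are merged into one segment of the next level. The function name and the model itself are assumptions made for illustration.

```python
def segments_after(flushes, merge_factor=10):
    """Count segments after a number of flushes in a simplified merge model.

    levels[i] holds how many segments of level i currently exist.
    When a level fills up (reaches merge_factor), its segments are
    merged into a single segment one level higher.
    """
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    levels = [0]
    for _ in range(flushes):
        levels[0] += 1
        level = 0
        while levels[level] == merge_factor:
            levels[level] = 0
            if level + 1 == len(levels):
                levels.append(0)
            levels[level + 1] += 1
            level += 1
    return sum(levels)
```

In this model a lower merge factor keeps fewer segments around (faster searches, more merge work while indexing), and a higher one does the opposite. Sending <optimize/> goes further and merges the index down to a single segment.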

schema.xml

Equivalent to RDBMS schema

Seen it before!

Let's look through it in more detail...

solrconfig.xml

Configures the components available to a Solr system

Specific to a Solr 'core', as is schema.xml

In same directory as schema.xml

Let's look through it in more detail...

Hints and Tips

Hints and Tips: Prototyping

Velocity response writer (/browse)

Data Import Handler (DIH)

XSLTUpdateRequestHandler (Solr 3.4)

Hints and Tips: Architecture

A RESTful service

An index, not a data store: keep ability to re-index

Don't make Solr do things you wouldn't have MySQL do

Hints and Tips: Security

There is none

So use a firewall

Beware what Solr internals you expose:

Query syntax

qt= parameter (e.g. qt=update)

Fake document level security with role fields and filter queries
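
As a sketch of the role-field approach: index each document with the roles allowed to see it, then have your application (never the end user) append a filter query built from the current user's roles. The field name "role" and both helper names are hypothetical; the base URL assumes the example server.

```python
from urllib.parse import urlencode

# Assumption: the example Solr server on its default port.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def role_filter(roles, field="role"):
    """Build a filter query matching documents visible to any of the roles."""
    if not roles:
        raise ValueError("a user with no roles must not be allowed to search")
    # Quote each role as a phrase and escape backslashes and double quotes,
    # so that a role name cannot alter the query syntax.
    quoted = ['"%s"' % r.replace("\\", "\\\\").replace('"', '\\"') for r in roles]
    return "%s:(%s)" % (field, " OR ".join(quoted))


def secured_select_url(q, roles):
    """Combine the user's query with a role filter built on the server side."""
    return SOLR_SELECT_URL + "?" + urlencode({"q": q, "fq": role_filter(roles)})
```

Because the restriction travels in fq rather than being concatenated into q, whatever the user types in the search box cannot remove it. This only holds if users can never reach Solr directly.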

Hints and Tips: Scaling

Index too large: distributed search

Too much traffic: replicated search

How much is too much: unanswerable!
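
For distributed search, the query is sent to one Solr instance with a shards parameter listing every shard to consult; that instance gathers and merges the results. The sketch below builds such a URL. The host names are placeholders and the function name is illustrative.

```python
from urllib.parse import urlencode


def distributed_select_url(coordinator, shards, q):
    """Build a distributed /select URL.

    coordinator and each shard are given as host:port/path without the
    scheme, e.g. "host1:8983/solr", which is the form the shards
    parameter expects.
    """
    params = {"q": q, "shards": ",".join(shards)}
    # Keep ":", "/" and "," unencoded so the shard list stays readable.
    return "http://%s/select?%s" % (coordinator, urlencode(params, safe=":/,"))
```

For example, distributed_select_url("localhost:8983/solr", ["host1:8983/solr", "host2:8983/solr"], "computers") searches both shards through the local instance.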

Time for Questions

And your questions are...

Thank you

upayavira@sourcesense.com

Solr Host Configuration

[Diagram 1: searches; shard 1, shard 2, shard 3]

[Diagram 2: co-ordinator; shard 1, shard 2, shard 3]

[Diagram 3: load balancer; co-ordinator; shard 1, shard 2, shard 3]

[Diagram 4: load balancer; two sets, each with a co-ordinator and shard 1, shard 2, shard 3]