An Introduction to Solr

Searching with Solr

Tom Hill

t.hill@worldware.com

eBig Java SIG, June 18th, 2008

Tonight's Talk

•Tonight's Talk should run about 1 1/2 hours

•About Solr

•Background & overview

•Installing & Bringing Up Solr

•REST Interface & Java Client

•Configuring Solr

Why Implement Search?

•Does your site need search?

•Do you need to implement it, or is Google enough?

• Just text or Structured Data?

• Do you need to control ranking?

What is Solr?

•Web application for text search

•A wrapper around Apache Lucene

•Lucene is a library (.jar file)

•Solr is a web app (.war file)

•Written at CNet, now at Apache

What is Lucene?

•Text search library in Java

•Fast, feature rich.

•Written by Doug Cutting

•Active Apache development community

•Versions also in C++, C#, Ruby, Python, Delphi, Lisp, etc...

Why Solr?

•Reliable

•Fast

•Supported

•Open Source

•Tunable Scoring

Solr Versions

•Current Version is 1.2

• A year old

• 1.3 is coming "sometime"

•Large number of features in HEAD

• Use the latest from subversion for new projects

Alternatives to Solr

•Just Use Google

•Use Lucene

•Use Your Database

•Commercial Libraries

•Write your own

What Solr is Not

•A replacement for a relational database

•An embedded database*

•Fully cross platform :-(

• Replication depends on unix FS

• Admin scripts are bash (minor)

Solr Sites

•CNet (Reviews & Products)

•Internet Archive (Collections)

•Netflix (Movies)

•Zvents (Events)

•StripSearch.ws (Comics)

•And many more

Features

Here's a quick look at some of the features of Solr, as implemented on Zvents.com

Faceted Navigation

•Groups the results by category

•Can do multiple facets at once

•Returns matching counts

Additional Constraints

Synonyms, etc.

Solr Overview

Simple Webapp

[Diagram: Web Servers [1..n]; Database Master; Database Slaves [0..n]; Solr Master; Solr Slaves [0..n]]

Scaling Solr

•Master/Slave architecture

•Writes to master/reads to slaves

•Replication: Periodic transfers, not continuous

•Rsync
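The write/read split above can be sketched as a tiny router. This is only an illustration: the class name `SolrRouter`, the host names, and the round-robin choice are assumptions, not something Solr provides.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: sends updates to the master and spreads queries over the slaves.
final class SolrRouter {
    private final String masterUrl;
    private final List<String> slaveUrls;
    private final AtomicInteger next = new AtomicInteger();

    SolrRouter(String masterUrl, List<String> slaveUrls) {
        this.masterUrl = masterUrl;
        this.slaveUrls = List.copyOf(slaveUrls);
    }

    // All writes (add, delete, commit) go to the single master.
    String updateUrl() {
        return masterUrl + "/update";
    }

    // Reads rotate round-robin over the slaves; with no slaves, fall back to the master.
    String selectUrl() {
        if (slaveUrls.isEmpty()) {
            return masterUrl + "/select";
        }
        int i = Math.floorMod(next.getAndIncrement(), slaveUrls.size());
        return slaveUrls.get(i) + "/select";
    }
}
```

In practice a load balancer in front of the slaves usually does this job; the sketch just makes the rule explicit.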

Updates

•Updates flush caches, bad for performance

•Master therefore much slower than slaves

• So send all queries to slaves

•Depends on your update rates

Solr's Data Model

• Solr maintains a collection of documents

• A document is a collection of fields & values

• A field can occur multiple times in a document

• Documents are immutable.

• They can be deleted, and a new version added, however.
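A rough in-memory picture of this data model, assuming documents are keyed by an id field. `TinyIndex` is a made-up class for illustration and has nothing to do with how Solr stores data internally.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: id -> (field name -> values). A field may hold several values.
final class TinyIndex {
    private final Map<String, Map<String, List<String>>> docs = new LinkedHashMap<>();

    // Documents are never edited in place: adding under an existing id
    // replaces the old version as a whole.
    void add(String id, Map<String, List<String>> fields) {
        Map<String, List<String>> copy = new LinkedHashMap<>();
        fields.forEach((name, values) -> copy.put(name, List.copyOf(values)));
        docs.put(id, Collections.unmodifiableMap(copy));
    }

    void delete(String id) {
        docs.remove(id);
    }

    // Returns all values of a field, or an empty list if the doc or field is missing.
    List<String> values(String id, String field) {
        Map<String, List<String>> doc = docs.get(id);
        return doc == null ? List.of() : doc.getOrDefault(field, List.of());
    }
}
```

Note that re-adding a document does not merge fields: anything missing from the new version is gone.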

Querying

•HTTP request

• http://localhost:8080/comix/select/?q=java
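Queries with spaces, colons, or quotes need URL encoding before they go on the request line. A minimal sketch using only the JDK; the class name `SelectUrl` is made up here, and the base URL is the one from the slide.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a select URL with a properly encoded q parameter.
final class SelectUrl {
    static String build(String baseUrl, String query) {
        return baseUrl + "/select/?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8080/comix", "city:paris"));
    }
}
```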

Solr Query Syntax

•Lucene Query Syntax + a bit

•paris

•city:paris

•title:"The Right Way" AND text:go

•id:[* TO *]

Solr Query Syntax II

•-inStock:false

•te?t

•theat*

•te*t

•test~
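To make the wildcard forms concrete: `?` stands for exactly one character and `*` for zero or more. The sketch below mimics that with a regular expression against a single term. It is only an illustration of the matching rule (Lucene does not work this way internally), and it does not cover the fuzzy `test~` form.

```java
import java.util.regex.Pattern;

// Hypothetical helper: translates a wildcard term into an equivalent regex.
final class Wildcards {
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                sb.append('.');            // exactly one character
            } else if (c == '*') {
                sb.append(".*");           // zero or more characters
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }
}
```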

Using Solr

•Getting data into Solr

•Getting data out of Solr

Getting Data Into Solr

•POST it.

<add>
  <doc>
    <field name="employeeId">05991</field>
    <field name="office">Bridgewater</field>
    <field name="skills">Perl</field>
    <field name="skills">Java</field>
  </doc>
  [<doc> ... </doc>[<doc> ... </doc>]]
</add>
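If you build the add message by hand, remember to escape field values. A small sketch, assuming fields arrive as a map of name to values; `AddXml` is a hypothetical helper, not part of Solr.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: renders one document as an <add> message.
final class AddXml {
    // Escape the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // A multi-valued field is written as one <field> element per value.
    static String addDoc(Map<String, List<String>> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, List<String>> e : fields.entrySet()) {
            for (String value : e.getValue()) {
                sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
                  .append(escape(value))
                  .append("</field>");
            }
        }
        return sb.append("</doc></add>").toString();
    }
}
```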

Committing

•Nothing shows up in the index until you commit

•You can just POST <commit/> to http://host:port/solr/update
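From Java 11 on, the commit POST can be built with the JDK's own HTTP client. A sketch only: the class name `CommitRequest` is made up, and the base URL you pass in depends on your install.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds (but does not send) a commit request.
final class CommitRequest {
    static HttpRequest build(String solrBaseUrl) {
        return HttpRequest.newBuilder(URI.create(solrBaseUrl + "/update"))
                .header("Content-Type", "text/xml")
                .POST(HttpRequest.BodyPublishers.ofString("<commit/>"))
                .build();
    }
}
```

To send it, pass the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and check for status 0 in the response header.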

Getting Data Out

http://localhost:8080/comix/select/?q=data&indent=on

<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="indent">on</str>
    <str name="q">data</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0">
  <doc>
    <str name="id">strip.3136</str>
    <str name="release_date">1992-05-07</str>
    <date name="timestamp">2008-02-28T10:06:01.682Z</date>
    <str name="type">strip</str>
  </doc>
  ...
</result>
</response>
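Without a client library, the XML response can be read with the JDK's DOM parser. A minimal sketch, assuming the response shape shown above and a string field named `id`; `ResponseReader` is a hypothetical helper.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls the hit count and document ids out of a response.
final class ResponseReader {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not a parseable response", e);
        }
    }

    // numFound is the total number of matches, not the number of docs returned.
    static int numFound(Document response) {
        Element result = (Element) response.getElementsByTagName("result").item(0);
        return Integer.parseInt(result.getAttribute("numFound"));
    }

    // Collects <str name="id"> elements that sit directly inside a <doc>.
    static List<String> ids(Document response) {
        List<String> ids = new ArrayList<>();
        NodeList strs = response.getElementsByTagName("str");
        for (int i = 0; i < strs.getLength(); i++) {
            Element str = (Element) strs.item(i);
            if ("id".equals(str.getAttribute("name"))
                    && "doc".equals(str.getParentNode().getNodeName())) {
                ids.add(str.getTextContent());
            }
        }
        return ids;
    }
}
```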

Getting Data Out

JSON format

http://localhost:8080/comix/select/?q=data&indent=on

{
 "responseHeader":{
  "status":0,
  "QTime":1,
  "params":{
   "wt":"json",
   "rows":["1", "1"],
   "start":"0",
   "indent":"on",
   "q":"data",
   "version":"2.2"}},
 "response":{"numFound":2,"start":0,"docs":[
  {
   "feature_id":"3",
   "release_date":"1992-05-07",
   "id":"strip.3136",
   "timestamp":"2008-02-28T10:06:01.682Z"}]

Debug Query Option

•Add &debugQuery=on to request params

•Returns parsed form of query

<str name="rawquerystring">c.i.a</str>
<str name="querystring">c.i.a</str>
<str name="parsedquery">PhraseQuery(text:"c i a")</str>
<str name="parsedquery_toString">text:"c i a"</str>

Debug Query Option II

•Add &debugQuery=on to request params

•Returns scoring information

<str name="id=strip.2781,internal_docid=29854">
2.6219895 = (MATCH) fieldWeight(text:calvin in 29854), product of:
  1.0 = tf(termFreq(text:calvin)=1)
  2.6219895 = idf(docFreq=6222)
  1.0 = fieldNorm(field=text, doc=29854)
</str>
<str name="id=strip.4078,internal_docid=31151">
2.6219895 = (MATCH) fieldWeight(text:calvin in 31151), product of:
  1.0 = tf(termFreq(text:calvin)=1)
  2.6219895 = idf(docFreq=6222)
  1.0 = fieldNorm(field=text, doc=31151)
</str>
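Reading the explain output: the top number is simply the product of the three factors beneath it, here 1.0 × 2.6219895 × 1.0 = 2.6219895. A sketch of that arithmetic; the `tf` formula is the square root used by Lucene's classic default similarity, and the class name is made up.

```java
// Hypothetical helper: reproduces the arithmetic in the explain output.
final class FieldWeight {
    // Classic Lucene default: tf is the square root of the term frequency.
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // fieldWeight is the product of tf, idf and the field norm.
    static double fieldWeight(double tf, double idf, double fieldNorm) {
        return tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        // Values taken from the explain output: termFreq=1, idf=2.6219895, fieldNorm=1.0
        System.out.println(fieldWeight(tf(1), 2.6219895, 1.0));
    }
}
```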

Deleting Data

<delete><id>35</id></delete>

<delete><query>city:paris</query></delete>

•POST

Command Line Control

curl http://localhost:8983/solr/update -H "Content-type: text/xml" --data-binary '<commit/>'

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int>

Solr in 3 minutes!

•Download Solr from Apache

•Untar

•"ant example"

•Start the example app

•Load data into Solr

•Query

Solr in Ten Minutes

•Copy Solr's example/solr dir to /var/solr

•Edit schema.xml and solrconfig.xml

•Load data into Solr

•In $CATALINA_HOME/conf/Catalina/localhost/foo.xml

Directory Layout

•${solr.home}/conf

•schema.xml

•solrconfig.xml

•${solr.home}/data

•${solr.home}/logs

•${solr.home}/bin

Java Solr Client

•Called SolrJ

•Not in Solr 1.2.

• I grabbed it from HEAD in svn

• Works with Solr 1.2

•Add/Delete/Query/Commit/Optimize

Adding Docs w/SolrJ

Given Map<String, String> fields;

CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);

SolrInputDocument doc = new SolrInputDocument();
for (Map.Entry<String, String> e : fields.entrySet()) {
    doc.addField(e.getKey(), e.getValue());
}

UpdateResponse res = server.add(doc);

Deleting Docs w/SolrJ

UpdateResponse res;

res = server.deleteById("100");
res = server.deleteByQuery("city:paris");

Simple Query

CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);

SolrQuery query = new SolrQuery();

query.setQuery("dance");

QueryResponse rsp = server.query(query);

More Interesting Query

SolrQuery query = new SolrQuery();
query.setQuery("dance");
query.setFacet(true);
query.addFacetField("city");
query.setFacetMinCount(1);
query.addSortField("price", SolrQuery.ORDER.asc);
QueryResponse rsp = server.query(query);

Query Responses

QueryResponse qr = server.query(query);
SolrDocumentList docs = qr.getResults();
List<FacetField> lf = qr.getFacetFields();
for (FacetField ff : lf) {
    String fieldName = ff.getName();
    List<FacetField.Count> lc = ff.getValues();
    for (FacetField.Count c : lc) {
        String countName = c.getName();
        long count = c.getCount();
    }
}

Other Commands

•Commit

• server.commit()

•Optimize

• server.optimize()

•Not too complicated!

Request Handlers

•Request handlers define how the query is processed.

•Two main types

•StandardRequestHandler

•DisMaxRequestHandler

•You can implement your own

•Changing in Solr 1.3

"Standard" Request Handler

•Accepts Solr Query Syntax

•I tend to use it for my queries, not user queries.

DisMaxRequestHandler

•Recommended for user queries

•Allows simple user keywords to be applied to multiple fields, with weighting.

•Boost Functions

•Boost Queries

Boost Functions

•Allow you to influence scoring at run time

•Computationally Expensive!

•Really useful for tuning scoring

•linear(x,2,4) returns 2*x+4

• x is a field
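The arithmetic behind `linear` is just a straight line. A sketch with a plain number standing in for the field value; the class name `BoostMath` is made up for illustration.

```java
// Hypothetical helper: the math behind the linear boost function.
final class BoostMath {
    // linear(x, m, c) returns m*x + c, so linear(x, 2, 4) is 2*x + 4.
    static double linear(double x, double m, double c) {
        return m * x + c;
    }

    public static void main(String[] args) {
        // With a field value of 3: 2*3 + 4
        System.out.println(linear(3, 2, 4));
    }
}
```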

The Solr Schema

•schema.xml

•Defines types used in this webapp

•Defines the fields and their types

•Defines "copyFields"

•READ THE EXAMPLE SCHEMA.XML

Types

•Types define processing for a field

•How the words are split (Whitespace? Punctuation? CIA != C.I.A.)

•Stemming

•Case Folding, etc

•Predefined date, int, float, etc
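A toy version of what a text type's analysis might do, to show why CIA and C.I.A. can end up as different terms. It assumes one particular chain (split on punctuation and whitespace, then lower-case); `TinyAnalyzer` is made up and is not one of Solr's analyzers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical analyzer: split on non-alphanumerics, then case-fold.
final class TinyAnalyzer {
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

With this chain, "C.I.A." becomes the three tokens c, i, a, while "CIA" becomes the single token cia, so one will not match the other unless the type is configured to handle it.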

Analysis: Index and Query Time

•Types have two modes

•Index Time

•Query Time

Simple Text Field

Analysis & Facets

•Make sure to use an untokenized field for faceting.

•"San Jose" != "San" "Jose"
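Why this matters, in miniature: counting facet values over a tokenized field counts the individual words. The sketch below fakes both cases with a whitespace split; `FacetCounts` is a made-up class, not Solr's faceting code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: facet counts over one field value per document.
final class FacetCounts {
    static Map<String, Integer> count(List<String> values, boolean tokenized) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            // A tokenized field indexes each word as its own term.
            String[] terms = tokenized ? v.split("\\s+") : new String[] { v };
            for (String t : terms) {
                counts.merge(t, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For the cities San Jose, San Francisco, San Jose, the untokenized counts are the two city names, while the tokenized counts are for San, Jose, and Francisco, which is rarely what a facet list should show.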

Fields

•Elements of a document

•Both predefined & dynamic

•Fields may occur multiple times

•May be indexed and/or stored

Example Fields

Copy Fields

•Two main uses

•To analyze a field in two different ways

•To concatenate fields

The Solr Config File

•solrconfig.xml

•Defines request handlers, defaults, caches, etc.

•Read the example solrconfig.xml

Configuring DisMax

•Parameter defaults set in solrconfig.xml

•Can be overridden in each request

•Except for params labeled invariant
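The precedence rule can be sketched as three maps merged in order. This only illustrates the rule as stated on the slide; `HandlerParams` is a hypothetical class, not Solr code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: works out the parameters a request ends up with.
final class HandlerParams {
    static Map<String, String> effective(Map<String, String> defaults,
                                         Map<String, String> request,
                                         Map<String, String> invariants) {
        Map<String, String> out = new LinkedHashMap<>(defaults);
        out.putAll(request);     // request params override the configured defaults
        out.putAll(invariants);  // invariants win over anything the request sent
        return out;
    }
}
```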

DisMax Config Example

<requestHandler name="dismax" class="solr.DisMaxRequestHandler" >
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
    </str>
    <str name="pf">
      text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9
    </str>
    ...
</requestHandler>

<requestHandler name="dismax" class="solr.DisMaxRequestHandler" >
    ...
    <str name="bf">
      ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3
    </str>
    <str name="fl">
      id,name,price,score
    </str>
    ...
</requestHandler>
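The `recip` function in the `bf` line computes a/(m*x + b), so `recip(x,1,1000,1000)` starts at 1.0 when x is 0 and falls off as x grows. A sketch of the arithmetic with a plain number standing in for `rord(price)`; the class name `RecipMath` is made up.

```java
// Hypothetical helper: the math behind the recip boost function.
final class RecipMath {
    // recip(x, m, a, b) returns a / (m*x + b).
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    public static void main(String[] args) {
        // The config's parameters (m=1, a=1000, b=1000) at a few sample inputs.
        for (double x : new double[] { 0, 1000, 9000 }) {
            System.out.println(x + " -> " + recip(x, 1, 1000, 1000));
        }
    }
}
```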

Wrap Up

Resources

• Solr http://lucene.apache.org/solr/

• wiki, mailing list, jira (bugs/features)

• Lucene http://lucene.apache.org/

• Lucene In Action

• Building Search Applications with Lucene, LingPipe, and Gate, by Manu Konchady
