What's New in Solr? code4lib 2011 preconference Bloomington, IN presented by Erik Hatcher of Lucid Imagination

code4lib 2011 preconference: What's New in Solr (since 1.4.1)


DESCRIPTION

code4lib 2011 preconference, presented by Erik Hatcher of Lucid Imagination. Abstract: The library world is fired up about Solr. Practically every next-gen catalog is using it (via Blacklight, VuFind, or other technologies). Solr has continued improving in some dramatic ways, including geospatial support, field collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree faceting, autosuggest, and more. This session will cover all of these new features, showcasing live examples of them all, including anything new that is implemented prior to the conference.


Page 1: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

What's New in Solr?

code4lib 2011 preconference, Bloomington, IN

presented by Erik Hatcher of Lucid Imagination

Page 2: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

about me

spoken at several code4lib conferences

keynoted Athens '07, along with the pioneering Solr preconference

Providence '09, "Rising Sun"

pre-conferenced Asheville '10, "Solr Black Belt"

co-authored "Lucene in Action", first edition; ghost/toast on second edition

Lucene and Solr committer.

library world claims to fame are founding and naming Blacklight, original developer on Collex and the Rossetti Archive search

now at Lucid Imagination, dedicated to Lucene/Solr support/services/training/etc

Page 3: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

abstract

The library world is fired up about Solr. Practically every next-gen catalog is using it (via Blacklight, VuFind, or other technologies). Solr has continued improving in some dramatic ways, including geospatial support, field collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree faceting, autosuggest, and more. This session will cover all of these new features, showcasing live examples of them all, including anything new that is implemented prior to the conference.

Page 4: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

LIA2 - Lucene in Action

Published: July 2010 - http://www.manning.com/lucene/

New in this second edition:

Performing hot backups

Using numeric fields

Tuning for indexing or searching speed

Boosting matches with payloads

Creating reusable analyzers

Adding concurrency with threads

Four new case studies, and more

Page 5: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

Version Number

Which one ya talking 'bout, Willis?

3.1? 4.0?? TRUNK??

playing with fire

index format changes to be expected

reindexing recommended/required

Solr/Lucene merged development codebases

releases should occur lock-step moving forward

Page 6: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

dependencies

November 2009: Solr 1.4 (Lucene 2.9.1)

June 2010: Solr 1.4.1 (Lucene 2.9.3)

Spring 2011(?): Solr 3.1 (Lucene 3.1)

TRUNK: Solr 4.x (Lucene TRUNK)

Page 7: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

lucene

per-segment field cache, etc

Unicode and analysis improvements throughout

Analysis "attributes"

AutomatonQuery: RegexpQuery, WildcardQuery

flexible indexing

and so much more!

Page 8: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

README

Reindex!

Upgrade SolrJ libraries too (javabin format changed)

Read Lucene and Solr's CHANGES.txt files for all the details

Page 9: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

Analysis

UAX, using ICU

CollationKey

PatternReplaceCharFilter

KeywordMarkerFilterFactory, StemmerOverrideFilterFactory

Page 10: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

Standard tokenization

ClassicTokenizer: old StandardTokenizer

StandardTokenizer: now uses Unicode text segmentation specified by UAX#29

UAX29URLEmailTokenizer

maxTokenLength: default=255

Page 11: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

PathHierarchyTokenizer

delimiter: default=/

replace: default=<delimiter>

"/foo/bar" => [/foo] [/foo/bar]

Page 12: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

CollationKeyFilter

A filter that lets one specify:

A system collator associated with a locale, or

A collator based on custom rules

This can be used for changing sort order for non-English languages as well as to modify the collation sequence for certain languages. You must use the same CollationKeyFilter at both index-time and query-time for correct results. Also, the JVM vendor and version (including patch version) of the slave should be exactly the same as the master (or indexer) for consistent results.

http://wiki.apache.org/solr/UnicodeCollation

see also: ICUCollationKeyFilter

Page 13: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

ICU

International Components for Unicode

ICUFoldingFilter

ICUNormalizer2Filter

name=nfc|nfkc|nfkc_cf

mode=compose|decompose

filter

Page 14: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

ICUFoldingFilter

Accent removal, case folding, canonical duplicates folding, dashes folding, diacritic removal (including stroke, hook, descender), Greek letterforms folding, Han Radical folding, Hebrew Alternates folding, Jamo folding, Letterforms folding, Math symbol folding, Multigraph Expansions: All, Native digit folding, No-break folding, Overline folding, Positional forms folding, Small forms folding, Space folding, Spacing Accents folding, Subscript folding, Superscript folding, Suzhou Numeral folding, Symbol folding, Underline folding, Vertical forms folding, Width folding

Additionally, Default Ignorables are removed, and text is normalized to NFKC.

All foldings, case folding, and normalization mappings are applied recursively to ensure a fully folded and normalized result.
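
To get a feel for what a few of these foldings do, here is a rough approximation using only Python's standard `unicodedata` module. It covers accent removal, case folding, width folding and NFKC normalization, and nothing else from the list above; it is not ICU and will not match ICUFoldingFilter's output in general. The name `rough_fold` is invented for this sketch.

```python
import unicodedata


def rough_fold(text):
    """Approximate a small subset of the foldings (NOT the ICU implementation)."""
    # Compatibility decomposition splits accents off and maps width/ligature forms.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks, which removes most accents and diacritics.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case fold, then recompose to NFKC as the slide says the filter does.
    return unicodedata.normalize("NFKC", stripped.casefold())


for sample in ["Caf\u00e9", "\uff33\uff4f\uff4c\uff52", "\ufb01le"]:
    print(repr(sample), "->", repr(rough_fold(sample)))
```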

Page 15: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

ICUTransformFilter

id: specific transliterator identifier from ICU's Transliterator#getAvailableIDs() (required)

direction=forward|reverse

Examples:

Traditional-Simplified: 簡化字 => 简化字

Cyrillic-Latin: Российская Федерация => Rossijskaâ Federaciâ

Page 16: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

Tom Burton-West's latest

ICU

shingles

query parser

ABC -> [A] [B] [C] or [AB] [BC]...
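
The two tokenizations on the last line can be illustrated with a tiny Python sketch. It works on characters of a plain string and is only meant to show the difference between unigrams and overlapping bigrams (shingles); the function names are made up, and a real analysis chain operates on token streams with positions and offsets.

```python
def unigrams(text):
    """ABC -> [A] [B] [C]"""
    return list(text)


def bigrams(text):
    """ABC -> [AB] [BC]: overlapping two-character shingles."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(unigrams("ABC"))
print(bigrams("ABC"))
```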

Page 17: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

highlighter

deprecated old config, now config as standard search component

FastVectorHighlighter

Page 18: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

FastVectorHighlighter

if termVectors="true", termPositions="true", and termOffsets="true"

and hl.useFastVectorHighlighter=true

hl.fragListBuilder

hl.fragmentsBuilder

Page 19: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

spatial

JTeam's plugin: packaged for easy deployment

Solr trunk capabilities

many distance functions

What's missing?

geo faceting? scoring by distance? distance pseudo-field?

All units in kilometers, unless otherwise specified

Page 20: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

Spatial field types

Point: n-dimensional, must specify dimension (default=2), represented by N subfields internally

LatLon: latitude,longitude, represented by two subfields internally, single valued only

GeoHash: single string representation of lat/lon

Page 21: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

Spatial query parsers

geofilt: exact filtering

bbox: uses (trie) range queries

Parameters:

sfield: spatial field

pt: reference point

d: distance
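
As a sketch of how these parameters fit together in a request, the helper below assembles a filter query for either parser. The helper name is hypothetical; the field name `store` and the reference point are borrowed from the sort-by-function slide later in this deck, and the 5 km distance is an arbitrary example value.

```python
from urllib.parse import urlencode


def spatial_filter_params(parser, sfield, lat, lon, distance_km):
    """Build request parameters for a {!geofilt} or {!bbox} filter query."""
    if parser not in ("geofilt", "bbox"):
        raise ValueError("parser must be 'geofilt' or 'bbox'")
    return {
        "q": "*:*",
        "fq": "{!%s}" % parser,        # which spatial query parser to use
        "sfield": sfield,              # spatial field
        "pt": "%s,%s" % (lat, lon),    # reference point as lat,lon
        "d": str(distance_km),         # distance, in kilometers
    }


params = spatial_filter_params("geofilt", "store", 39.194564, -86.432947, 5)
print("/select?" + urlencode(params))
```

Swapping `geofilt` for `bbox` keeps the same parameters and trades exactness for the cheaper range-query based filter.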

Page 22: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

field collapsing/grouping

backwards compatibility mode?

http://wiki.apache.org/solr/FieldCollapsing

group=true

group.field / group.func / group.query

rows / start: for groups, not documents

group.limit: number of results per group

group.offset: offset into doclist of each group

sort: how to sort groups, by top document in each group

group.sort: how to sort docs within each group

group.format: grouped | simple

group.main=true|false:

faceting works as normal

not distributed savvy yet
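
A minimal sketch of a grouped request built from the parameters above. The helper and the field names (`author`, `pub_date`) are hypothetical and would need to exist in your schema; note that `rows` here counts groups while `group.limit` counts documents inside each group.

```python
from urllib.parse import urlencode


def grouping_params(query, field, groups=10, docs_per_group=1, group_sort=None):
    """Return request parameters for grouping results by a single field."""
    params = [
        ("q", query),
        ("group", "true"),
        ("group.field", field),
        ("rows", str(groups)),                  # number of groups, not documents
        ("group.limit", str(docs_per_group)),   # documents returned per group
    ]
    if group_sort:
        params.append(("group.sort", group_sort))  # ordering within each group
    return params


request = grouping_params("solr", "author", groups=5, docs_per_group=3,
                          group_sort="pub_date desc")
print("/select?" + urlencode(request))
```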

Page 23: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

query parsing

TextField: autoGeneratePhraseQueries="true"

if single string analyzes to multiple tokens

Page 24: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

{!raw|term|field f=$f}...

Recall why we needed {!raw} from last year

<fieldType = .../> - use one string, one numeric, (and one text?)

<field name="..."/>

table for numeric and for string (and text?):

{!raw f=$f} | TermQuery(...)

{!term f=$f} | ...

{!field f=$f} | ...

Which to use when? {!raw} works for strings just fine, but best to migrate to the generally safer/wiser {!term} for future-proofing.

Page 25: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

{!term f=field}

fq={!term f=weight}1.5
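
For client code that builds filter queries from facet clicks, a small helper along these lines produces the same form as the example above. The function name is hypothetical; the point is that the value is passed as-is after the local params, with no query-syntax escaping step in the client.

```python
def term_filter(field, value):
    """Build an fq value that uses the term query parser for one field."""
    return "{!term f=%s}%s" % (field, value)


print(term_filter("weight", 1.5))
```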

Page 26: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

dismax

q.op or schema.xml's <solrQueryParser defaultOperator="[AND|OR]"/> defaults mm to 0% (OR) or 100% (AND)

#code4lib: issues with non-analyzed fields in qf

Page 27: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

edismax

Supports full lucene query syntax in the absence of syntax errors

supports "and"/"or" to mean "AND"/"OR" in lucene syntax mode

When there are syntax errors, improved smart partial escaping of special characters is done to prevent them... in this mode, fielded queries, +/-, and phrase queries are still supported.

Improved proximity boosting via word bigrams... this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field.

advanced stopword handling... stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required.

Supports the "boost" parameter... like the dismax bf param, but multiplies the function query instead of adding it in

Supports pure negative nested queries... so a query like +foo (-foo) will match all documents

Page 28: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

function queries

termfreq, tf, docfreq, idf, norm, maxdoc, numdocs

{!func}termfreq(text,ipod)

standard java.lang.Math functions

Page 29: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

faceting

per-segment, single-valued fields:

facet.method=fcs (field cache per segment)

facet.field={!threads=-1}field_name

threads=0: direct execution

threads=-1: thread per segment

speeds up single and multivalued method=fc, especially for deep paging with facet.offset

date faceting improvements, generalized for numeric ranges too

can now exclude main query

q={!tag=main}the+query&facet.field={!ex=main}category

Page 30: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

pivot/grid/matrix/tree faceting

is this also "hierarchical faceting"? it depends!

Page 31: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

pivot faceting output

/select?q=*:*&rows=0&facet=on&facet.pivot=cat,popularity,inStock&facet.pivot=popularity,cat
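
The slide's output is not reproduced in this transcript, so the sketch below assumes a response shape: each pivot entry is taken to be a mapping with `field`, `value`, `count` and an optional nested `pivot` list. Both that shape and the sample data are assumptions for illustration; check them against a real response before relying on this.

```python
def flatten_pivot(entries, path=()):
    """Walk nested pivot entries and return (path, count) rows, depth first."""
    rows = []
    for entry in entries:
        current = path + ((entry["field"], entry["value"]),)
        rows.append((current, entry["count"]))
        # Recurse into the next pivot level, if there is one.
        rows.extend(flatten_pivot(entry.get("pivot", []), current))
    return rows


# Hypothetical data in the assumed shape, for facet.pivot=cat,popularity
sample = [
    {"field": "cat", "value": "electronics", "count": 3,
     "pivot": [
         {"field": "popularity", "value": 6, "count": 2},
         {"field": "popularity", "value": 7, "count": 1},
     ]},
]

for row_path, count in flatten_pivot(sample):
    print(" > ".join("%s=%s" % pair for pair in row_path), count)
```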

Page 32: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

spell checking

DirectSolrSpellChecker

no external index needed, uses automaton on main index

Page 33: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

spellcheck config

solrconfig.xml

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textgen</str>

  <!-- a spellchecker that uses no auxiliary index -->
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="minPrefix">1</str>
  </lst>
</searchComponent>

Page 34: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

spellcheck handler

solrconfig.xml

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.collate">true</str>
  </lst>

  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

Page 35: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

spellcheck response

{
  'responseHeader'=>{
    'status'=>0,
    'QTime'=>10,
    'params'=>{
      'indent'=>'on',
      'wt'=>'ruby',
      'q'=>'ipud bluck'}},
  'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]
  },
  'spellcheck'=>{
    'suggestions'=>[
      'ipud',{
        'numFound'=>1,
        'startOffset'=>0,
        'endOffset'=>4,
        'suggestion'=>['ipod']},
      'bluck',{
        'numFound'=>1,
        'startOffset'=>5,
        'endOffset'=>10,
        'suggestion'=>['black']},
      'collation','ipod black']}}

http://localhost:8983/solr/select?q=ipud%20bluck&wt=ruby&indent=on
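
Note that `suggestions` is a flat list alternating keys and values, with the collation riding along as the last pair. A client has to pair the entries up itself; the sketch below does that for data shaped like the response above. The function name is made up, and it assumes the simple (non-extended) collation form, where the collation value is a plain string.

```python
def parse_suggestions(flat):
    """Split the alternating key/value list into corrections and a collation."""
    corrections = {}
    collation = None
    for key, value in zip(flat[0::2], flat[1::2]):
        if key == "collation":
            # Caveat: a query term that is literally "collation" would collide here.
            collation = value
        else:
            corrections[key] = value["suggestion"]
    return corrections, collation


# The 'suggestions' value from the response above, written as Python literals
flat = [
    "ipud", {"numFound": 1, "startOffset": 0, "endOffset": 4,
             "suggestion": ["ipod"]},
    "bluck", {"numFound": 1, "startOffset": 5, "endOffset": 10,
              "suggestion": ["black"]},
    "collation", "ipod black",
]

corrections, collation = parse_suggestions(flat)
print(corrections)
print(collation)
```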

Page 36: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

autosuggest

new "spellcheck" component, builds TST

collates query

can check if collated suggestions yield results, optionally, providing hit count

Page 37: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

suggest config

solrconfig.xml

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textgen</str>

  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
    <str name="field">suggest</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

schema.xml

<field name="suggest" type="textgen" indexed="true" stored="false"/>

<copyField source="name" dest="suggest"/>

Page 38: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

suggest handler

solrconfig.xml

<requestHandler class="solr.SearchHandler" name="/suggest">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.count">10</str>
    <str name="rows">0</str>
    <str name="spellcheck.maxCollationTries">20</str>
    <str name="spellcheck.maxCollations">10</str>
    <str name="spellcheck.collateExtendedResults">true</str>
  </lst>
  <arr name="components">
    <str>query</str> <!-- to allow suggestion hit counts to be returned -->
    <str>spellcheck</str>
  </arr>
</requestHandler>

Page 39: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

suggest response

{
  'responseHeader'=>{
    'status'=>0,
    'QTime'=>2},
  'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]
  },
  'spellcheck'=>{
    'suggestions'=>[
      'ip',{
        'numFound'=>1,
        'startOffset'=>0,
        'endOffset'=>2,
        'suggestion'=>['ipod']},
      'collation',[
        'collationQuery','ipod',
        'hits',3,
        'misspellingsAndCorrections',[
          'ip','ipod']]]}}

http://localhost:8983/solr/suggest?q=ip&wt=ruby&indent=on

Page 40: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

sort

by function

&q=*:*&sfield=store&pt=39.194564,-86.432947&sort=geodist() asc

but still can't get value of function back

unless you force it to be the score somehow
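
Since the distance value does not come back with the results, a client that wants to display it can recompute it. The sketch below uses the haversine great-circle formula with a mean Earth radius of 6371 km; treat both as assumptions rather than a statement of what `geodist()` does internally. The store names and coordinates are invented, and only the reference point comes from the slide.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


origin = (39.194564, -86.432947)  # the pt value from the slide
stores = {                        # hypothetical documents with a "store" location
    "store_a": (39.17, -86.52),
    "store_b": (39.77, -86.16),
}

# Client-side equivalent of sort=geodist() asc
nearest_first = sorted(stores, key=lambda name: haversine_km(*origin, *stores[name]))
for name in nearest_first:
    print(name, round(haversine_km(*origin, *stores[name]), 1), "km")
```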

Page 41: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

clustering component

now works out-of-the-box; all Apache license compatible

supports distributed search

Page 42: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

debug=true

debug=true|all|timing|query|results

debug=results&debug.explain.structured=true

Page 43: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

structured explain

'debug'=>{ 'explain'=>{ 'doc1'=>{ 'match'=>true, 'value'=>0.076713204, 'description'=>'fieldWeight(title:solr in 0), product of:', 'details'=>[{ 'match'=>true, 'value'=>1.0, 'description'=>'tf(termFreq(title:solr)=1)'}, { 'match'=>true, 'value'=>0.30685282, 'description'=>'idf(docFreq=1, maxDocs=1)'}, { 'match'=>true, 'value'=>0.25, 'description'=>'fieldNorm(field=title, doc=0)'}]}}}}

http://localhost:8983/solr/select?q=title:solr&debug.explain.structured=true&debug=results&wt=ruby&indent=on
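
The numbers in this explain tree can be checked by hand. The sketch below assumes the classic Lucene default similarity formulas of that era, tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)), and takes the fieldNorm of 0.25 straight from the response. The small difference in the last digits comes from Lucene using 32-bit floats.

```python
import math


def tf(term_freq):
    """Term frequency factor: square root of the raw frequency."""
    return math.sqrt(term_freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


field_norm = 0.25  # taken from the explain output above

# fieldWeight is the product of the three detail values
field_weight = tf(1) * idf(1, 1) * field_norm
print(tf(1), idf(1, 1), field_weight)
```

Having the explanation as structured data means a client can do this kind of comparison programmatically instead of parsing the indented text form.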

Page 44: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

SolrCloud

shared/central config and core/shard management via ZooKeeper, built-in load balancing, and infrastructure for future SolrCloud work.

Page 45: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

/update/json

solrconfig.xml

<requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>

curl 'http://localhost:8983/solr/update/json?commit=true' \
  -H 'Content-type:application/json' \
  -d '{
    "add": {
      "doc": {
        "id" : "MyTestDocument",
        "title" : "This is just a test"
      }
    }
  }'
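
When the document comes from program data rather than a hand-typed string, it is safer to let a JSON library produce the body. A minimal sketch, with a made-up helper name, that builds the same "add" command as the curl example without sending anything:

```python
import json


def add_command(doc):
    """Serialize a single-document "add" command for /update/json."""
    return json.dumps({"add": {"doc": doc}})


body = add_command({"id": "MyTestDocument", "title": "This is just a test"})
print(body)
```

The resulting string is what would go in the request body, with the Content-type header set to application/json as in the curl call.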

Page 46: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

wt=csv

Writes only docs (no response header or response extras) in CSV format

Roundtrippable with /update/csv

provided all fields are stored

Page 47: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

UIMA

Unstructured Information Management Architecture

http://uima.apache.org/

New update processor chain, augmenting incoming documents from a UIMA annotator pipeline

http://wiki.apache.org/solr/SolrUIMA

Page 48: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

(solr|lucene)-dev

ant [idea|eclipse]

go!

http://wiki.apache.org/solr/HowToContribute

Page 49: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

works in progress

some interesting open issues (with patches):

PayloadTermQuery

XMLQueryParser plugin

join

Page 50: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

{!join from=$f to=$t}

insert <what Yonik said>

https://issues.apache.org/jira/browse/SOLR-2272

Page 51: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

Lucid (imagination)

What's Lucid done for you lately -

Yonik, Mark, Grant, Hoss: Lucene and Solr performance, faceting, grouping, join query, spatial, Mahout, ORP, PMC, etc, etc, etc

Other technical staff involved in mailing list assistance, bug reporting, contributing patches (hi Lance, Erick, Jay, Tom, Grijesh, Tomas....)

extended dismax, join, faceting performance improvements

LucidWorks Enterprise

Page 52: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

Hoss Simplicity

http://www.lucidimagination.com/blog/2011/01/21/solr-powered-isfdb-part1/

http://www.lucidimagination.com/blog/2011/01/28/solr-powered-isfdb-part-2/

Page 53: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

LucidWorks Enterprise

"lucid" query parser

click boosting

tunable norms, per-field

role filtering

administrative UI

REST API

Data sources, crawlers, and scheduling

Alerts

http://www.lucidimagination.com/enterprise-search-solutions/lucidworks

Page 54: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

Community Questions

fire away!

Page 55: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

resources

duh!: #code4lib

lucene.apache.org/solr

search.lucidimagination.com/?q=<your query>

Page 56: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

Q&A: faceting

why is paging through facets the way it is?

short-circuits on enum

Page 57: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

Community:

- The state of Extended DisMax, and what Lucene features remain incompatible with it.

- Any developments on faceting (I've implemented the standard workaround to the "unknown facet list size" problem...  but I'd still love to be able to know exactly how long the lists are)

- Hierarchical documents in Solr -- I haven't followed the conversations closely, but I gather that this topic is gaining some momentum in the Solr community.

Page 58: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

contact info

erik.hatcher @ lucidimagination . com

http://www.lucidimagination.com

webinars, documentation

LucidFind: search.lucidimagination.com

search mailing list posts, wiki pages, web sites, our blog, etc for latest Lucene/Solr assistance

Page 60: code4lib 2011 preconference: What's New in Solr (since 1.4.1)

re: code4lib