code4lib 2011 preconference, presented by Erik Hatcher of Lucid Imagination. Abstract: The library world is fired up about Solr. Practically every next-gen catalog is using it (via Blacklight, VuFind, or other technologies). Solr has continued improving in some dramatic ways, including geospatial support, field collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree faceting, autosuggest, and more. This session will cover all of these new features, showcasing live examples of them all, including anything new that is implemented prior to the conference.
What's New in Solr?
code4lib 2011 preconference
Bloomington, IN
presented by Erik Hatcher of Lucid Imagination
about me
spoken at several code4lib conferences
Keynoted Athens '07 along with the pioneering Solr preconference,
Providence '09, "Rising Sun"
pre-conferenced Asheville '10, "Solr Black Belt"
co-authored "Lucene in Action", first edition; ghost/toast on second edition
Lucene and Solr committer.
library world claims to fame are founding and naming Blacklight, original developer on Collex and the Rossetti Archive search
now at Lucid Imagination, dedicated to Lucene/Solr support/services/training/etc
abstract
The library world is fired up about Solr. Practically every next-gen catalog is using it (via Blacklight, VuFind, or other
technologies). Solr has continued improving in some dramatic ways, including geospatial support, field
collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree faceting, autosuggest, and more. This
session will cover all of these new features, showcasing live examples of them all, including anything new that is
implemented prior to the conference.
LIA2 - Lucene in Action
Published: July 2010 - http://www.manning.com/lucene/
New in this second edition:
Performing hot backups
Using numeric fields
Tuning for indexing or searching speed
Boosting matches with payloads
Creating reusable analyzers
Adding concurrency with threads
Four new case studies, and more
Version Number
Which one ya talking 'bout, Willis?
3.1? 4.0?? TRUNK??
playing with fire
index format changes to be expected
reindexing recommended/required
Solr/Lucene merged development codebases
releases should occur lock-step moving forward
dependencies
November 2009: Solr 1.4 (Lucene 2.9.1)
June 2010: Solr 1.4.1 (Lucene 2.9.3)
Spring 2011(?): Solr 3.1 (Lucene 3.1)
TRUNK: Solr 4.x (Lucene TRUNK)
lucene
per-segment field cache, etc
Unicode and analysis improvements throughout
Analysis "attributes"
AutomatonQuery: RegexpQuery, WildcardQuery
flexible indexing
and so much more!
README
Reindex!
Upgrade SolrJ libraries too (javabin format changed)
Read Lucene and Solr's CHANGES.txt files for all the details
Analysis
UAX, using ICU
CollationKey
PatternReplaceCharFilter
KeywordMarkerFilterFactory, StemmerOverrideFilterFactory
Standard tokenization
ClassicTokenizer: old StandardTokenizer
StandardTokenizer: now uses Unicode text segmentation specified by UAX#29
UAX29URLEmailTokenizer
maxTokenLength: default=255
PathHierarchyTokenizer
delimiter: default=/
replace: default=<delimiter>
"/foo/bar" => [/foo] [/foo/bar]
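The "/foo/bar" example above can be imitated in a few lines of Python. This is only a rough sketch of the idea, not the tokenizer's actual code: the function name is made up, and how the real tokenizer treats edge cases (trailing delimiters, repeated delimiters) is not covered by the slide.

```python
def path_hierarchy_tokens(path, delimiter="/", replace=None):
    """Emit one token per ancestor path, shortest first (hypothetical helper)."""
    # Mirror the slide: 'replace' defaults to the delimiter itself.
    replace = delimiter if replace is None else replace
    tokens = []
    current = ""
    for i, part in enumerate(path.split(delimiter)):
        if i == 0:
            current = part  # empty string when the path starts with the delimiter
        else:
            current = current + replace + part
        if current:
            tokens.append(current)
    return tokens


print(path_hierarchy_tokens("/foo/bar"))
```

Indexing every ancestor path is what lets a filter on "/foo" match documents filed under "/foo/bar".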
CollationKeyFilter
A filter that lets one specify:
A system collator associated with a locale, or
A collator based on custom rules
This can be used to change the sort order for non-English languages as well as to modify the collation sequence for certain languages. You must use the same CollationKeyFilter at both index time and query time for correct results. Also, the JVM vendor and version (including patch version) of the slave should be exactly the same as the master's (or indexer's) for consistent results.
http://wiki.apache.org/solr/UnicodeCollation
see also: ICUCollationKeyFilter
ICU
International Components for Unicode
ICUFoldingFilter
ICUNormalizer2Filter
name=nfc|nfkc|nfkc_cf
mode=compose|decompose
filter
ICUFoldingFilter
Accent removal, case folding, canonical duplicates folding, dashes folding, diacritic removal (including stroke, hook, descender), Greek letterforms folding, Han Radical folding, Hebrew Alternates folding, Jamo folding, Letterforms folding, Math symbol folding, Multigraph Expansions: All, Native digit folding, No-break folding, Overline folding, Positional forms folding, Small forms folding, Space folding, Spacing Accents folding, Subscript folding, Superscript folding, Suzhou Numeral folding, Symbol folding, Underline folding, Vertical forms folding, Width folding
Additionally, Default Ignorables are removed, and text is normalized to NFKC.
All foldings, case folding, and normalization mappings are applied recursively to ensure a fully folded and normalized result.
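To get a feel for what NFKC normalization, case folding, and diacritic removal do to text, here is a small sketch using Python's standard unicodedata module. It is an approximation for illustration only: it does not use ICU, the function names are invented, and it covers just a small part of the foldings listed above.

```python
import unicodedata


def approx_nfkc_casefold(text):
    """NFKC-normalize, case-fold, then normalize again (folding can denormalize)."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


def strip_diacritics(text):
    """Decompose, then drop combining marks (one small slice of the folding list)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(approx_nfkc_casefold("\ufb01"))  # the 'fi' ligature becomes two plain letters
print(strip_diacritics("r\u00e9sum\u00e9"))
```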
ICUTransformFilter
id: specific transliterator identifier from ICU's Transliterator#getAvailableIDs() (required)
direction=forward|reverse
Examples:
Traditional-Simplified: 簡化字 => 简化字
Cyrillic-Latin: Российская Федерация => Rossijskaâ Federaciâ
Tom Burton-West's latest
ICU
shingles
query parser
ABC -> [A] [B] [C] or [AB] [BC]...
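The ABC example boils down to choosing between single tokens and overlapping pairs. A minimal sketch of both outputs, assuming the input has already been split into tokens (the function name is hypothetical, and this is not Solr's shingle filter itself):

```python
def unigrams_and_bigrams(tokens):
    """Return the single tokens and the overlapping two-token shingles."""
    unigrams = list(tokens)
    bigrams = [a + b for a, b in zip(tokens, tokens[1:])]
    return unigrams, bigrams


print(unigrams_and_bigrams(list("ABC")))
```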
highlighter
deprecated old config, now config as standard search component
FastVectorHighlighter
FastVectorHighlighter
if termVectors="true", termPositions="true", and termOffsets="true"
and hl.useFastVectorHighlighter=true
hl.fragListBuilder
hl.fragmentsBuilder
spatial
JTeam's plugin: packaged for easy deployment
Solr trunk capabilities
many distance functions
What's missing?
geo faceting? scoring by distance? distance pseudo-field?
All units in kilometers, unless otherwise specified
Spatial field types
Point: n-dimensional, must specify dimension (default=2), represented by N subfields internally
LatLon: latitude,longitude, represented by two subfields internally, single valued only
GeoHash: single string representation of lat/lon
Spatial query parsers
geofilt: exact filtering
bbox: uses (trie) range queries
Parameters:
sfield: spatial field
pt: reference point
d: distance
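As a sketch of how the three parameters fit together, the snippet below assembles a filter query for either parser and URL-encodes it. The helper name is made up; the field name "store" and the reference point are borrowed from the sort example later in this deck, and the 5 km distance is arbitrary.

```python
from urllib.parse import urlencode


def spatial_filter(parser, sfield, pt, d):
    """Build a local-params filter string for geofilt or bbox (hypothetical helper)."""
    if parser not in ("geofilt", "bbox"):
        raise ValueError("parser must be 'geofilt' or 'bbox'")
    lat, lon = pt
    return "{!%s sfield=%s pt=%s,%s d=%s}" % (parser, sfield, lat, lon, d)


fq = spatial_filter("geofilt", "store", (39.194564, -86.432947), 5)
query_string = urlencode({"q": "*:*", "fq": fq})
print(fq)
print(query_string)
```

Swapping "geofilt" for "bbox" keeps the same parameters but trades exactness for the cheaper range-query check.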
field collapsing/grouping
backwards compatibility mode?
http://wiki.apache.org/solr/FieldCollapsing
group=true
group.field / group.func / group.query
rows / start: for groups, not documents
group.limit: number of results per group
group.offset: offset into doclist of each group
sort: how to sort groups, by top document in each group
group.sort: how to sort docs within each group
group.format: grouped | simple
group.main=true|false:
faceting works as normal
not distributed savvy yet
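The interplay of rows/start and group.limit/group.offset is easier to see on toy data. The following is a pure-Python imitation of the semantics described above, not Solr code: the function and the sample documents are invented, and the input list is assumed to be already ranked.

```python
from collections import OrderedDict


def group_results(docs, field, rows=10, start=0, group_limit=1, group_offset=0):
    """Group ranked docs by a field; rows/start page groups, limit/offset page docs."""
    groups = OrderedDict()
    for doc in docs:
        # Groups appear in order of their top-ranked document.
        groups.setdefault(doc[field], []).append(doc)
    page = list(groups.items())[start:start + rows]
    return [
        {
            "groupValue": value,
            "numFound": len(members),
            "docs": members[group_offset:group_offset + group_limit],
        }
        for value, members in page
    ]


ranked = [
    {"id": 1, "author": "a"},
    {"id": 2, "author": "b"},
    {"id": 3, "author": "a"},
    {"id": 4, "author": "c"},
]
print(group_results(ranked, "author", rows=2, group_limit=1))
```

With rows=2 only two groups come back, even though four documents matched.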
query parsing
TextField: autoGeneratePhraseQueries="true"
if single string analyzes to multiple tokens
{!raw|term|field f=$f}...
Recall why we needed {!raw} from last year
<fieldType = .../> - use one string, one numeric, (and one text?)
<field name="..."/>
table for numeric and for string (and text?):
{!raw f=$f} | TermQuery(...)
{!term f=$f} | ...
{!field f=$f} | ...
Which to use when? {!raw} works for strings just fine, but best to migrate to the generally safer/wiser {!term} for future-proofing.
{!term f=field}
fq={!term f=weight}1.5
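A client-side sketch of building that filter: since the value after the closing brace is handed to the field type rather than the Lucene query parser, only URL encoding is needed. The helper name is hypothetical.

```python
from urllib.parse import urlencode


def term_filter(field, value):
    """Build a {!term} filter; the value needs no query-syntax escaping."""
    return "{!term f=%s}%s" % (field, value)


fq = term_filter("weight", "1.5")
encoded = urlencode({"fq": fq})
print(fq)
print(encoded)
```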
dismax
q.op or schema.xml's <solrQueryParser defaultOperator="[AND|OR]"/> defaults mm to 0% (OR) or 100% (AND)
#code4lib: issues with non-analyzed fields in qf
edismax
Supports full lucene query syntax in the absence of syntax errors
supports "and"/"or" to mean "AND"/"OR" in lucene syntax mode
When there are syntax errors, improved smart partial escaping of special characters is done to prevent them... in this mode, fielded queries, +/-, and phrase queries are still supported.
Improved proximity boosting via word bigrams... this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field.
advanced stopword handling... stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required.
Supports the "boost" parameter.. like the dismax bf param, but multiplies the function query instead of adding it in
Supports pure negative nested queries... so a query like +foo (-foo) will match all documents
function queries
termfreq, tf, docfreq, idf, norm, maxdoc, numdocs
{!func}termfreq(text,ipod)
standard java.util.Math functions
faceting
per-segment, single-valued fields:
facet.method=fcs (field cache per segment)
facet.field={!threads=-1}field_name
threads=0: direct execution
threads=-1: thread per segment
speeds up single and multivalued method=fc, especially for deep paging with facet.offset
date faceting improvements, generalized for numeric ranges too
can now exclude main query q={!tag=main}the+query&facet.field={!ex=main}category
pivot/grid/matrix/tree faceting
is this also "hierarchical faceting"? it depends!
pivot faceting output
/select?q=*:*&rows=0&facet=on&facet.pivot=cat,popularity,inStock&facet.pivot=popularity,cat
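What a pivot computes can be imitated with nested counting. This is a toy sketch of the idea on in-memory documents, not Solr's implementation: the function and sample data are invented, and the field/value/count/pivot keys are meant only to suggest the nested shape of the response.

```python
from collections import Counter


def pivot(docs, fields):
    """Count values of the first field, then recurse into each value's documents."""
    if not fields:
        return []
    field, rest = fields[0], fields[1:]
    counts = Counter(doc[field] for doc in docs if field in doc)
    result = []
    for value, count in counts.most_common():
        entry = {"field": field, "value": value, "count": count}
        if rest:
            subset = [d for d in docs if d.get(field) == value]
            entry["pivot"] = pivot(subset, rest)
        result.append(entry)
    return result


sample = [
    {"cat": "book", "inStock": True},
    {"cat": "book", "inStock": False},
    {"cat": "dvd", "inStock": True},
]
tree = pivot(sample, ["cat", "inStock"])
print(tree)
```

Reversing the field order, as the second facet.pivot parameter does, yields a different tree over the same documents.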
spell checking
DirectSolrSpellChecker
no external index needed, uses automaton on main index
spellcheck config
solrconfig.xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textgen</str>
  <!-- a spellchecker that uses no auxiliary index -->
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="minPrefix">1</str>
  </lst>
</searchComponent>
spellcheck handler
solrconfig.xml
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
spellcheck response
{
  'responseHeader'=>{
    'status'=>0,
    'QTime'=>10,
    'params'=>{
      'indent'=>'on',
      'wt'=>'ruby',
      'q'=>'ipud bluck'}},
  'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]
  },
  'spellcheck'=>{
    'suggestions'=>[
      'ipud',{
        'numFound'=>1,
        'startOffset'=>0,
        'endOffset'=>4,
        'suggestion'=>['ipod']},
      'bluck',{
        'numFound'=>1,
        'startOffset'=>5,
        'endOffset'=>10,
        'suggestion'=>['black']},
      'collation','ipod black']}}
http://localhost:8983/solr/select?q=ipud%20bluck&wt=ruby&indent=on
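Note that 'suggestions' is a flat list alternating names and values, not a hash. A client might unpack it along these lines; this sketch copies the list from the response above into Python literals, and the helper names are invented.

```python
def parse_suggestions(flat):
    """Turn the alternating [name, value, name, value, ...] list into pairs."""
    return list(zip(flat[0::2], flat[1::2]))


def corrections(flat):
    """Map each misspelled word to its first suggestion."""
    return {
        name: value["suggestion"][0]
        for name, value in parse_suggestions(flat)
        if isinstance(value, dict) and value.get("suggestion")
    }


def collation(flat):
    """Return the collated query, or None if the response carries none."""
    for name, value in parse_suggestions(flat):
        if name == "collation":
            return value
    return None


suggestions = [
    "ipud", {"numFound": 1, "startOffset": 0, "endOffset": 4, "suggestion": ["ipod"]},
    "bluck", {"numFound": 1, "startOffset": 5, "endOffset": 10, "suggestion": ["black"]},
    "collation", "ipod black",
]
print(corrections(suggestions))
print(collation(suggestions))
```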
autosuggest
new "spellcheck" component, builds TST
collates query
can check if collated suggestions yield results, optionally, providing hit count
suggest config
solrconfig.xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textgen</str>
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
    <str name="field">suggest</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
schema.xml
<field name="suggest" type="textgen" indexed="true" stored="false"/>
<copyField source="name" dest="suggest"/>
suggest handler
solrconfig.xml
<requestHandler class="solr.SearchHandler" name="/suggest">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.count">10</str>
    <str name="rows">0</str>
    <str name="spellcheck.maxCollationTries">20</str>
    <str name="spellcheck.maxCollations">10</str>
    <str name="spellcheck.collateExtendedResults">true</str>
  </lst>
  <arr name="components">
    <str>query</str> <!-- to allow suggestion hit counts to be returned -->
    <str>spellcheck</str>
  </arr>
</requestHandler>
suggest response
{
  'responseHeader'=>{
    'status'=>0,
    'QTime'=>2},
  'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]
  },
  'spellcheck'=>{
    'suggestions'=>[
      'ip',{
        'numFound'=>1,
        'startOffset'=>0,
        'endOffset'=>2,
        'suggestion'=>['ipod']},
      'collation',[
        'collationQuery','ipod',
        'hits',3,
        'misspellingsAndCorrections',[
          'ip','ipod']]]}}
http://localhost:8983/solr/suggest?q=ip&wt=ruby&indent=on
sort
by function
&q=*:*&sfield=store&pt=39.194564,-86.432947&sort=geodist() asc
but still can't get value of function back
unless you force it to be the score somehow
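One workaround for not getting the distance back is to recompute it on the client. The sketch below assumes a great-circle (haversine) distance in kilometers with a mean Earth radius of 6371 km, so its numbers may differ slightly from whatever Solr computes internally. The reference point is the one from the query above; the store list and its coordinates are purely illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; Solr's constant may differ slightly


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def sort_by_distance(docs, pt):
    """Order docs nearest-first, like sort=geodist() asc."""
    return sorted(docs, key=lambda d: haversine_km(pt[0], pt[1], d["lat"], d["lon"]))


pt = (39.194564, -86.432947)
stores = [
    {"name": "Chicago", "lat": 41.8781, "lon": -87.6298},      # illustrative coordinates
    {"name": "Indianapolis", "lat": 39.7684, "lon": -86.1581},  # illustrative coordinates
]
for store in sort_by_distance(stores, pt):
    print(store["name"], round(haversine_km(pt[0], pt[1], store["lat"], store["lon"]), 1))
```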
clustering component
now works out-of-the-box; all Apache license compatible
supports distributed search
debug=true
debug=true|all|timing|query|results
debug=results&debug.explain.structured=true
structured explain
'debug'=>{
  'explain'=>{
    'doc1'=>{
      'match'=>true,
      'value'=>0.076713204,
      'description'=>'fieldWeight(title:solr in 0), product of:',
      'details'=>[{
          'match'=>true,
          'value'=>1.0,
          'description'=>'tf(termFreq(title:solr)=1)'},
        {
          'match'=>true,
          'value'=>0.30685282,
          'description'=>'idf(docFreq=1, maxDocs=1)'},
        {
          'match'=>true,
          'value'=>0.25,
          'description'=>'fieldNorm(field=title, doc=0)'}]}}}}
http://localhost:8983/solr/select?q=title:solr&debug.explain.structured=true&debug=results&wt=ruby&indent=on
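The numbers in that explain can be checked by hand. The sketch below assumes the classic default similarity formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), and takes the fieldNorm of 0.25 as given, since it is a stored, length-dependent value that cannot be derived from the explain output alone.

```python
import math


def tf(freq):
    """Term-frequency factor under the assumed default similarity."""
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency under the assumed default similarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


field_norm = 0.25  # copied from the explain output above
score = tf(1) * idf(1, 1) * field_norm

print(tf(1), idf(1, 1), score)
```

The product agrees with the reported 0.076713204 up to single-precision rounding, which is why the top-level description reads "product of:".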
SolrCloud
shared/central config and core/shard management via zookeeper,
built-in load balancing, and infrastructure for future SolrCloud work.
/update/json
solrconfig.xml
<requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>
curl 'http://localhost:8983/solr/update/json?commit=true' \
  -H 'Content-type:application/json' \
  -d '{
    "add": {
      "doc": {
        "id" : "MyTestDocument",
        "title" : "This is just a test"
      }
    }
  }'
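The same request can be prepared from Python's standard library. This sketch only builds the request object so it runs without a server; the helper name is made up, and the base URL assumes the default example setup used throughout these slides.

```python
import json
from urllib.request import Request


def build_add_request(doc, base_url="http://localhost:8983/solr"):
    """Prepare (but do not send) a JSON 'add' request with commit=true."""
    body = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return Request(
        base_url + "/update/json?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_add_request({"id": "MyTestDocument", "title": "This is just a test"})
print(request.full_url)
print(request.data.decode("utf-8"))
# Passing the request to urllib.request.urlopen would send it to a running Solr.
```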
wt=csv
Writes only docs (no response header or response extras) in CSV format
Roundtrippable with /update/csv
provided all fields are stored
UIMA
Unstructured Information Management Architecture
http://uima.apache.org/
New update processor chain, augmenting incoming documents from a UIMA annotator pipeline
http://wiki.apache.org/solr/SolrUIMA
(solr|lucene)-dev
ant [idea|eclipse]
go!
http://wiki.apache.org/solr/HowToContribute
works in progress
some interesting open issues (with patches):
PayloadTermQuery
XMLQueryParser plugin
join
{!join from=$f to=$t}
insert <what Yonik said>
https://issues.apache.org/jira/browse/SOLR-2272
Lucid (imagination)
What's Lucid done for you lately -
Yonik, Mark, Grant, Hoss: Lucene and Solr performance, faceting, grouping, join query, spatial, Mahout, ORP, PMC, etc, etc, etc
Other technical staff involved in mailing list assistance, bug reporting, contributing patches (hi Lance, Erick, Jay, Tom, Grijesh, Tomas....)
extended dismax, join, faceting performance improvements
LucidWorks Enterprise
Hoss Simplicity
http://www.lucidimagination.com/blog/2011/01/21/solr-powered-isfdb-part1/
http://www.lucidimagination.com/blog/2011/01/28/solr-powered-isfdb-part-2/
LucidWorks Enterprise
"lucid" query parser
click boosting
tunable norms, per-field
role filtering
administrative UI
REST API
Data sources, crawlers, and scheduling
Alerts
http://www.lucidimagination.com/enterprise-search-solutions/lucidworks
Community Questions
fire away!
resources
duh!: #code4lib
lucene.apache.org/solr
search.lucidimagination.com/?q=<your query>
Q&A: faceting
why is paging through facets the way it is?
short-circuits on enum
Community:
- The state of Extended DisMax, and what Lucene features remain incompatible with it.
- Any developments on faceting (I've implemented the standard workaround to the "unknown facet list size" problem... but I'd still love to be able to know exactly how long the lists are)
- Hierarchical documents in Solr -- I haven't followed the conversations closely, but I gather that this topic is gaining some momentum in the Solr community.
contact info
erik.hatcher @ lucidimagination . com
http://www.lucidimagination.com
webinars, documentation
LucidFind: search.lucidimagination.com
search mailing list posts, wiki pages, web sites, our blog, etc for latest Lucene/Solr assistance
re: code4lib