28
Apache Solr Ramzi Alqrainy Search Guy Part 1

Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Embed Size (px)

DESCRIPTION

Apache Solr 4 Part 1 - Why use Solr ? How to build Recency Ranking and Popularity Ranking ?

Citation preview

Page 1: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Apache Solr!

Ramzi Alqrainy!

Search Guy!

Part 1!

Page 2: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

What !is Apache Solr ?!

Page 3: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Apache Solr!!

is!

“ a standalone full-text search server with Apache Lucene at the backend. “!

!

!

Page 4: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Cont.!

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. !

!

In brief Apache Solr exposes Lucene's JAVA API as REST like API's which can be called over HTTP from any programming language/platform.!

Page 5: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Why!use Apache Solr ?!

Page 6: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Features!

l  Full Text Search!l  Faceted navigation!l  More items like this(Recommendation)/

Related searches !l  Spell Suggest/Auto-Complete!l  Custom document ranking/ordering!l  Snippet generation/highlighting!And a lot More....!

Page 7: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Why Solr ?!

Also, Solr is only provides :!

1. Result Grouping / Field Collapsing!

2. Query Elevation!

3. Pivot Facet!

4. Pluggable Search/update Workflow!

5. Hash-Based Duplication!

Page 8: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Field Collapsing  

“ Collapses a group of results with the same field value down to a single (or fixed number) of entries.”!

For example, most search engines such as Google collapse on site so only one or two entries are shown, along with a link to click to see more results from that site. Field collapsing can also be used to suppress duplicate documents.!

Page 9: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Result Grouping  

“ groups documents with a common field value into groups, returning the top documents per group, and the top groups based on what documents are in the groups”!

One example is a search at Best Buy for a common term such as DVD, that shows the top 3 results for each category ("TVs & Video","Movies","Computers", etc)!

Page 10: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Query Elevation  

enables you to configure the top results for a given query regardless of the normal lucene scoring. This is sometimes called "sponsored search", "editorial boosting" or "best bets".!

Page 11: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Pivot Facet  

You can think of it as "Decision Tree Faceting" which tells you in advance what the "next" set of facet results would be for a field if you apply a constraint from the current facet results!

Page 12: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Pluggable Search/update Workflow  

You can modify the workflow of existing API endpoints / document instert or updates!

Page 13: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Hash-Based Duplication  

Determining the uniqueness of a document not based on ad ID-Field, but the hash signature of a field.!

!

Useful for web pages for example, where the URL may be different but the content the same.!

Page 14: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Boost documents by age!

•  Just do a descending sort by age = done?!

•  B o o s t m o r e r e c e n t d o c u m e n t s a n d p e n a l i z e o l d e r documents just for being old!

•  U s e f u l f o r n e w s , business docs, and local search !

Page 15: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Solr: Indexing!In schema.xml: <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/> <field name="pubdate" type="tdate" indexed="true" stored="true" required="true" />

Date published = DateUtils.round(item.getPublishedOnDate(),Calendar.HOUR);

Page 16: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

FunctionQuery Basics!•  FunctionQuery: Computes a value for each

document!– Ranking!– Sorting!

constant literal fieldvalue ord rord sum sub product

pow abs log sqrt map scale query linear

recip max min ms sqedist - Squared Euclidean Dist hsin, ghhsin - Haversine Formula geohash - Convert to geohash strdist

Page 17: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Solr: Query Time Boost!•  Use the recip function with the ms function:!q={!boost b=$recency v=$qq}& recency=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)& qq=wine

•  Use edismax vs. dismax if possible:!

q=wine& boost=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)

•  Recip is a highly tunable function!–  recip(x,m,a,b) implementing a / (m*x + b) –  m = 3.16E-11 a= 0.08 b=0.05 x = Document Age

17

Page 18: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Tune Solr recip function!

18

Page 19: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Tips and Tricks!•  Boost should be a multiplier on the relevancy score !

•  {!boost b=} syntax confuses the spell checker so you need to use spellcheck.q to be explicit!q={!boost b=$recency v=$qq}&spellcheck.q=wine

•  Bottom out the old age penalty using min:!–  min(recip(…), 0.20)

•  Not a one-size fits all solution – academic research focused on when to apply it !

19

Page 20: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

•  Score based on number of unique views!

•  Not known at indexing time!

•  View count should be broken into time slots!

20

Boost by Popularity!

Page 21: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Popularity Illustrated!

21

Page 22: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Solr: ExternalFileField!In schema.xml: <fieldType name="externalPopularityScore" keyField="id" defVal="1" stored="false" indexed="false" class=”solr.ExternalFileField" valType="pfloat"/> <field name="popularity" type="externalPopularityScore" />

22

Page 23: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Popularity Boost: Nuts & Bolts!

23

Logs  Solr  Server  

User activity logged

View  Coun1ng  Job  

solr-home/data/ external_popularity

a=1.114 b=1.05 c=1.111 …

commit

Page 24: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Popularity Tips & Tricks •  For big, high traffic sites, use log analysis!

– Perfect problem for MapReduce!– Take a look at Hive for analyzing large volumes

of log data!

•  Minimum popularity score is 1 (not zero) … up to 2 or more!–  1 + (0.4*recent + 0.3*lastWeek + 0.2*lastMonth

…)!

•  Watch out for spell checker “buildOnCommit”!

24

Page 25: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Filtering By User Preferences •  Easy approach is to build basic preference

fields in to the index:!– Content types of interest – content_type!– High-level categories of interest - category!– Source of interest – source!

!•  We had too many categories and sources that

a user could enable / disable to use basic filtering!– Custom SearchComponent with a connection to a

JDBC DataSource!

25

Page 26: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Preferences Component!

•  Connects to a database!•  Caches DocIdSet in a Solr FastLRUCache!•  Cached values marked as dirty using a

simple timestamp passed in the request!!Declared in solrconfig.xml:! <searchComponent ! class=“demo.solr.PreferencesComponent" ! name=”pref">! <str name="jdbcJndi">jdbc/solr</str> ! </searchComponent>! 26

Page 27: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

YOU HAVE QUESTIONS …. WE HAVE ANSWERS!

[email protected]!

Page 28: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

References!

•  h5p://wiki.apache.org/solr/  •  h5p://www.lucidworks.com/  •  Apache  Solr  4  Cookbook