DESCRIPTION
These slides were used for a presentation by Daniel Tunkelang (Google) and Otis Gospodnetic (Sematext) at the New York CTO Club on December 9th, 2009.

Faceted Search

People come to your site to get the information they need, by exploring, discovering, and making comparisons. You want them to successfully sift through all of your content, quickly and effectively. The traditional approach of providing a search box and a ranked list of results can frustrate users, who need more guidance in order to find what they are looking for, or even to know whether the information is available.

Enter faceted search. Faceted search enables users to navigate a multi-dimensional information space by combining text search with a progressive narrowing of choices in each dimension. This technique has become ubiquitous in online retail, and is increasingly popular in other domains, both on the public internet and on intranets.

This talk will review the basic concepts of faceted search, and then dive into some of the subtler concerns. Specifically, we will elaborate on both the design and implementation concerns that determine whether a faceted search deployment will be successful.

Our own Daniel Tunkelang co-founded Endeca, a pioneer in faceted search, and worked there for 10 years before recently moving to Google. In addition to building the world's leading commercial technology for faceted search, he has played an active role in engaging the broader community of researchers and practitioners to advance understanding of this field. These efforts include organizing an annual workshop on human-computer information retrieval and publishing a textbook on faceted search.

Otis Gospodnetic is the co-founder of Sematext, a Lucene expert, co-author of Lucene in Action and the upcoming Solr in Action, and a long-time Lucene and Solr developer with over 10 years of experience in search and related technologies. Sematext implements open-source search, linguistic, and text analytics technology in the enterprise. They focus on the development of scalable and high-performance search solutions.
New York CTO Club
December 9, 2009
Daniel Tunkelang, Google
Otis Gospodnetić, Sematext
Faceted Search
Agenda
Daniel:
• What is faceted search?
• Why use faceted search?
• Thoughts about design and user experience.
Otis:
• What are Lucene and Solr?
• Why use an open-source search library?
• Thoughts about implementation.
“Regular” Search
Interface:
• User expresses information need as short query.
• Search engine returns ranked, pageable result set.
User happy when...
• Top-ranked result satisfies information need.
• At least some result on first page is relevant.
User unhappy when...
• No result on first page satisfies information need.
• Results misleadingly appear relevant (bait and switch).
Relevance Is Subjective
Relevance is defined as a measure of
information conveyed by a document relative to
a query.
It is shown that the relationship between the
document and the query, though necessary, is
not sufficient to determine relevance.
William Goffman, On relevance as a measure, 1964.
Regular Search Experience
Assumptions Are Dangerous
• self-awareness
• self-expression
• model knows best
• answer is a document
• one-shot query
tf-idf, PageRank
What is Faceted Search?
• Best understood through examples.
  - See the following slides.
  - Or shop on almost any ecommerce site.
• Facets = multiple ways to organize information.
  - Often based on available structured information.
  - But not always, e.g., facets obtained via text mining.
• Typical interaction:
  - User starts with a full-text search.
  - Facets guide query refinement process.
Faceted Search for News
Faceted Search for People
Faceted Search for Breakfast
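The typical interaction described earlier (a full-text search first, then facet-guided refinement) can be sketched in a few lines of Python. This is a toy in-memory illustration under stated assumptions, not how a search engine implements faceting; the catalog, the field names, and the `search` and `facet_counts` helpers are all made up for the example.

```python
from collections import Counter

# Hypothetical toy catalog; titles and facet fields are invented for illustration.
DOCS = [
    {"title": "red running shoes", "brand": "Acme", "color": "red"},
    {"title": "blue running shoes", "brand": "Acme", "color": "blue"},
    {"title": "red hiking boots", "brand": "Zenith", "color": "red"},
    {"title": "red umbrella", "brand": "Zenith", "color": "red"},
]


def search(docs, text, filters=None):
    """Return docs whose title contains every query word and that match all facet filters."""
    words = text.lower().split()
    filters = filters or {}
    return [
        d for d in docs
        if all(w in d["title"].split() for w in words)
        and all(d.get(field) == value for field, value in filters.items())
    ]


def facet_counts(results, fields):
    """For each facet field, count how many results carry each value."""
    return {f: Counter(d[f] for d in results if f in d) for f in fields}


# Step 1: the user starts with a full-text search.
results = search(DOCS, "red")
# Step 2: facet counts over the result set suggest possible refinements.
counts = facet_counts(results, ["brand", "color"])
# Step 3: the user picks a facet value, which narrows the same query.
refined = search(DOCS, "red", {"brand": "Zenith"})
```

The counts are computed over the current result set, not the whole catalog, which is what lets each facet value show how many results a refinement would leave.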
But Facets are Not a Silver Bullet...
• Screen real estate is finite.
  - Choose facets wisely.
  - Choose facet values wisely for monster facets.
• Multiple selection within a facet is powerful, but...
  - Has to be intuitive, especially AND vs. OR.
  - Even trickier for hierarchical facets.
• Search relevance still matters!
  - Most faceted search applications rank results.
  - Irrelevant results => irrelevant facet refinements.
Exploring Information Science
Deliver Precision and Recall
Easier said than done!
Ranking of facet values is an open research topic.
Be Careful with Faceted Search!
Cameras have artists?!
Clarify, Then Refine
Take-Aways
• Faceted search addresses the subjectivity of relevance and information overload.
• But deploying faceted search effectively requires that you think about user experience.
• Recommended reading:
  - My thin book entitled Faceted Search
  - Marti Hearst's book on Search User Interfaces
  - Peter Morville's upcoming book on Search Patterns
Otis Gospodnetić, Sematext
Faceted Search with Lucene & Solr
What is / isn't Lucene
• Free, ASL, Java IR library, Jar
• Doug Cutting, ASF, 2001
• Application agnostic: Indexing & Searching
• High performance, scalable
• No dependencies
• Heavily ported
• No: crawler, rich doc parser, turn-key solution
• No: out-of-the-box faceted search capability... but...
What is/isn't Solr
• Indexing/Search server with HTTP API built on top of Lucene
• Fast & scalable (distributed search, index replication)
• XML, JSON, Ruby, Perl, PHP, javabin
• No: crawler (but Nutch ==> Solr works)
• Yes: rich text parser
• Yes: Faceted Search out of the box!
Solr and Faceted Search
• 3 types of facets: Field Values (text), Dates, Queries.
• "Text": return counts for all/top terms in a field for a result set, e.g. categories a la Amazon
• Dates: return counts for docs in specified date ranges
• Queries: return counts for docs that also match a given query; handy for number ranges (think prices!)
Facet Field Requirements
• Must be indexed
• Often not tokenized
• Often not altered (lowercase, punctuation)
• Storing not required
• Multivalued fields OK
Turn It On
• 0 facets: http://host:80/solr/select?q=foo
• 1 facet: http://host:80/solr/select?q=foo&facet=true&facet.field=category
• N facets: http://host:80/solr/select?q=foo&facet=true&facet.field=category&facet.field=inStock
• facet=true or facet=on
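As a rough sketch, the request URLs above can be assembled programmatically rather than by string concatenation. The `facet_url` helper below is hypothetical (it is not part of Solr or any client library); it only uses the `q`, `facet`, and `facet.field` parameters shown on the slide and repeats `facet.field` once per requested facet.

```python
from urllib.parse import urlencode


def facet_url(base, query, facet_fields=()):
    """Build a Solr select URL, adding one facet.field parameter per facet.

    Hypothetical helper for illustration; `base` is assumed to be the
    select handler URL, e.g. http://host:80/solr/select.
    """
    params = [("q", query)]
    if facet_fields:
        params.append(("facet", "true"))
        # A list of pairs (not a dict) lets the same key appear several times.
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base}?{urlencode(params)}"


no_facets = facet_url("http://host:80/solr/select", "foo")
two_facets = facet_url("http://host:80/solr/select", "foo", ["category", "inStock"])
```

Passing the parameters as a list of pairs matters here: a plain dictionary could hold only one `facet.field` entry.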
Text Facet Response
<result numFound="4" start="0"/>
<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="category">
      <int name="electronics">3</int>
      <int name="copier">0</int>
    </lst>
    <lst name="inStock">
      <int name="false">3</int>
      <int name="true">1</int>
    </lst>
  </lst>
</lst>
• facet.mincount=1 to avoid 0-count facet values
• facet.limit=N to limit to top N facet values
• facet.missing=true to catch uncategorized
• lots of other options!
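A minimal sketch of reading the field facet counts out of such a response with Python's standard library follows. It assumes the fragment above is wrapped in a single `<response>` root element, and the `parse_facet_fields` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# The slide's response fragment, wrapped in an assumed <response> root element.
RESPONSE = """<response>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="category">
        <int name="electronics">3</int>
        <int name="copier">0</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>"""


def parse_facet_fields(xml_text):
    """Return {field: {value: count}} from the facet_fields section (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    fields = root.find("./lst[@name='facet_counts']/lst[@name='facet_fields']")
    if fields is None:
        return {}  # faceting was not requested or returned nothing
    return {
        field.get("name"): {v.get("name"): int(v.text) for v in field.findall("int")}
        for field in fields.findall("lst")
    }


facets = parse_facet_fields(RESPONSE)
# Client-side equivalent of facet.mincount=1: hide zero-count values.
visible_categories = {v: n for v, n in facets["category"].items() if n > 0}
```

Filtering on the client is shown only for illustration; sending `facet.mincount=1` keeps the zero-count values out of the response in the first place.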
Date Facets
• http://.../solr/select/?q=*:*&rows=0&facet=true&facet.date=timestamp&facet.date.start=NOW/DAY-5DAYS&facet.date.end=NOW/DAY%2B1DAY&facet.date.gap=%2B1DAY
• (%2B1 ==> +1; %2B is the URL-encoded plus sign)
• Solr Date Math Parser syntax: /HOUR, +2YEARS, -1DAY, /DAY+6MONTHS+3DAYS, +6MONTHS+3DAYS/DAY
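The plus sign needs encoding because a literal `+` in a query string is decoded as a space. A small sketch of letting the standard library handle that, using the same `facet.date.*` parameters as the slide, is below; the `date_facet_url` helper and the `localhost:8983` base URL are assumptions made for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def date_facet_url(base, field, start, end, gap):
    """Build a date-facet request URL (hypothetical helper).

    The date math strings are passed in unencoded; urlencode turns "+" into
    %2B. The safe="*:/" argument only keeps those characters readable.
    """
    params = [
        ("q", "*:*"),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.date", field),
        ("facet.date.start", start),
        ("facet.date.end", end),
        ("facet.date.gap", gap),
    ]
    return f"{base}?{urlencode(params, safe='*:/')}"


url = date_facet_url(
    "http://localhost:8983/solr/select",  # assumed host and port
    "timestamp",
    "NOW/DAY-5DAYS",
    "NOW/DAY+1DAY",
    "+1DAY",
)
# Decoding the query string recovers the original date math expressions.
decoded = parse_qs(urlsplit(url).query)
```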
Date Facet Response
<result name="response" numFound="42" start="0"/>
<lst name="facet_counts">
  <lst name="facet_dates">
    <lst name="timestamp">
      <int name="2007-08-11T00:00:00.000Z">1</int>
      <int name="2007-08-12T00:00:00.000Z">5</int>
      <int name="2007-08-13T00:00:00.000Z">3</int>
      <int name="2007-08-14T00:00:00.000Z">7</int>
      <int name="2007-08-15T00:00:00.000Z">2</int>
      <int name="2007-08-16T00:00:00.000Z">16</int>
      <str name="gap">+1DAY</str>
      <date name="end">2007-08-17T00:00:00Z</date>
    </lst>
  </lst>
</lst>
Query Facets
• http://.../solr/select?q=shoes&rows=0&facet=true&facet.field=inStock&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]
• Avoids the bucket-at-index-time work-around
• Keep queries disjoint (square brackets are inclusive, so [* TO 500] and [500 TO *] both match a price of exactly 500)
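One way to keep range buckets disjoint, assuming prices are whole numbers, is to end each inclusive range one unit below the start of the next. The `price_facet_queries` helper below is a hypothetical sketch of that idea; it only generates the `facet.query` strings and does not talk to Solr.

```python
def price_facet_queries(field, boundaries):
    """Build non-overlapping inclusive range queries for an integer-valued field.

    Hypothetical helper. For boundaries [500, 1000] it yields three buckets:
    [* TO 499], [500 TO 999], and [1000 TO *].
    """
    queries = []
    lower = "*"
    for bound in sorted(boundaries):
        # End each bucket one below the next bucket's start so no value
        # falls into two buckets.
        queries.append(f"{field}:[{lower} TO {bound - 1}]")
        lower = bound
    queries.append(f"{field}:[{lower} TO *]")
    return queries


two_buckets = price_facet_queries("price", [500])
three_buckets = price_facet_queries("price", [1000, 500])
```

For fractional prices this trick does not apply directly; the bucket boundaries would need a different convention.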
Query Facet Response
<result numFound="3" start="0"/>
<lst name="facet_counts">
  <lst name="facet_queries">
    <int name="price:[* TO 500]">3</int>
    <int name="price:[500 TO *]">1</int>
  </lst>
  <lst name="facet_fields">
    <lst name="inStock">
      <int name="false">3</int>
      <int name="true">1</int>
    </lst>
  </lst>
</lst>
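In this response the two query facet counts add up to 4 while numFound is 3, which can only happen if some document matches both queries. A quick sanity check along those lines might look like the sketch below; the `<response>` wrapper element and the `facet_query_counts` helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The query-facet part of the slide's response, wrapped in an assumed <response> root.
RESPONSE = """<response>
  <result numFound="3" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries">
      <int name="price:[* TO 500]">3</int>
      <int name="price:[500 TO *]">1</int>
    </lst>
  </lst>
</response>"""


def facet_query_counts(xml_text):
    """Return (numFound, {facet query: count}) from a response (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    num_found = int(root.find("result").get("numFound"))
    path = "./lst[@name='facet_counts']/lst[@name='facet_queries']/int"
    counts = {node.get("name"): int(node.text) for node in root.findall(path)}
    return num_found, counts


num_found, counts = facet_query_counts(RESPONSE)
# Disjoint buckets can never sum to more than the number of matching docs.
overlapping = sum(counts.values()) > num_found
```

The check is one-directional: a sum above numFound proves an overlap, but a sum at or below it does not prove the buckets are disjoint.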
UI Integration
• Use Filter Queries via fq
• http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0+TO+300]
• http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0+TO+300]&fq=inStock:true
• Important: single request does it all
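A sketch of how a UI layer might turn the user's current facet selections into one request follows: the original query stays in `q`, each selection becomes its own `fq`, and the facet fields are requested again so the refined counts come back in the same response. The `refine_url` helper and the `localhost:8983` base URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def refine_url(base, query, facet_fields, selections):
    """Build one request carrying the query, the facets to count, and all filters.

    Hypothetical helper. `selections` is a list of (field, value) pairs the
    user has clicked; values are assumed to be valid query syntax already
    (real code would need to escape special characters in them).
    """
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    params += [("fq", f"{field}:{value}") for field, value in selections]
    return f"{base}?{urlencode(params)}"


url = refine_url(
    "http://localhost:8983/solr/select",  # assumed host and port
    "shoes",
    ["category"],
    [("price", "[0 TO 300]"), ("inStock", "true")],
)
# Decode the query string to see the parameters as the server would.
sent = parse_qsl(urlsplit(url).query)
```

Keeping each selection in a separate `fq` mirrors the second URL on the slide, and removing a selection in the UI is then just a matter of dropping one pair from the list.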
State of Lucene & Solr
• Super healthy community, exploding development
• Lucene 3.0 – 2009-11-25:
  - Performance, faster range queries, clean API, better Unicode support, more non-English support
• Solr 1.4 – 2009-11-10:
  - Performance, new replication, DB indexing, rich-doc indexing, results clustering, faster response protocol, deduplication...
Lucene, Solr, Enterprise
• Free: Community
  - Lucene ~600 emails/month (dev: 2000/month)
  - Solr ~1300 emails/month (dev: 800/month)
• Commercial: Support Subscriptions
  - Sematext
  - Lucid Imagination