Page 1: The Future of Search in Plone

The Future of Search in Plone

Sally Kleinfeldt and friends

Plone Conference, San Francisco

November 3, 2011

Tuesday, November 29, 2011

Page 2: The Future of Search in Plone

Motivation

• Raise awareness

• Promote discussion

• Forge consensus

Page 3: The Future of Search in Plone

Agenda

• Introduction to IR concepts

• Description of Solr and ZCatalog

• Discussion

Page 4: The Future of Search in Plone

IR 101

Page 5: The Future of Search in Plone

IR 101

• Transformations

• Terms

• Models

• Measures

Page 6: The Future of Search in Plone

IR 101: Transformations

• Turn binary, HTML, or other document formats into fields and strings

• Parse the strings into a set of terms

• Build indexes of the terms specific to the IR model used

• Queries are parsed into query operators and strings, which are parsed into terms

Page 7: The Future of Search in Plone

IR 101: String => Terms

• Tokenization - locate word boundaries

• Normalization - remove capitals and diacritics

• Stopping - remove stop words (a, of, on, the...)

• Stemming - reduce to word stems (walks, walking => walk)

• Recognizers - concepts, parts of speech, names, locations...

• The same processing must be applied to documents and queries
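
The steps on this slide can be sketched as a small pipeline. This is only an illustration, not how ZCatalog or Solr implement it: the stop word list and the suffix-stripping `stem` function are hypothetical stand-ins (real systems typically use a proper stemmer such as Porter's), and the recognizer step is omitted.

```python
import re
import unicodedata

# Illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the"}


def tokenize(text):
    """Tokenization: locate word boundaries (runs of word characters)."""
    return re.findall(r"\w+", text)


def normalize(token):
    """Normalization: lowercase the token and strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem(token):
    """Toy stemmer: strip a few English suffixes, keeping at least 3 letters."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def to_terms(text):
    """Run the full pipeline: tokenize, normalize, stop, stem."""
    terms = []
    for token in tokenize(text):
        token = normalize(token)
        if token in STOP_WORDS:
            continue  # Stopping: drop stop words
        terms.append(stem(token))
    return terms
```

Calling `to_terms` on both document text and query text keeps the two sides comparable, so a query for "walks" can match a document containing "Walking".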

Page 8: The Future of Search in Plone

IR 101: Terms

• Application specific

• Words or phrases

• IR models assign weights to terms in documents

Page 9: The Future of Search in Plone

IR 101: Term Weighting

• Simplest: Yes/No Boolean value

• Better: Term Frequency - # occurrences

• More meaningful: tf-idf

• tf-idf = Term Frequency * Inverse Document Frequency

• Inverse Document Frequency: how many documents contain the term?

• Increases the weight of rare terms and decreases the weight of common ones
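
As a rough illustration of the weighting above, the sketch below computes tf-idf with the raw term count as tf and `log(N / df)` as idf. That is one common textbook variant chosen for brevity; search engines use various smoothed forms, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))

    weights = []
    for terms in docs:
        term_freq = Counter(terms)  # Term frequency: number of occurrences
        weights.append(
            {
                term: count * math.log(n_docs / doc_freq[term])
                for term, count in term_freq.items()
            }
        )
    return weights
```

A term that appears in every document gets weight 0 under this variant, while a term found in only one document gets the largest boost.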

Page 10: The Future of Search in Plone

IR 101: Boolean Model

• First and most adopted

• Based on Boolean logic + set theory

• Does a document contain query terms - Y/N

• Intuitive, easy to implement

• Drawbacks: no ranking, requires a special query language, returns too many or too few results

• Typical for library systems
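
A minimal sketch of the Boolean model, assuming documents have already been reduced to lists of terms. The function names are hypothetical. Each term maps to the set of documents containing it, and AND, OR, and NOT become set intersection, union, and difference. Note that the results are unordered sets: there is no ranking.

```python
from collections import defaultdict


def build_index(docs):
    """docs: {doc_id: list of terms}. Returns {term: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search_and(index, *terms):
    """Documents containing every term (set intersection)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, *terms):
    """Documents containing at least one term (set union)."""
    return set().union(*(index.get(term, set()) for term in terms))


def search_not(index, all_ids, term):
    """Documents that do not contain the term (set difference)."""
    return set(all_ids) - index.get(term, set())
```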

Page 11: The Future of Search in Plone

IR 101: Vector Space Models

• Represent documents and queries as vectors of terms

• Term values are weighted - by count or tf-idf

• Use vector operations to compare documents with queries

• Relevance score based on cosine of angle between doc/query vectors
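
The cosine-based relevance score can be sketched as follows. For simplicity this assumes term weights are raw counts (tf-idf weights would slot in the same way) and represents each vector as a sparse `{term: weight}` dict; the helper names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_terms, docs):
    """docs: {doc_id: list of terms}. Returns matching doc_ids, best first."""
    query_vec = Counter(query_terms)
    scored = [
        (cosine(query_vec, Counter(terms)), doc_id)
        for doc_id, terms in docs.items()
    ]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]
```

Unlike the Boolean model, every matching document gets a score, so results can be returned in order of relevance.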

Page 12: The Future of Search in Plone

IR 101: Probabilistic Models

• Compute probability that a document is relevant to a query

• Relevance ranking functions range from simple to complex

• Sophisticated ranking functions include

• Okapi BM25 (uses tf and idf)

• Machine learning formulas (use training data)
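
To show how Okapi BM25 combines tf and idf, here is a sketch of one commonly used form of the formula. The `+ 1` inside the logarithm is a smoothing variant that keeps idf non-negative, and `k1=1.2`, `b=0.75` are typical default values. Both are assumptions made here, not details from the talk, and the exact formula in ZCTextIndex may differ.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """docs: {doc_id: list of terms}. Returns {doc_id: BM25 score}."""
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs

    # Document frequency of each term across the collection.
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))

    scores = {}
    for doc_id, terms in docs.items():
        term_freq = Counter(terms)
        # Longer-than-average documents are penalized, controlled by b.
        length_norm = 1.0 - b + b * len(terms) / avg_len
        score = 0.0
        for q in query_terms:
            tf = term_freq[q]
            if tf == 0:
                continue
            df = doc_freq[q]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Term frequency saturates: extra occurrences add less and less.
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores[doc_id] = score
    return scores
```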

Page 13: The Future of Search in Plone

IR 101: Extending the Models

• Many, many refinements possible

• Term interdependencies

• Fuzzy sets

• Semantic analysis, link analysis

• Combining models (Extended Boolean)

• The best search engines represent thousands of engineering hours

Page 14: The Future of Search in Plone

IR 101: Measures

• Search engine results are measured against:

• Precision - Percent of results that are relevant

• Recall - Percent of relevant documents that are returned

• F-Score - Harmonic mean of precision and recall
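
The three measures can be computed directly from the set of returned results and the set of documents judged relevant. The sketch below returns fractions between 0 and 1 rather than percentages; the function name `evaluate` is hypothetical.

```python
def evaluate(returned, relevant):
    """Return (precision, recall, f_score) for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were returned

    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # Harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For example, if a search returns documents 1, 2, 3, 4 and the relevant documents are 2, 4, 5, then precision is 2/4, recall is 2/3, and the F-score is 4/7.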

Page 15: The Future of Search in Plone

ZCatalog and Solr

Page 16: The Future of Search in Plone

ZCatalog

• Zope/Plone search engine

• Full text and field searching

• Probabilistic model using Okapi BM25

• Out of the box, ZCTextIndex is very simple

• TextIndexNG adds multilingual, better parsing components, binary transforms, synonyms

Page 17: The Future of Search in Plone

Solr

• Popular open source enterprise search platform

• Eliminating smaller commercial search companies

• Written in Java, based on the Lucene Java search library; sophisticated vector space ++ model

• RESTful APIs

• Large, active community

• Powers Twitter, Wikipedia, Netflix...

Page 18: The Future of Search in Plone

What does Solr have that ZCatalog doesn’t?

• Better relevance ranking

• More search features: snippets, hit highlighting, spelling suggestions, synonyms, more like this, faceted search

• More configurable: stop words, field boosting, parsing components

• An army of engineers working on it
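
Several of these features are requested through query parameters on Solr's HTTP search handler. The sketch below only builds such a URL and does not contact a server. The base URL and the field names `SearchableText` and `portal_type` are assumptions for illustration; the real values depend on the Solr installation and its schema, and `build_solr_query` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, facet_fields=(), highlight_field=None, rows=10):
    """Build a Solr select URL asking for JSON results, facets, and highlighting."""
    params = [("q", text), ("wt", "json"), ("rows", str(rows))]

    if highlight_field:
        # Hit highlighting: return snippets with matches marked up.
        params += [("hl", "true"), ("hl.fl", highlight_field)]

    if facet_fields:
        # Faceted search: return value counts for each listed field.
        params.append(("facet", "true"))
        params += [("facet.field", field) for field in facet_fields]

    return base_url + "?" + urlencode(params)


# Example (assumed local Solr instance and assumed field names):
url = build_solr_query(
    "http://localhost:8983/solr/select",
    "plone search",
    facet_fields=["portal_type"],
    highlight_field="SearchableText",
)
```

Other features such as spelling suggestions and "more like this" are enabled in a similar way, but they also need the corresponding components configured on the Solr side.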

Page 19: The Future of Search in Plone

Plone + Solr: Today

• Two add-ons available

• collective.solr - Intercepts catalog queries and dispatches them to Solr

• alm.solrindex - adds a new index type to the catalog, SolrIndex

• Plus a buildout recipe: collective.recipe.solrinstance

Page 20: The Future of Search in Plone

Conclusions from Conference Discussion

Page 21: The Future of Search in Plone

Why Does Plone Need Solr?

• Certain types of projects need it, for features or because ZCatalog can’t scale to very large sites

• We need it to keep up with the enterprise CMS pack

Page 22: The Future of Search in Plone

Points of Agreement

• It will be impossible to completely replace ZCatalog with Solr

• Solr indexing will never be transactional

• Removing ZCatalog from Zope would be very difficult

• Tackle small, focused ZCatalog improvements when possible - like improving indexing interface

Page 23: The Future of Search in Plone

Points of Agreement

• Navigation and search should be handled separately

• Navigation needs to be transactional, search does not

• Split out a catalog used for navigation from the general catalog

• Explore a non-catalog utility to support navigation, optimize for speed

Page 24: The Future of Search in Plone

Points of Agreement

• Treating Solr integration simply as ZCatalog replacement does not take best advantage of Solr features

• ZCatalog can’t represent the richness of Solr; focus on the Solr API instead

• Take advantage of spelling suggestions, facets, results snippets with hit highlighting, synonyms, more like this, etc.

• Provide configuration choices for Solr indexing, field weighting, etc. in the control panel

Page 25: The Future of Search in Plone

Points of Agreement

• Neither of the current Solr add-ons provides the best foundation for the future

• But they’ve taught us how to do things better

• Non-Solr approaches to improved Plone search should be deprecated

• Andreas Jung is not planning improvements to TextIndexNG!

Page 26: The Future of Search in Plone

Points of Agreement

• Stop investing in ZCatalog as a search engine; Solr is the future

Page 27: The Future of Search in Plone

Plone + Solr: Roadmap

• Short term: Make Solr integration easy with an approved add-on (like LDAP)

• Build on what we’ve learned and create a better add-on to replace collective.solr and alm.solrindex

• Who wants to sponsor a sprint?

Page 28: The Future of Search in Plone

Plone + Solr: Roadmap

• Long term: Ship Solr integration with Plone, but don’t require Solr

• Solr has a lot of overhead and is not always needed

• But using it should be as easy as answering yes to a “Build with Solr?” installation option
