Slide deck with key points when migrating from FAST ESP to Apache Solr. By Cominvent AS
cominvent as
Migrating FAST to Solr
by Jan Høydahl
Enterprise Search Specialists
Consulting
– Cominvent delivers independent search consulting
– Focus on Apache Lucene/Solr & Microsoft FAST ESP
– We know both the proprietary and Open Source worlds, their benefits and disadvantages. We help you choose. We help you maximize your chosen engine, and we help you migrate as your requirements change.
Training
– Cominvent AS delivers public and on-site training
– Certified Solr Training Partner for Lucid Imagination
– Certified FAST ESP Training Partner
– Read more: http://www.cominvent.com/training/
Commercial Support
– When community & mailing list support is not enough...
– Paid support agreement for Apache Solr/Lucene
– In cooperation with Lucid Imagination
– Read more: http://www.cominvent.com/support/
Jan Høydahl – experience
● IT architect, 15 years with search, telecom, mobile
● Helped build FAST's Global Services as first engineer
● Founder of Cominvent AS
● Search consultant 10 years
● Certified Solr instructor
Recommendations
«His skills on Fast ESP is in-depth, thorough, and probably amongst the best you can get. Jan is working independently, but also well in teams. Whether it is technical or business work, Jan does not fall behind. His excellent skills to see things from the holistic perspective is great.»
-Knut Stenmark, DPM AS
Sample consulting projects
– Worldwide news agency
• Chief architect of FAST ESP search solution, migrating from Autonomy IDOL. Real-time news, alerting etc.
– Major Swedish newspaper
• Architect for new Topic Page solution, letting editors define topics based on keywords and regex rules.
– Norwegian Yellow Pages actor
• Architect for migrating traditional DB-backed catalog search to a modern one-search-box solution.
– Classifieds and real estate online broker
• Advice on migrating from DB to search. Architect for FAST ESP solution with Norwegian linguistics, search middleware and relevance tuning.
– Leading news surveillance company
• Helped implement and tune real-time search using FAST ESP and real-time alerting using FAST RTA.
Sample Solr Training references
– Danish national library organization serving all Danish libraries
• Migrating from in-house search to Apache Solr for all their search
• Delivered Solr training course in 2010
– Global library org, serving hundreds of libraries worldwide
• Helping them migrate from FAST to Solr
• First step is Classroom Training in March 2010
About Apache Solr
– Open Source enterprise search server
– Built on the popular Apache Lucene library
– 100% Java, runs on all platforms and environments
– Supports billions of documents, high scalability and advanced features like faceting, highlighting, document format conversions, GEO search etc.
– Indexes most languages including CJK
– Platform not language aware, but each field can be configured for language-specific tokenization, stemming, stop word processing etc.
– Very active developer and user communities
– Apache 2.0 license – commercially friendly
– Rapid growth in companies providing support etc.
Solr-user community growth
[Chart: Solr-user growth. Messages per month on the solr-user list, Jan 2006 to Feb 2010; vertical axis 0 to 1600 messages]
Lucene/Solr deployments
– More: http://wiki.apache.org/solr/PublicServers
Thanks to Lucid Imagination for logo collection
Solr in media & newspapers
– News search. Also exposes API
– Danish news search
– Swedish news search
– Swedish news search
– Faceted search through classifieds
– Eastern european classifieds
Sample FAST-Solr switchers
– Human Rights search
• hurisearch.org (blog)
– FINN katalog (former Sesam)
• katalog.finn.no (announce)
– Mocality – African business search
• mocality.co.ke (linkedin)
– International library search
• Large multi-lingual index
– Norwegian media house
• Multiple newspapers
Solr Architecture
The migration...
Migration objectives
– Possible objectives include:
• Lower maintenance cost
• Deeper in-house competency
• Less dependent on external consultants
• Ownership and visibility of source code
• Shorter time to market for new features
• Bugs fixed faster, or even fixed by ourselves
• Larger community, mailing lists that work!
• More choice in external consultants
• Contribute back to Open Source
• Lower HW footprint
Migration steps
– Knowledge gathering & Training
– Review current features & architecture
• Want to keep all features? Add new?
– Migration areas:
• Index profile
• Content
• Feeding
• Document Processing
• Querying
• Search middleware?
• Admin & Operational
– What to do in Application space vs Search space?
Feature comparison ESP – Solr (similarities)
Feature | ESP | Solr
Full-text, boolean, range search, sorting, sub-second, facets, did-you-mean, synonyms | Yes | Yes
Scaling for QPS | Add rows | Add rows
Scaling for document volume | Add columns | Add shards
Synonyms | Index/query side | Index/query side
GEO search | Yes | Yes (1.5)
Boolean query language | Yes (FQL) | Yes (Lucene or (e)DisMax)
APIs | HTTP, Java, .NET, C++, PHP | HTTP, Java, .NET, Ruby, Python, PHP, Perl, JS
Feature comparison ESP – Solr (differences)
Feature | ESP | Solr
Admin server | Yes | No (coming 1.5)
Processes | Many (C++, Java, Python) | One WAR in Java app-server, 100% Java
Navigators / Facets | Index-time | Query-time
Did-you-mean | Dictionary based | Dictionary or index based
Feeding | API only | HTTP POST or API
Document processing | Pipeline (py) | Simple pipeline (Java, JS, Groovy, Jython, JRuby..)
Multi field querying | Composite fields | DisMax handler
Feature comparison ESP – Solr (differences)
Feature | ESP | Solr
Relevancy tuning | Rank profiles, term boosting | Dynamic function queries and boost functions
XRANK | XRANK operator | Function Queries
Freshness boost | Freshness in rank profile | Function Queries
Boost GEO distance | Rank profile and special | Function Queries
Major schema or software updates | Cold update, use stage environment | Stage new content into new Solr core
Pluggability | Docprocs, clients | Everything :) Request Handlers, Query Parsers, Docprocs, Rank, Spell, tokenizer++
Feature comparison ESP – Solr (differences)
Feature | ESP | Solr
Lemmatization | Can be licensed for many languages | Can be licensed from 3rd party
Query syntax | and(a:foo, b:bar) | a:foo OR b:bar
 | i:range(0, 100) | i:[0 TO 100]
 | d:range(2000-01-01T00:00:00, 2010-03-03T12:00:00) | d:[2000-01-01T00:00:00Z TO NOW]
Query params | query=, offset=, hits=, spell=1 | q=, start=, rows=, spellcheck=true
What fields to return | view=viewname | fl=title,price,body...
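The parameter rows in the comparison above can be turned into a small translation helper for an application that still emits ESP-style parameters. This is only a sketch: the `ESP_TO_SOLR` and `VIEW_FIELDS` tables and the `listing` view are hypothetical names invented here, and the query text itself (FQL vs. Lucene syntax) is passed through untranslated.

```ruby
require 'uri'

# Hypothetical mapping of ESP parameter names to Solr names,
# taken only from the comparison table.
ESP_TO_SOLR = {
  'query'  => 'q',
  'offset' => 'start',
  'hits'   => 'rows'
}.freeze

# Hypothetical view definitions. ESP resolves a view name server-side,
# while Solr takes the field list directly in fl.
VIEW_FIELDS = {
  'listing' => %w[title price body]
}.freeze

# Translate a hash of ESP-style parameters into Solr-style parameters.
# Unknown parameter names are passed through unchanged.
def esp_to_solr_params(esp_params)
  solr = {}
  esp_params.each do |name, value|
    case name
    when 'spell'
      solr['spellcheck'] = (value.to_s == '1').to_s
    when 'view'
      fields = VIEW_FIELDS[value]
      solr['fl'] = fields.join(',') if fields
    else
      solr[ESP_TO_SOLR.fetch(name, name)] = value.to_s
    end
  end
  solr
end

params = esp_to_solr_params(
  'query' => 'oslo', 'offset' => 10, 'hits' => 20, 'spell' => 1, 'view' => 'listing'
)
puts URI.encode_www_form(params)
```

A helper like this belongs in application space or search middleware, not in Solr itself.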
Your FAST system - overview
[Diagram: FAST system overview, with the labels "Your web-app" and "Search middleware?" added. Graphics: www.microsoft.com]
Migrating index profile
– ESP index profile -> Solr schema.xml
– Set up field types, use defaults or create your own
– Set up the static fields. ESP:
– Solr equivalent:
– No need for generic*, use dynamic fields:
Migrating index profile
– Composite fields?
• Solr can use <copyField> to copy multiple fields into one, e.g. as we did to map many attributes into one field
• However, to achieve ranking with a different boost for each field, Solr does not need a composite field. Use the DisMax query handler instead. Very powerful!
– No need to edit schema to add new fields. Using dynamic fields, it is easy to e.g. introduce a color facet for cars or a Mpixels facet for digital cameras
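A rough sketch of how dynamic fields look from the feeding side. It assumes the schema declares <dynamicField> patterns such as `*_s`, `*_i`, `*_f` and `*_b` (check your own schema.xml; nothing here guarantees they exist), and the helper names are made up for illustration.

```ruby
# Pick a field-name suffix from the Ruby type of the value.
# The suffixes are assumed to match <dynamicField> patterns in schema.xml.
def dynamic_field_name(attribute, value)
  suffix =
    case value
    when Integer     then '_i'
    when Float       then '_f'
    when true, false then '_b'
    else                  '_s'
    end
  "#{attribute}#{suffix}"
end

# Build a document hash whose attribute names carry the dynamic suffix.
def to_solr_doc(id, attributes)
  doc = { 'id' => id }
  attributes.each { |name, value| doc[dynamic_field_name(name, value)] = value }
  doc
end

car    = to_solr_doc('car-1', 'color' => 'red')
camera = to_solr_doc('cam-1', 'mpixels' => 12.1)
p car
p camera
```

A new attribute then only needs a new key in the document; the schema stays untouched.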
DisMax query example
– This Solr query can replace use of composite-field
• qt=dismax
• q=oslo
• qf=title^0.7 highpriorityfields^1.5 mediumpriorityfields^0.6 lowpriorityfields^0.2 recallfields^0.0 body^0.0
• bf=recip(rord(creationDate),1,1000,1000)
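The DisMax parameters above need URL encoding (spaces, `^`, parentheses) before they go on the wire. A minimal sketch of assembling such a URL, assuming Solr answers on localhost:8080 under /solr as in the other examples; `dismax_url` is a helper invented here.

```ruby
require 'uri'

# Base URL is an assumption; adjust to your own Solr host and port.
SOLR_SELECT = 'http://localhost:8080/solr/select'

# Build a DisMax select URL from a user query, a hash of field boosts
# and an optional boost function.
def dismax_url(user_query, field_boosts, boost_function = nil)
  params = {
    'qt' => 'dismax',
    'q'  => user_query,
    # qf is a space-separated list of field^boost pairs
    'qf' => field_boosts.map { |field, boost| "#{field}^#{boost}" }.join(' ')
  }
  params['bf'] = boost_function if boost_function
  "#{SOLR_SELECT}?#{URI.encode_www_form(params)}"
end

url = dismax_url(
  'oslo',
  { 'title' => 0.7, 'highpriorityfields' => 1.5, 'body' => 0.0 },
  'recip(rord(creationDate),1,1000,1000)'
)
puts url
```

Keeping the boosts in a hash makes relevance tuning a data change in the application rather than a schema change.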
Migrating content
– If using FAST ContentAPI to push programmatically
• Use Solr's clients (Java, .NET, Ruby, Python, PHP...)
– If feeding FastXML using FileTraverser
• Feed as Solr XML using HTTP POST or a POST client
– If you feed custom XML with XMLMapper
• Have a look at DIH's import and mapping features
Push Feeding example
– Feed XML using HTTP POST:
• curl "http://localhost:8080/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary @mydoc.xml
– Ruby example:
• > gem sources -a http://gemcutter.org
  > sudo gem install rsolr
  require 'rsolr'
  solr = RSolr.connect :url=>'http://localhost:8080/solr'
  documents = [{:id=>1, :price=>1.00}, {:id=>2, :price=>10.50}]
  solr.add documents
  solr.commit
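For the HTTP POST route, the file sent by curl is a Solr XML <add> message. A sketch of producing one from plain hashes, without any client library; `solr_add_xml` is a hypothetical helper, and multi-valued fields are deliberately left out.

```ruby
# Characters that must be escaped in XML text and attribute values.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

def xml_escape(value)
  value.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Turn an array of document hashes into a Solr XML <add> message.
def solr_add_xml(documents)
  docs = documents.map do |doc|
    fields = doc.map do |name, value|
      %(<field name="#{xml_escape(name)}">#{xml_escape(value)}</field>)
    end
    "<doc>#{fields.join}</doc>"
  end
  "<add>#{docs.join}</add>"
end

xml = solr_add_xml([{ :id => 1, :price => 1.00 }, { :id => 2, :title => 'Fish & chips' }])
puts xml
```

The resulting string can be written to mydoc.xml and posted with the curl command shown on the slide.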
Pull: DataImportHandler (DIH)
Querying examples
– http://localhost:8080/solr/select?q=car&fl=id,title
– Ruby
• res = solr.select :q=>'roses', :fq=>['red','white']
  res['response']['docs'].each do |doc|
    puts doc['title']
  end
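Without a client library, the same result handling works on the raw HTTP response when the query asks for JSON output (wt=json). A sketch, using an invented sample body in the shape Solr returns; the documents and the `titles_from` helper are made up for illustration.

```ruby
require 'json'

# Sample body in the shape Solr returns for wt=json; the documents are invented.
SAMPLE_RESPONSE = <<~JSON
  {
    "responseHeader": { "status": 0, "QTime": 3 },
    "response": {
      "numFound": 2,
      "start": 0,
      "docs": [
        { "id": "1", "title": "Used car for sale" },
        { "id": "2", "title": "Car rental in Oslo" }
      ]
    }
  }
JSON

# Extract the title of every returned document.
def titles_from(response_body)
  parsed = JSON.parse(response_body)
  parsed.fetch('response').fetch('docs').map { |doc| doc['title'] }
end

puts titles_from(SAMPLE_RESPONSE)
```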
Migrating document processing
– Solr lacks a sophisticated pipeline with entity extraction etc. Alternatives:
• Do extraction in Application space (Ruby)
• Write own stage in Solr pipeline for simple cases
• Integrate to do more advanced stuff
– Matchers/extractors
• LingPipe NamedEntityExtractor inside of OpenPipeline
– Synonyms:
• Use Solr's synonym handling index/query side
– Custom stages:
• Write a Solr UpdateProcessor (in Java, Jython etc)
– Got a LOT of custom FAST docproc stages?
• Have a look at SESAT's PY ProcServer for Solr (GPL)
Migrating linguistics (lemmatization)
– Solr ships with Stemming instead of Lemmatization
– Stemming has limitations
• Biler, bilen, bilene -> bil
BUT
• Bøker, bøkene -> bøk; boka, bok -> bok
– Kstem better. Free with LucidWorks for Solr
– If you need singular/plural handling only
• Free dictionaries? Check lucene-hunspell
– Lemmatization can be licensed from 3rd party such as Basistech, who also has language identification & entity extraction
– Language identification also from Sematext
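The stemming limitation can be reproduced with a toy suffix stripper. This is not the stemmer Solr ships; the suffix list is an assumption picked only to reproduce the Norwegian examples on the slide.

```ruby
# Toy suffix stripper for illustration only. It is NOT the stemmer that
# Solr ships; the suffix list is chosen to reproduce the slide's examples.
SUFFIXES = %w[ene en er a].freeze

def toy_stem(word)
  w = word.downcase
  SUFFIXES.each do |suffix|
    # Strip the first matching suffix, keeping at least three characters
    if w.end_with?(suffix) && w.length - suffix.length >= 3
      return w[0...-suffix.length]
    end
  end
  w
end

%w[Biler bilen bilene Bøker bøkene boka bok].each do |word|
  puts "#{word} -> #{toy_stem(word)}"
end
```

Suffix stripping cannot undo the vowel change in bok/bøker, so the singular and plural end up as different index terms and a search for one misses the other. A lemmatizer would map both to the same base form.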
Basistech Rosette for Lucene
– High-end linguistics capabilities for 19 languages
– Language Identification
– Segmentation and tokenization
– Lemmatization
– Noun decompounding
– Part-of-speech tagging
– Entity extraction
– Easily integrated with Lucene/Solr
– More: http://www.basistech.com/lucene/
Migrating search middleware
– Using FAST Unity?
• Consider migrating middleware logic such as external source querying and federation to SESAT (AGPL)
– Using Comperio Front?
• Must migrate custom query and response formats
• Consider SESAT as well for migrating flow logic
– Or is plain Solr enough?
• Solr has built-in support for shards
• A shard query will query multiple shards and merge the results into one
• Add custom processing as Query Components in Solr
• Check contrib & patches!
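A shard query is an ordinary select request with a `shards` parameter listing the cores to fan out to. A sketch with hypothetical host names (`search1`, `search2`); the entries are host:port/path without the http:// scheme.

```ruby
require 'uri'

# Build a distributed select URL. The coordinator is the node that
# receives the request and merges the results from all shards.
def sharded_select_url(coordinator, shards, query)
  params = { 'q' => query, 'shards' => shards.join(',') }
  "http://#{coordinator}/select?#{URI.encode_www_form(params)}"
end

url = sharded_select_url(
  'search1:8080/solr',
  ['search1:8080/solr', 'search2:8080/solr'],
  'oslo'
)
puts url
```

Any node can act as coordinator, so the application can keep talking to a single URL while the index grows across more shards.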
Migrating Web Crawler
– Solr has no built-in web crawler
• Instead you can choose from several integrations
– The Apache Nutch crawler
• Proven with hundreds of millions of pages
• http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
– Apache Droids
• Still in the incubator, but aims at becoming a full crawler
• http://incubator.apache.org/droids/
– Heritrix + Solr (example in Solr 1.4 book)
– OpenPipeline has a (very) simple crawler
– Lucene Connectors Framework
• Preparing crawler support
Migrating Connectors
– Solr handles these sources internally through DIH:
• Database, RSS, Web-services, Local filesystem
– Additionally through Lucene Connectors Framework:
• EMC Documentum, FileNet, JDBC, LiveLink, Patriarch (Memex), Meridio, SharePoint, RSS
• New connectors should be written for LCF
– Another option: Open Pipeline, supporting:
• SharePoint, IMAP, Documentum, Vignette, Filesystem
Operations
– Solr has no admin-server (coming in 1.5)
– Possible to run multiple Tomcat instances on the same server
– Multiple cores in same Tomcat – easier migration
– No built-in query reports, use 3rd party tools
– No built-in monitoring, have a look at Nagios?
More info
– Solr Wiki: http://wiki.apache.org/solr/
– Deployments: http://wiki.apache.org/solr/PublicServers
– Reference Guide: http://tinyurl.com/ygj3q9j
– Solr Book: http://tinyurl.com/solrbook
– Solr training: http://www.solrtraining.com/
Thank You
www.cominvent.com
www.twitter.com/cominvent
This presentation is licensed under the CC-by-sa license
You must attribute Cominvent with name and link