
Migrating Fast to Solr


DESCRIPTION

Slide deck with key points when migrating from FAST ESP to Apache Solr. By Cominvent AS


cominvent as


Migrating FAST to Solr
by Jan Høydahl

Enterprise Search Specialists


Consulting

– Cominvent delivers independent search consulting
– Focus on Apache Lucene/Solr & Microsoft FAST ESP
– We know both the proprietary and Open Source worlds, their benefits and disadvantages. We help you choose. We help you maximize your chosen engine, and we help you migrate as your requirements change.


Training

– Cominvent AS delivers public and on-site training
– Certified Solr Training Partner for Lucid Imagination
– Certified FAST ESP Training Partner

– Read more: http://www.cominvent.com/training/

Photo: fluidpowerzone.com


Commercial Support

– When community & mailing list support is not enough...
– Paid support agreement for Apache Solr/Lucene
– In cooperation with Lucid Imagination

– Read more: http://www.cominvent.com/support/


Jan Høydahl – experience

● IT architect, 15 years with search, telecom, mobile

● Helped build FAST's Global Services as first engineer

● Founder of Cominvent AS
● Search consultant 10 years
● Certified Solr instructor


Recommendations

«His skills on Fast ESP is in-depth, thorough, and probably amongst the best you can get. Jan is working independently, but also well in teams. Whether it is technical or business work, Jan does not fall behind. His excellent skills to see things from the holistic perspective is great.»

-Knut Stenmark, DPM AS


Sample consulting projects

Worldwide news agency
Chief architect of FAST ESP search solution, migrating from Autonomy IDOL. Real-time news, alerting etc.

Major Swedish newspaper
Architect for new Topic Page solution, letting editors define topics based on keywords and regex rules.

Norwegian Yellow Pages actor
Architect for migrating a traditional DB-backed catalog search to a modern single-search-box solution.

Classifieds and real estate online broker
Advice on migrating from DB to search. Architect for FAST ESP solution with Norwegian linguistics, search middleware and relevance tuning.

Leading news surveillance company
Helped implement and tune real-time search using FAST ESP and real-time alerting using FAST RTA.


Sample Solr Training references

– Danish national library organization serving all Danish libraries

– Migrating from in-house search to Apache Solr for all their search

– Delivered Solr training course in 2010

– Global library org, serving hundreds of libraries world wide

– Helping them migrate from FAST to Solr

– First step is Classroom Training in March 2010

Library organization


About Apache Solr

– Open Source enterprise search server
– Built on the popular Apache Lucene library
– 100% Java, runs on all platforms and environments
– Supports billions of documents, high scalability and advanced features like faceting, highlighting, document format conversions, GEO search etc.
– Indexes most languages including CJK
– Platform not language aware, but each field can be configured for language-specific tokenization, stemming, stop word processing etc.
– Very active developer and user communities
– Apache 2.0 license – commercially friendly
– Rapid growth in companies providing support etc.


Solr-user community growth

[Chart: Solr-user growth. X-axis: Month, Jan 2006 to Feb 2010. Y-axis: Messages, 0 to 1600.]


Lucene/Solr deployments

– More: http://wiki.apache.org/solr/PublicServers

Thanks to Lucid Imagination for logo collection


Solr in media & newspapers

– News search. Also exposes API

– Danish news search

– Swedish news search

– Swedish news search

– Faceted search through classifieds

– Eastern european classifieds


Sample FAST-Solr switchers

– Human Rights search
  • hurisearch.org (blog)
– FINN katalog (former Sesam)
  • katalog.finn.no (announce)
– Mocality – African business search
  • mocality.co.ke (linkedin)
– International library search
  • Large multi-lingual index
– Norwegian media house
  • Multiple newspapers


Solr Architecture


The migration...


Migration objectives

– Possible objectives include:
  • Lower maintenance cost
  • Deeper in-house competency
  • Less dependent on external consultants
  • Ownership and visibility of source code
  • Shorter time to market for new features
  • Bugs fixed faster – or even fix ourselves
  • Larger community, mailing lists that work!
  • More choice in external consultants
  • Contribute back to Open Source
  • Lower HW footprint


Migration steps

– Knowledge gathering & Training
– Review current features & arch
  • Want to keep all features? Add new?
– Migration areas:
  • Index profile
  • Content
  • Feeding
  • Document Processing
  • Querying
  • Search middleware?
  • Admin & Operational
– What to do in Application space vs Search space?


Feature comparison ESP – Solr (similarities)

Feature | ESP | Solr
Full-text, boolean, range search, sorting, sub-second, facets, did-you-mean, synonyms | Yes | Yes
Scaling for QPS | Add rows | Add rows
Scaling for document volume | Add columns | Add shards
Synonyms | Index/query side | Index/query side
GEO search | Yes | Yes (1.5)
Boolean query language | Yes (FQL) | Yes (Lucene or (e)DisMax)
APIs | HTTP, Java, .NET, C++, PHP | HTTP, Java, .NET, Ruby, Python, PHP, Perl, JS


Feature comparison ESP – Solr (differences)

Feature | ESP | Solr
Admin server | Yes | No (coming 1.5)
Processes | Many (C++, Java, Python) | One WAR in Java app-server, 100% Java
Navigators / Facets | Index-time | Query-time
Did-you-mean | Dictionary based | Dictionary or index based
Feeding | API only | HTTP POST or API
Document processing | Pipeline (py) | Simple pipeline (Java, JS, Groovy, Jython, JRuby..)
Multi field querying | Composite fields | DisMax handler


Feature comparison ESP – Solr (differences)

Feature | ESP | Solr
Relevancy tuning | Rank profiles, term boosting | Dynamic function queries and boost functions
XRANK | XRANK operator | Function Queries
Freshness boost | Freshness in rank profile | Function Queries
Boost GEO distance | Rank profile and special | Function Queries
Major schema or software updates | Cold update, use stage environment | Stage new content into new Solr core
Pluggability | Docprocs, clients | Everything :) Request Handlers, Query Parsers, Docprocs, Rank, Spell, tokenizer++


Feature comparison ESP – Solr (differences)

Feature | ESP | Solr
Lemmatization | Can be licensed for many languages | Can be licensed from 3rd party
Query syntax (boolean) | and(a:foo, b:bar) | a:foo AND b:bar
Query syntax (integer range) | i:range(0, 100) | i:[0 TO 100]
Query syntax (date range) | d:range(2000-01-01T00:00:00, 2010-03-03T12:00:00) | d:[2000-01-01T00:00:00Z TO NOW]
Query params | query=, offset=, hits=, spell=1 | q=, start=, rows=, spellcheck=true
What fields to return | view=viewname | fl=title,price,body...
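
The parameter rows in the table lend themselves to a small translation layer in the application. The following is a minimal Python sketch of that idea, not a complete migration tool: the parameter names come from the table, while the `translate_params` helper, the `compact` view and its field list are hypothetical. It renames parameters only; the query string itself (FQL versus Lucene syntax) still has to be converted separately.

```python
# Parameter names from the comparison table: ESP name -> Solr name.
ESP_TO_SOLR = {"query": "q", "offset": "start", "hits": "rows"}


def translate_params(esp_params, view_fields=None):
    """Rename ESP-style request parameters to their Solr counterparts.

    view_fields maps an ESP view name to the list of fields it returns,
    because Solr lists the fields explicitly in fl. Returns the Solr
    parameters plus the names of any parameters that were not translated.
    """
    view_fields = view_fields or {}
    solr, skipped = {}, []
    for name, value in esp_params.items():
        if name in ESP_TO_SOLR:
            solr[ESP_TO_SOLR[name]] = value
        elif name == "spell":
            solr["spellcheck"] = "true" if str(value) == "1" else "false"
        elif name == "view" and value in view_fields:
            solr["fl"] = ",".join(view_fields[value])
        else:
            skipped.append(name)
    return solr, skipped


# Hypothetical request and view definition, for illustration only.
solr_params, skipped = translate_params(
    {"query": "oslo", "offset": "10", "hits": "20", "spell": "1", "view": "compact"},
    view_fields={"compact": ["title", "price"]},
)
print(solr_params)
print(skipped)
```

Returning the untranslated names makes it easy to log which ESP parameters still need a decision during the migration.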


Your FAST system - overview

Search middleware?

Your web-app

Graphics diagram: www.microsoft.com


Migrating index profile

– ESP index profile -> Solr schema.xml
– Set up field types, use defaults or create your own
– Set up the static fields. ESP:

– Solr equivalent:

– No need for generic*, use dynamic fields:


Migrating index profile

– Composite fields?
  • Solr can use <copyField> to copy multiple fields into one, e.g. as we did to map many attributes into one field
  • However, to achieve ranking with a different boost for each field, Solr does not need a composite field. Use the DisMax query handler instead. Very powerful!
– No need to edit the schema to add new fields. Using dynamic fields, it is easy to e.g. introduce a color facet for cars or an Mpixels facet for digital cameras
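
As an illustration of the index profile mapping, here is a minimal Python sketch that prints schema.xml entries for static fields, a dynamic field and copyField rules. The field inventory is hypothetical and only stands in for whatever the ESP index profile defines; the `text` and `string` types are assumed to be declared in the types section of schema.xml, and `*_facet` is just one possible naming pattern for dynamic fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical field inventory standing in for an ESP index profile.
FIELDS = [
    {"name": "id", "type": "string", "indexed": "true", "stored": "true"},
    {"name": "title", "type": "text", "indexed": "true", "stored": "true"},
    {"name": "body", "type": "text", "indexed": "true", "stored": "false"},
    # Catch-all target for copyField; multiValued because several sources feed it.
    {"name": "alltext", "type": "text", "indexed": "true", "stored": "false",
     "multiValued": "true"},
]
# Any field named e.g. color_facet or mpixels_facet matches without a schema edit.
DYNAMIC_FIELDS = [
    {"name": "*_facet", "type": "string", "indexed": "true", "stored": "true"},
]
COPY_FIELDS = [("title", "alltext"), ("body", "alltext")]


def schema_fragment(fields, dynamic_fields, copy_fields):
    """Return one schema.xml line per field, dynamicField and copyField."""
    elements = [ET.Element("field", f) for f in fields]
    elements += [ET.Element("dynamicField", f) for f in dynamic_fields]
    elements += [ET.Element("copyField", {"source": s, "dest": d})
                 for s, d in copy_fields]
    return "\n".join(ET.tostring(e, encoding="unicode") for e in elements)


print(schema_fragment(FIELDS, DYNAMIC_FIELDS, COPY_FIELDS))
```

The printed lines are meant to be pasted into an existing schema.xml and reviewed by hand, not used as a finished schema.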


DisMax query example

– This Solr query can replace use of composite-field
  • qt=dismax
  • q=oslo
  • qf=title^0.7 highpriorityfields^1.5 mediumpriorityfields^0.6 lowpriorityfields^0.2 recallfields^0.0 body^0.0
  • bf=recip(rord(creationDate),1,1000,1000)
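
The parameters above have to be URL-encoded before they are sent to Solr. A minimal Python sketch of how an application might assemble the request follows; the `dismax_url` helper is hypothetical, the base URL is the one used in the other examples in this deck, and only three of the fields are listed to keep it short.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def dismax_url(base_url, user_query, field_boosts, boost_function=None):
    """Build a select URL that searches several fields with per-field boosts."""
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    params = [("qt", "dismax"), ("q", user_query), ("qf", qf)]
    if boost_function:
        params.append(("bf", boost_function))
    return f"{base_url}/select?{urlencode(params)}"


url = dismax_url(
    "http://localhost:8080/solr",
    "oslo",
    {"title": 0.7, "highpriorityfields": 1.5, "body": 0.0},
    boost_function="recip(rord(creationDate),1,1000,1000)",
)
print(url)
# Decoding the query string again shows the parameters Solr will receive.
print(parse_qs(urlsplit(url).query))
```

Keeping the boosts in a dictionary means relevance tuning becomes a configuration change in the application instead of a schema change.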


Migrating content

– If using FAST ContentAPI to push programmatically
  • Use Solr's clients (Java, .NET, Ruby, Python, PHP...)
– If feeding FastXML using FileTraverser
  • Feed as Solr XML using HTTP POST or a POST client
– If you feed custom XML with XMLMapper
  • Have a look at DIH's import and mapping features


Push Feeding example

– Feed XML using HTTP POST:
  • curl "http://localhost:8080/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary @mydoc.xml
– Ruby example:
  • >gem sources -a http://gemcutter.org
    >sudo gem install rsolr
    require 'rsolr'
    # Point the client at the same /solr webapp the curl example posts to
    solr = RSolr.connect :url=>'http://localhost:8080/solr'
    documents = [{:id=>1, :price=>1.00},
                 {:id=>2, :price=>10.50}]
    # Send both documents, then commit to make them searchable
    solr.add documents
    solr.commit
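
For applications without a ready-made client, the same push can be done with plain HTTP. This is a minimal Python sketch under a few assumptions: the update URL is the one from the curl example, the helper names `build_add_xml` and `post_update` are made up for illustration, and the documents are the two from the Ruby example. `post_update` is defined but not called here, since it needs a running Solr.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_add_xml(documents):
    """Serialize a list of dicts into a Solr XML <add> message."""
    parts = ["<add>"]
    for doc in documents:
        parts.append("<doc>")
        for name, value in doc.items():
            # Escape names and values so that &, < and > cannot break the XML.
            parts.append(f"<field name={quoteattr(name)}>{escape(str(value))}</field>")
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def post_update(xml_body, update_url="http://localhost:8080/solr/update?commit=true"):
    """POST the XML message to Solr and return the HTTP status code."""
    request = urllib.request.Request(
        update_url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


message = build_add_xml([{"id": 1, "price": 1.00}, {"id": 2, "price": 10.50}])
print(message)
```

Committing on every request is fine for a demo; a real feeder would normally batch documents and commit less often.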


Pull: DataImportHandler (DIH)


Querying examples

– http://localhost:8080/solr/select?q=car&fl=id,title

– Ruby
  • res=solr.select :q=>'roses', :fq=>['red','white']
    res['response']['docs'].each do |doc|
      puts doc['title']
    end
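
On the receiving side, asking Solr for JSON (wt=json) gives the application a structure that is easy to walk. The Python sketch below parses such a response; the sample body is a made-up, trimmed-down illustration of the response layout, and `extract_titles` and `total_hits` are hypothetical helper names.

```python
import json

# Made-up example of what /select?q=roses&fl=id,title&wt=json could return.
SAMPLE_RESPONSE = """{
  "responseHeader": {"status": 0, "QTime": 3},
  "response": {"numFound": 2, "start": 0, "docs": [
    {"id": "1", "title": "Red roses"},
    {"id": "2", "title": "White roses"}
  ]}
}"""


def extract_titles(response_body):
    """Return the title of every document on the current result page."""
    data = json.loads(response_body)
    return [doc.get("title") for doc in data["response"]["docs"]]


def total_hits(response_body):
    """Return the total number of matching documents, not just this page."""
    return json.loads(response_body)["response"]["numFound"]


for title in extract_titles(SAMPLE_RESPONSE):
    print(title)
print(total_hits(SAMPLE_RESPONSE))
```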


Migrating document processing

– Solr lacks a sophisticated pipeline with entity extraction etc. Alternatives:
  • Do extraction in Application space (Ruby)
  • Write own stage in Solr pipeline for simple cases
  • Integrate to do more advanced stuff
– Matchers/extractors
  • LingPipe NamedEntityExtractor inside of OpenPipeline
– Synonyms:
  • Use Solr's synonym handling index/query side
– Custom stages:
  • Write a Solr UpdateProcessor (in Java, Jython etc)
– Got a LOT of custom FAST docproc stages?
  • Have a look at SESAT's PY ProcServer for Solr (GPL)


Migrating linguistics (lemmatization)

– Solr ships with Stemming instead of Lemmatization
– Stemming has limitations
  • Biler, bilen, bilene -> bil
    BUT
  • Bøker, bøkene -> bøk; boka, bok -> bok
– Kstem better. Free with LucidWorks for Solr
– If you need singular/plural handling only
  • Free dictionaries? Check lucene-hunspell
– Lemmatization can be licensed from 3rd party such as Basistech, who also has language identification & entity extraction
– Language identification also from Sematext


Basistech Rosette for Lucene

– High-end linguistics capabilities for 19 languages
– Language Identification
– Segmentation and tokenization
– Lemmatization
– Noun decompounding
– Part-of-speech tagging
– Entity extraction

– Easily integrated with Lucene/Solr

– More: http://www.basistech.com/lucene/


Migrating search middleware

– Using FAST Unity?
  • Consider migrating middleware logic such as external source querying and federation to SESAT (AGPL)
– Using Comperio Front?
  • Must migrate custom query and response formats
  • Consider SESAT as well for migrating flow logic
– Or is plain Solr enough?
  • Solr has built-in support for shards
  • A shard query will query multiple shards and merge the results into one
  • Add custom processing as Query Components in Solr
  • Check contrib & patches!
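
A shard query is an ordinary select request with an extra `shards` parameter listing the cores to fan out to. A minimal Python sketch, assuming two hypothetical shard hosts (`solr1`, `solr2`) and a hypothetical helper name:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def distributed_select_url(coordinator_url, query, shards, rows=10):
    """Build a select URL that asks one Solr node to query all listed shards."""
    # Shards are given as host:port/path without the http:// prefix.
    params = {"q": query, "rows": rows, "shards": ",".join(shards)}
    return f"{coordinator_url}/select?{urlencode(params)}"


url = distributed_select_url(
    "http://localhost:8080/solr",
    "car",
    ["solr1:8080/solr", "solr2:8080/solr"],  # hypothetical shard hosts
)
print(url)
```

The node that receives the request merges the per-shard results, so the application still sees one result list.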


Migrating Web Crawler

– Solr has no built-in web crawler
  • Instead you can choose from several integrations
– The Apache Nutch crawler
  • Proven with hundreds of millions of pages
  • http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
– Apache Droids
  • Still in the incubator, but aims at becoming a full crawler
  • http://incubator.apache.org/droids/
– Heritrix + Solr (example in Solr 1.4 book)
– OpenPipeline has a (very) simple crawler
– Lucene Connectors Framework
  • Preparing crawler support


Migrating Connectors

– Solr handles these sources internally through DIH:
  • Database, RSS, Web-services, Local filesystem
– Additionally through Lucene Connectors Framework:
  • EMC Documentum, FileNet, JDBC, LiveLink, Patriarch (Memex), Meridio, SharePoint, RSS
  • New connectors should be written for LCF
– Another option: OpenPipeline, supporting:
  • SharePoint, IMAP, Documentum, Vignette, Filesystem


Operations

– Solr has no admin-server (coming in 1.5)
– Possible to run multiple Tomcat on same server
– Multiple cores in same Tomcat – easier migration
– No built-in query reports, use 3rd party tools
– No built-in monitoring, have a look at Nagios?
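
One way to cover the monitoring gap is a small Nagios-style check against Solr's ping handler. The Python sketch below is an illustration under stated assumptions, not a finished plugin: it assumes the ping handler is enabled at /solr/admin/ping and answers with an XML body containing a `status` string of `OK`, and the sample body is a made-up illustration of that layout. Exit codes follow the Nagios plugin convention (0 = OK, 2 = CRITICAL). `fetch_ping` is defined but not called here, since it needs a running Solr.

```python
import urllib.request
import xml.etree.ElementTree as ET

OK, CRITICAL = 0, 2  # Nagios plugin exit codes

# Made-up example of a healthy ping response (layout is an assumption).
SAMPLE_PING = (
    '<response><lst name="responseHeader"><int name="status">0</int>'
    '<int name="QTime">2</int></lst><str name="status">OK</str></response>'
)


def fetch_ping(url="http://localhost:8080/solr/admin/ping", timeout=5):
    """Fetch the ping response body from a running Solr."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def check_ping(body):
    """Map a ping response body to a Nagios exit code and status line."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return CRITICAL, "CRITICAL: unparseable ping response"
    status = root.findtext("str[@name='status']")
    if status == "OK":
        return OK, "OK: Solr ping succeeded"
    return CRITICAL, f"CRITICAL: ping status {status!r}"


code, message = check_ping(SAMPLE_PING)
print(code, message)
```

A real plugin would call `fetch_ping`, treat connection errors as CRITICAL, and pass the code to `sys.exit`.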


More info

– Solr WIKI: http://wiki.apache.org/solr/
– Deployments: http://wiki.apache.org/solr/PublicServers
– Reference Guide: http://tinyurl.com/ygj3q9j
– Solr Book: http://tinyurl.com/solrbook
– Solr training: http://www.solrtraining.com/


Thank You

www.cominvent.com

www.twitter.com/cominvent


This presentation is licensed under a CC-by-sa license
You must attribute Cominvent with name and link