Big Search w/ Big Data Principles Basis Technology Open Source Search 2012 Eric Pugh | [email protected] | @dep4b Tuesday, October 2, 2012

OSSCON: Big Search 4 Big Data


DESCRIPTION

At Basis Technology's Open Source Search Conference, I talked about a project I did this past year and the lessons, both good and bad, that we learned.


Page 1: OSSCON: Big Search 4 Big Data

Big Search w/ Big Data Principles

Basis Technology Open Source Search 2012 Eric Pugh | [email protected] | @dep4b

Tuesday, October 2, 2012

Page 3: OSSCON: Big Search 4 Big Data

Who am I?

• Principal of OpenSource Connections - Solr/Lucene Search Consultancy

• Member of Apache Software Foundation

• SOLR-284 UpdateRichDocuments (July 07)

• Fascinated by the art of software development


Page 4: OSSCON: Big Search 4 Big Data

CO-AUTHOR

2nd edition!


Page 5: OSSCON: Big Search 4 Big Data

Telling some war stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes


Page 6: OSSCON: Big Search 4 Big Data

Not an intro to SolrCloud!

• Great tutorials given by Tomás Fernández Löbbe from LucidWorks yesterday!


Page 7: OSSCON: Big Search 4 Big Data

Background for Client X’s Project

• Big Data is any data set that is primarily at rest due to the difficulty of working with it.

• 100’s of millions of documents to search

• Limited selection of tools available.

• Aggressive timeline.

• All the data must be searched per query.

• On Solr 3.x line


Page 8: OSSCON: Big Search 4 Big Data

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes


Page 9: OSSCON: Big Search 4 Big Data

Boy meets Girl Story

[Diagram labels: Metadata; Content Files; Ingest Pipeline; Solr (×4)]


Page 10: OSSCON: Big Search 4 Big Data

Bash Rocks


Page 11: OSSCON: Big Search 4 Big Data

Bash Rocks

• Remote Solr stop/start scripts

• Remote Indexer stop/start scripts

• Performance Monitoring

• Content Extraction scripts (+Java)

• Ingestor Scripts (+Java)

• Artifact Deployment (CM)


Page 12: OSSCON: Big Search 4 Big Data

Make it easy to change approach


Page 13: OSSCON: Big Search 4 Big Data

Make it easy to change sharding

public void run(Map options, List<SolrInputDocument> docs)
    throws InstantiationException, IllegalAccessException, ClassNotFoundException {
  // Load the sharding strategy by class name so it can be swapped without touching this code.
  IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
      "com.o19s.solr.ModShardIndexStrategy").newInstance();
  indexStrategy.configure(options);
  // The strategy decides which shard each document goes to.
  for (SolrInputDocument doc : docs) {
    indexStrategy.addDocument(doc);
  }
}
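The slide does not show what the strategy class does internally. As a minimal sketch, assuming "mod shard" means hashing the document id modulo the shard count, a self-contained version might look like the following. The `IndexStrategy` interface here is a hypothetical stand-in, plain maps replace `SolrInputDocument`, and the `shardCount` option name is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the IndexStrategy type named on the slide.
interface IndexStrategy {
    void configure(Map<String, String> options);
    void addDocument(Map<String, Object> doc);
}

// Routes each document to a shard by hashing its id modulo the shard count.
class ModShardIndexStrategy implements IndexStrategy {
    private final List<List<Map<String, Object>>> shards = new ArrayList<>();

    @Override
    public void configure(Map<String, String> options) {
        // "shardCount" is an assumed option name; configure must run before addDocument.
        int shardCount = Integer.parseInt(options.getOrDefault("shardCount", "1"));
        shards.clear();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
    }

    int shardFor(String id) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(id.hashCode(), shards.size());
    }

    @Override
    public void addDocument(Map<String, Object> doc) {
        // A real strategy would send the document to that shard's Solr core;
        // this sketch only buffers it in memory.
        shards.get(shardFor(String.valueOf(doc.get("id")))).add(doc);
    }

    int sizeOfShard(int shard) {
        return shards.get(shard).size();
    }
}
```

Because the caller only sees the interface, trying a different layout (by date, by source, round robin) means writing another class and changing one class name.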


Page 14: OSSCON: Big Search 4 Big Data

Separate JVM from Solr Cores

• Step 1: Fire up empty Solr instances on all the servers (nohup &).

• Step 2: Verify they started cleanly.

• Step 3: Create cores (curl 'http://search1.o19s.com:8983/solr/admin/cores?action=CREATE&name=run2')

• Step 4: Create an “aggregator” core, passing in the URLs of the cores (&property.shards=).
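Steps 3 and 4 are easy to script once the host list is known. The sketch below only builds the request URLs; it assumes the CoreAdmin handler sits at its stock `/solr/admin/cores` path, that shard entries are written as `host:port/solr/core`, and that the class and method names are made up for illustration. A real CREATE call usually needs further parameters such as `instanceDir`.

```java
import java.util.ArrayList;
import java.util.List;

class CoreSetup {
    // Step 3: one CoreAdmin CREATE URL per host for the given run name.
    static List<String> createCoreUrls(List<String> hosts, String run) {
        List<String> urls = new ArrayList<>();
        for (String host : hosts) {
            urls.add("http://" + host + "/solr/admin/cores?action=CREATE&name=" + run);
        }
        return urls;
    }

    // Step 4: CREATE URL for the aggregator core, with the shard list
    // passed as a core property.
    static String aggregatorUrl(String host, String name, List<String> hosts, String run) {
        List<String> shards = new ArrayList<>();
        for (String h : hosts) {
            shards.add(h + "/solr/" + run);
        }
        return "http://" + host + "/solr/admin/cores?action=CREATE&name=" + name
                + "&property.shards=" + String.join(",", shards);
    }
}
```

Keeping the JVMs running and creating cores per run means a new indexing attempt is a handful of HTTP calls rather than a restart of every server.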


Page 15: OSSCON: Big Search 4 Big Data

Go Wide Quickly


Page 16: OSSCON: Big Search 4 Big Data

[Diagram: groups of shards on search1.o19s.com at ports :8983, :8984, and :8985; then groups of shards on search1.o19s.com, search2.o19s.com, and search3.o19s.com, each at :8983]


Page 17: OSSCON: Big Search 4 Big Data

Simple Pipeline

• Simple pipeline

• mv is atomic
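The same handoff can be done from Java. This is a sketch, not the project's code: a stage writes its output in its own working directory and then moves the finished file into the next stage's directory, so the next stage never sees a half-written file. The class and method names are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class StageHandoff {
    // Moves a finished file into the next stage's directory in one atomic step.
    static Path handOff(Path finished, Path nextStageDir) throws IOException {
        Path target = nextStageDir.resolve(finished.getFileName());
        // ATOMIC_MOVE fails with AtomicMoveNotSupportedException instead of
        // silently falling back to copy-then-delete, for example when the two
        // directories are on different file systems.
        return Files.move(finished, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The atomicity only holds when both directories live on the same file system, which is worth checking before relying on it.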


Page 18: OSSCON: Big Search 4 Big Data

Don’t Move Files

• SCP across machines is slow/error prone

• NFS share, single point of failure.

• Clustered file system like GFS (Global File System) can have “fencing” issues

• HDFS shines here.

• ZooKeeper shines here.


Page 19: OSSCON: Big Search 4 Big Data

Can you test your changes?


Page 20: OSSCON: Big Search 4 Big Data

JVM tuning is black art

-verbose:gc
-XX:+PrintGCDetails
-server
-Xmx8G
-Xms8G
-XX:MaxPermSize=256m
-XX:PermSize=256m
-XX:+AggressiveHeap
-XX:+DisableExplicitGC
-XX:ParallelGCThreads=16
-XX:+UseParallelOldGC
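With this many flags it helps to confirm what a running JVM actually received. The following sketch uses only the standard management API; the class name is illustrative, and it reports the flags of whichever JVM runs it, so it would need to run inside (or be adapted to attach to) the Solr process.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class JvmFlags {
    // Command-line flags the JVM was started with, e.g. -Xmx8G.
    static List<String> inputArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    // Maximum heap the JVM will try to use, in megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }
}
```

Logging both values at startup gives each tuning experiment a record of the settings it really ran with.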


Page 21: OSSCON: Big Search 4 Big Data


Page 22: OSSCON: Big Search 4 Big Data

Run, don’t Walk


Page 23: OSSCON: Big Search 4 Big Data

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes


Page 24: OSSCON: Big Search 4 Big Data

Using Solr as key/value store

[Diagram labels: Metadata; Content Files; Ingest Pipeline; Solr (×4); Solr Key/Value Cache]


Page 25: OSSCON: Big Search 4 Big Data

Using Solr as key/value store

• Thousands of queries per second without real-time get.

• How fast with real-time get?

http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html

http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
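The two lookups differ only in the handler and the parameter carrying the key. A small sketch of building both URLs for a given document id follows; the class and method names are illustrative, and the values are URL-encoded, so the output is an escaped but equivalent form of the URLs on the slide.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class KeyValueLookup {
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Ordinary search on the id field through the /select handler.
    static String selectUrl(String coreUrl, String id, String fields) {
        return coreUrl + "/select?q=" + enc("id:" + id) + "&fl=" + enc(fields);
    }

    // Real-time get through the /get handler, keyed directly by id.
    static String getUrl(String coreUrl, String id, String fields) {
        return coreUrl + "/get?id=" + enc(id) + "&fl=" + enc(fields);
    }
}
```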


Page 26: OSSCON: Big Search 4 Big Data

Push schema definition to the application

• Not “schema less”

• Just different owner of schema!

• Schema may have common set of fields like id, type, timestamp, version

• Nothing required.

q=intensity_i:[0 TO 70]&fq=TYPE:streetlamp_monitor
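One way the application can own the schema is to lean on dynamic fields and pick the field suffix from the value's type. The sketch below assumes the suffix patterns from Solr's example schema (`*_i`, `*_l`, `*_d`, `*_b`, `*_s`) are configured; the class name, the method names, and the lowercase `type` field are illustrative choices, not taken from the project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DynamicFields {
    // Chooses a dynamic-field suffix from the Java type of the value.
    static String suffixFor(Object value) {
        if (value instanceof Integer) return "_i";
        if (value instanceof Long) return "_l";
        if (value instanceof Double) return "_d";
        if (value instanceof Boolean) return "_b";
        return "_s"; // fall back to a string field
    }

    // Turns an application record into Solr field names and values.
    static Map<String, Object> toSolrFields(String id, String type, Map<String, Object> attrs) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("id", id);     // common fields keep fixed names
        fields.put("type", type);
        for (Map.Entry<String, Object> e : attrs.entrySet()) {
            fields.put(e.getKey() + suffixFor(e.getValue()), e.getValue());
        }
        return fields;
    }
}
```

An integer attribute named `intensity` would be indexed as `intensity_i`, which is the kind of field the range query above searches.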


Page 27: OSSCON: Big Search 4 Big Data

Don’t do expensive things in Solr

• Tika content extraction aka Solr Cell

• UpdateRequestProcessorChain



Page 32: OSSCON: Big Search 4 Big Data

Beware JavaBin

[Diagram labels: Metadata; Content Files; Ingest Pipeline; Solr (×4); Solr Key/Value Cache; Solr 3.4; Solr 4]

Which SolrJ version do I use?


Page 33: OSSCON: Big Search 4 Big Data

No JavaBin

• Avoid Jarmaggeddon

• Reflection? Ugh.

Give me /update/avro!


Page 34: OSSCON: Big Search 4 Big Data

Avro!

• Supports serialization of data readable from multiple languages

• It’s smart XML, w/o the XML!

• Handles forward and backward compatibility between versions of an object

• Compact and fast to read.


Page 35: OSSCON: Big Search 4 Big Data

Avro!

[Diagram labels: Metadata; Content Files; Ingest Pipeline; Solr (×4); Solr Key/Value Cache; .avro]


Page 36: OSSCON: Big Search 4 Big Data

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes


Page 37: OSSCON: Big Search 4 Big Data

Upgrade Lucene Indexes Easily

• Don’t reindex!

• Try out new versions of Lucene based search engines.

David Lyle

java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir


Page 38: OSSCON: Big Search 4 Big Data

Indexing is Easy and Quick


Page 39: OSSCON: Big Search 4 Big Data

CHEAP AND CHEERFUL

Page 40: OSSCON: Big Search 4 Big Data

NRT versus BigData


Page 41: OSSCON: Big Search 4 Big Data

The tension between scale and update rate

[Diagram labels: 10 million; 100’s of millions; Bad Place]


Page 42: OSSCON: Big Search 4 Big Data

Grim Reaper

Page 43: OSSCON: Big Search 4 Big Data

Delayed Replication

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://localhost:8983/solr/replication</str>
    <str name="pollInterval">36:00:00</str>
  </lst>
</requestHandler>
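The `pollInterval` value is written as HH:mm:ss, so 36:00:00 asks the slave to poll every 36 hours. A quick way to sanity-check such a value is sketched below; this is an illustrative helper, not Solr's own parser.

```java
class PollInterval {
    // Converts an HH:mm:ss interval string to milliseconds.
    static long toMillis(String interval) {
        String[] parts = interval.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected HH:mm:ss, got " + interval);
        }
        long hours = Long.parseLong(parts[0]);
        long minutes = Long.parseLong(parts[1]);
        long seconds = Long.parseLong(parts[2]);
        return ((hours * 60 + minutes) * 60 + seconds) * 1000L;
    }
}
```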


Page 44: OSSCON: Big Search 4 Big Data

Enable/Disable

• SOLR-3301


Page 45: OSSCON: Big Search 4 Big Data

Enable/Disable

<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">MY HARD QUERY</str>
    <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_3</str>
  </lst>
  <lst name="defaults">
    <str name="echoParams">all</str>
  </lst>
  <str name="healthcheckFile">server-enabled.txt</str>
</requestHandler>
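With `healthcheckFile` configured, the ping handler reports the node as unavailable whenever that file is missing, so a load balancer polling `/admin/ping` stops sending it traffic. Taking a node in or out of rotation then comes down to creating or deleting one file. The sketch below shows that toggle; the class name is illustrative and the caller is assumed to know where the core resolves `server-enabled.txt`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class HealthcheckToggle {
    // Puts the node in rotation by making sure the healthcheck file exists.
    static void enable(Path healthcheckFile) throws IOException {
        if (!Files.exists(healthcheckFile)) {
            Files.createFile(healthcheckFile);
        }
    }

    // Takes the node out of rotation; a missing file is not an error.
    static void disable(Path healthcheckFile) throws IOException {
        Files.deleteIfExists(healthcheckFile);
    }

    static boolean isEnabled(Path healthcheckFile) {
        return Files.exists(healthcheckFile);
    }
}
```

Disabling a node before a long replication or rebuild keeps half-updated indexes out of user queries.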


Page 46: OSSCON: Big Search 4 Big Data

Provisioning

• Chef/Puppet

• ZooKeeper

• Have you versioned everything to build an index over again?


Page 47: OSSCON: Big Search 4 Big Data

TRADITIONAL ENVIRONMENT


Page 48: OSSCON: Big Search 4 Big Data

POOLED ENVIRONMENT

think Cloud!


Page 49: OSSCON: Big Search 4 Big Data

Do I need Failover?

• Can I build quickly?

• Do I have a reliable cluster of servers?

• Am I spread across data centers?

• Failover is sooo 90’s...


Page 50: OSSCON: Big Search 4 Big Data

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes


Page 51: OSSCON: Big Search 4 Big Data

One more thought...


Page 52: OSSCON: Big Search 4 Big Data

Measuring the impact of our algorithm changes is just getting harder with Big Data.
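One simple, assumed way to put a number on such a change is to run the same query before and after it and compare the top results. The sketch below computes the fraction of the baseline's top k document ids that still appear in the candidate's top k. It is an illustrative metric with made-up names, not a description of any particular tool.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ResultOverlap {
    // Fraction of the baseline's top-k ids that also appear in the candidate's top k.
    // Returns 0.0 when the baseline has no results. Ids are assumed unique per list.
    static double overlapAtK(List<String> baseline, List<String> candidate, int k) {
        Set<String> top = new HashSet<>(baseline.subList(0, Math.min(k, baseline.size())));
        if (top.isEmpty()) {
            return 0.0;
        }
        int shared = 0;
        for (String id : candidate.subList(0, Math.min(k, candidate.size()))) {
            if (top.contains(id)) {
                shared++;
            }
        }
        return (double) shared / top.size();
    }
}
```

Tracking this score over a fixed set of representative queries shows which changes reshuffle results heavily and deserve a closer look.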


Page 53: OSSCON: Big Search 4 Big Data

Project SolrPanl

Page 54: OSSCON: Big Search 4 Big Data

Thank you!

Questions?

[email protected]

• @dep4b

• www.opensourceconnections.com
