At Basis Technology's Open Source Search conference, I talked about a project I did this past year and the lessons, both good and bad, that we learned.
Big Search w/ Big Data Principles
Basis Technology Open Source Search 2012
Eric Pugh | @dep4b
Tuesday, October 2, 2012
What is Big Search?
Who am I?
• Principal of OpenSource Connections - Solr/Lucene Search Consultancy
• Member of Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 07)
• Fascinated by the art of software development
CO-AUTHOR
2nd edition!
Telling some war stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
Not an intro to SolrCloud!
• Great tutorials given by Tomás Fernández Löbbe from LucidWorks yesterday!
Background for Client X’s Project
• Big Data is any data set that is primarily at rest due to the difficulty of working with it.
• 100’s of millions of documents to search
• Limited selection of tools available.
• Aggressive timeline.
• All the data must be searched per query.
• On Solr 3.x line
Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
Boy meets Girl Story
[Diagram: Metadata, Content Files, Ingest Pipeline, four Solr instances]
Bash Rocks
• Remote Solr stop/start scripts
• Remote Indexer stop/start scripts
• Performance Monitoring
• Content Extraction scripts (+Java)
• Ingestor Scripts (+Java)
• Artifact Deployment (CM)
Make it easy to change approach
Make it easy to change sharding
public void run(Map options, List<SolrInputDocument> docs)
        throws InstantiationException, IllegalAccessException, ClassNotFoundException {
    IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
            "com.o19s.solr.ModShardIndexStrategy").newInstance();
    indexStrategy.configure(options);
    for (SolrInputDocument doc : docs) {
        indexStrategy.addDocument(doc);
    }
}
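The slide names com.o19s.solr.ModShardIndexStrategy but does not show it. The following is only a sketch of what a mod-based strategy might look like, assuming documents are routed by the hash of their id modulo the shard count. The interface shape, the "shardCount" option name, and the use of plain maps in place of SolrInputDocument (to keep the example dependency-free) are all assumptions, not the project's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical strategy interface; the one used in the talk is not shown. */
interface IndexStrategy {
    void configure(Map<String, String> options);
    void addDocument(Map<String, Object> doc);
}

/** Routes each document to shard = hash(id) mod shardCount. */
class ModShardIndexStrategy implements IndexStrategy {
    private int shardCount = 1;
    // One in-memory bucket per shard stands in for a per-shard Solr client.
    private final List<List<Map<String, Object>>> buckets = new ArrayList<>();

    @Override
    public void configure(Map<String, String> options) {
        // "shardCount" is an assumed option name.
        shardCount = Integer.parseInt(options.getOrDefault("shardCount", "1"));
        buckets.clear();
        for (int i = 0; i < shardCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    /** floorMod keeps the result non-negative even for negative hash codes. */
    static int shardFor(String id, int shardCount) {
        return Math.floorMod(id.hashCode(), shardCount);
    }

    @Override
    public void addDocument(Map<String, Object> doc) {
        String id = String.valueOf(doc.get("id"));
        buckets.get(shardFor(id, shardCount)).add(doc);
    }

    /** Returns the documents routed to the given shard so far. */
    List<Map<String, Object>> docsOnShard(int shard) {
        return buckets.get(shard);
    }
}
```

Because the caller only sees the interface, a different routing scheme (by date or by source, say) could be swapped in by changing the class name string.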
Separate JVM from Solr Cores
• Step 1: Fire up empty Solr instances on all the servers (nohup &).
• Step 2: Verify they started cleanly.
• Step 3: Create Cores (curl "http://search1.o19s.com:8983/solr/admin/cores?action=CREATE&name=run2")
• Step 4: Create an “aggregator” core, passing in the URLs of the Cores. (&property.shards=)
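Steps 3 and 4 can be sketched as URL construction. This is a rough illustration, assuming the standard CoreAdmin endpoint at /admin/cores and shard entries written as host:port/path; the class and method names are hypothetical, and a real CREATE call usually needs further parameters (such as instanceDir) that the slide leaves out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;

/** Hypothetical helper that builds CoreAdmin URLs for the steps above. */
class CoreAdminUrls {

    /** Step 3: URL that creates a core with the given name. */
    static String createCoreUrl(String solrBase, String coreName) {
        return solrBase + "/admin/cores?action=CREATE&name=" + enc(coreName);
    }

    /**
     * Step 4: URL that creates the aggregator core and hands it the
     * comma-separated shard list via a core property named "shards".
     */
    static String createAggregatorUrl(String solrBase, String coreName, List<String> shardUrls) {
        String shards = String.join(",", shardUrls);
        return createCoreUrl(solrBase, coreName) + "&property.shards=" + enc(shards);
    }

    /** URL-encodes a parameter value as UTF-8. */
    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

Generating the shard list in code means the number of cores per run can change without anyone hand-editing a long URL.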
Go Wide Quickly
[Diagram: first, one server (search1.o19s.com) runs several Solr JVMs on ports 8983, 8984, and 8985, each hosting a group of shards (labelled shard1, shard8, shard12); then the groups are spread across search1.o19s.com, search2.o19s.com, and search3.o19s.com, each on port 8983.]
Simple Pipeline
• Simple pipeline
• mv is atomic
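The "mv is atomic" point holds when source and destination are on the same file system: a stage writes under a temporary name and renames the finished file into the next stage's directory, so the next stage never sees a half-written file. Below is a minimal sketch of that hand-off in Java; the class, method, and directory names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: write under a temporary name, then rename into place. */
class AtomicHandoff {

    /**
     * Writes content to workDir/name.tmp, then atomically moves it to readyDir/name.
     * Both directories are assumed to be on the same file system; otherwise
     * ATOMIC_MOVE fails with AtomicMoveNotSupportedException.
     */
    static Path publish(Path workDir, Path readyDir, String name, String content) {
        try {
            Path tmp = workDir.resolve(name + ".tmp");
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            return Files.move(tmp, readyDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a scratch directory containing work/ and ready/ subdirectories. */
    static Path newScratchDir() {
        try {
            Path root = Files.createTempDirectory("pipeline");
            Files.createDirectory(root.resolve("work"));
            Files.createDirectory(root.resolve("ready"));
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A consumer polling the ready directory can then treat every file it finds there as complete.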
Don’t Move Files
• SCP across machines is slow/error prone
• NFS share, single point of failure.
• Clustered file system like GFS (Global File System) can have “fencing” issues
• HDFS shines here.
• ZooKeeper shines here.
Can you test your changes?
JVM tuning is a black art
-verbose:gc
-XX:+PrintGCDetails
-server
-Xmx8G
-Xms8G
-XX:MaxPermSize=256m
-XX:PermSize=256m
-XX:+AggressiveHeap
-XX:+DisableExplicitGC
-XX:ParallelGCThreads=16
-XX:+UseParallelOldGC
Run, don’t Walk
Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
Using Solr as key/value store
[Diagram: Metadata, Content Files, Ingest Pipeline, four Solr instances, Solr Key/Value Cache]
• Thousands of queries per second without real-time get.
• How fast with real-time get?
http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html
http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
Push schema definition to the application
• Not “schema less”
• Just different owner of schema!
• Schema may have common set of fields like id, type, timestamp, version
• Nothing required.
q=intensity_i:[0 TO 70]&fq=TYPE:streetlamp_monitor
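One way to read "different owner of schema" is that the application picks each field's type by naming it to match a dynamic-field pattern, as intensity_i does in the query above. The sketch below assumes the suffix conventions of the stock Solr example schema (*_i, *_l, *_f, *_d, *_b, *_dt, *_s) and treats the common fields listed on the slide as fixed names. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Date;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical mapper from application field names to dynamic-field names. */
class DynamicFieldNamer {
    // Common fields named on the slide; these keep their plain names.
    private static final Set<String> COMMON =
            new HashSet<>(Arrays.asList("id", "type", "timestamp", "version"));

    /** Picks a suffix from the Java type of the value (assumed conventions). */
    static String solrFieldName(String name, Object value) {
        if (COMMON.contains(name)) return name;
        if (value instanceof Integer) return name + "_i";
        if (value instanceof Long) return name + "_l";
        if (value instanceof Float) return name + "_f";
        if (value instanceof Double) return name + "_d";
        if (value instanceof Boolean) return name + "_b";
        if (value instanceof Date) return name + "_dt";
        return name + "_s"; // fall back to a string field
    }

    /** Renames every field of an application document, preserving order. */
    static Map<String, Object> toSolrFields(Map<String, Object> appDoc) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : appDoc.entrySet()) {
            out.put(solrFieldName(e.getKey(), e.getValue()), e.getValue());
        }
        return out;
    }
}
```

With a mapping like this, a new document type needs no change on the Solr side as long as its fields fit the existing patterns.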
Don’t do expensive things in Solr
• Tika content extraction aka Solr Cell
• UpdateRequestProcessorChain
Beware JavaBin
[Diagram: Metadata, Content Files, Ingest Pipeline, four Solr instances, Solr Key/Value Cache; labels: Solr 3.4, Solr 4]
Which SolrJ version do I use?
No JavaBin
• Avoid Jarmaggeddon
• Reflection? Ugh.
Give me /update/avro!
Avro!
• Supports serialization of data readable from multiple languages
• It’s smart XML, w/o the XML!
• Handles forward and backward compatibility between versions of an object
• Compact and fast to read.
Avro!
[Diagram: Metadata, Content Files, Ingest Pipeline, four Solr instances, Solr Key/Value Cache; label: .avro]
Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
Upgrade Lucene Indexes Easily
• Don’t reindex!
• Try out new versions of Lucene-based search engines.
David Lyle
java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
Indexing is Easy and Quick
CHEAP AND CHEERFUL
NRT versus BigData
The tension between scale and update rate
[Chart: scale from 10 million to 100’s of millions of documents, with a region labelled “Bad Place”]
Grim Reaper
Delayed Replication
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://localhost:8983/solr/replication</str>
    <str name="pollInterval">36:00:00</str>
  </lst>
</requestHandler>
Enable/Disable
• SOLR-3301
Enable/Disable
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">MY HARD QUERY</str>
    <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_2</str>
  </lst>
  <lst name="defaults">
    <str name="echoParams">all</str>
  </lst>
  <str name="healthcheckFile">server-enabled.txt</str>
</requestHandler>
Provisioning
• Chef/Puppet
• ZooKeeper
• Have you versioned everything to build an index over again?
TRADITIONAL ENVIRONMENT
POOLED ENVIRONMENT
think Cloud!
Do I need Failover?
• Can I build quickly?
• Do I have a reliable cluster of servers?
• Am I spread across data centers?
• Failover is sooo 90’s....
Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
One more thought...
Measuring the impact of our algorithm changes is just getting harder with Big Data.
Project SolrPanl
Thank you!
Questions?
• @dep4b
• www.opensourceconnections.com