Upload
opensource-connections
View
504
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Some lessons that we learned in rolling out a search engine across a very big set of data.
Citation preview
Big Search 4 Big DataEnterprise Search Summit Europe 2013 London
Eric Pugh | [email protected] | @dep4b
1
Who am I?
• Principal of OpenSource Connections - Solr/Lucene Search Consultancy
• Member of Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 07)
• Fascinated by the art of software development
2
CO-AUTHORW
orking on 4.0!
3
Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
war^
4
What is Big Search?5
Background for Client X’s Project
• Big Data is any data set that is primarily at rest due to the difficulty of working with it.
• 100’s of millions of documents to search
• Aggressive timeline.
• All the data must be searched per query.
• Limited selection of tools available.
• On Solr 3.x line
6
Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
7
Boy meets Girl Story
Metadata
Content Files
IngestPipeline
SolrSolrSolrSolr
8
Bash Rocks
9
Bash Rocks
• Remote Solr stop/start scripts
• Remote Indexer stop/start scripts
• Performance Monitoring
• Content Extraction scripts (+Java)
• Ingestor Scripts (+Java)
• Artifact Deployment (CM)
10
Lesson: Don’t get
captured by your
environment
11
Make it easy to change approach
12
Make it easy to change sharding
IndexStrategy indexStrategy = (IndexStrategy) Class.forName( "com.o19s.solr.ModShardIndexStrategy").newInstance(); indexStrategy.configure(options); for (SolrInputDocument doc:docs){ indexStrategy.addDocument(doc); }
Lesson: Sharpen
your axe
13
Go Wide Quickly
14
shard1shard1shard1shard1 :8983
shard1shard1shard1shard8 :8984
shard1shard1shard1shard12 :8985
search1.o19s.com
shard1shard1shard1shard12 :8985
shard1shard1shard1shard1 :8983
search1.o19s.com
shard1shard1shard1shard8 :8983
search2.o19s.com
shard1shard1shard1shard12 :8983
search3.o19s.com
Lesson: Hardw
are
is cheap/devs
expensive
15
Why so many pipelines?
16
Simple Pipeline
• Simple pipeline
• mv is atomic
Lesson: Simple
Works
17
Don’t Move Files
• SCP across machines is slow/error prone
• NFS share, single point of failure.
• Clustered file system like GFS (Global File System) can have “fencing” issues
• HDFS shines here.
• ZooKeeper shines here.
• Map/Reduce
18
Can you test your changes?
19
JVM tuning is black art-verbose:gc-XX:+PrintGCDetails-server-Xmx8G-Xms8G-XX:MaxPermSize=256m-XX:PermSize=256m-XX:+AggressiveHeap-XX:+DisableExplicitGC-XX:ParallelGCThreads=16-XX:+UseParallelOldGC
20
21
Run, don’t WalkLesson: Testing
needs to be easy
22
Telling some stories
• Prototyping
•Application Development
• Maintaining Your Big Search Indexes
23
Using Solr as key/value store
Metadata
Content Files
IngestPipeline
SolrSolrSolrSolr
Solr Key/Value Cache
24
• thousands of queries per second without real time get.
• how fast with real time get?
http://localhost:8983/solr/select?q=id:DOC45242&fl=entities,html
http://localhost:8983/solr/get?id=DOC45242&fl=entities,html
Using Solr as key/value store
Lesson: Use w
hat
you have at hand
25
Don’t do expensive things in Solr
• Tika content extraction aka Solr Cell
• UpdateRequestProcessorChain
• Don’t duplicate work
26
Tika as a pipeline?
• Auto detects content type
• Metadata structure has all the key/value needed for Solr
• Allows us to scale up with Behemoth project.
• Ingest multiple XML formats as well as CSV and EDI
27
Telling some stories
• Prototyping
• Application Development
•Maintaining Your Big Search Indexes
28
Indexing is Easy and Quick
29
CHEAP AND CHEERFUL
><
30
NRT versus BigData
31
The tension between scale and update rate
10 million 100’s of millionsBad Place
32
Grim Reaper33
Grim Reaper “Death of Mice”
Especially if you are on cloud platform. They implement their servers on the cheapest commodity hardware
Lesson: Embrace
failure, don’t fear
it
34
Provisioning
• Chef/Puppet
• ZooKeeper
• Have you versioned everything to build an index over again?
Lesson: Autom
ate
Everything!
35
TRADITIONAL ENVIRONMENT
36
POOLED ENVIRONMENTLesson: T
hink
Cloud
37
Building a Patents Index
0
75
150
225
300
5 days 3 days 30 Minutes
1 5
300
Mac
hine
Cou
nt
What happens when we want to index 2 million patents in 30 minutes?
38
Amazon AWS is Good but...
•EC2 is costly for your “base” load• Issues of access to internal data•Firewall and security
39
Do I need Failover?
• Can I build quickly?
• Do I have a reliable cluster of servers?
• Am I spread across data centers?
• Is sooo 90’s....
• Think shared nothing cluster!
Lesson: No!
40
Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
41
One more thought...
42
Measuring the impact of our algorithms
changes is just getting harder with Big Data.
43
www.solrpa.nl
Project SolrPanlProject SolrPanl
We need a
motivated beta
tester!
44
Thank you!
Questions?
• @dep4b
• www.opensourceconnections.com
• slideshare.com/o19s
Nervous about speaking up? Ask
me later!
45