Big Search 4 Big Data War Stories

Big Search 4 Big DataEnterprise Search Summit Europe 2013 London

Eric Pugh | [email protected] | @dep4b

1

mailto:[email protected]


Who am I?

• Principal of OpenSource Connections - Solr/Lucene Search Consultancy

• Member of Apache Software Foundation

• SOLR-284 UpdateRichDocuments (July 07)

• Fascinated by the art of software development

2

CO-AUTHORW

orking on 4.0!

3

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

war^

4

What is Big Search?5

http://www.flickr.com/photos/michaelgoodin/3256925948/



Background for Client X’s Project

• Big Data is any data set that is primarily at rest due to the difficulty of working with it.

• 100’s of millions of documents to search

• Aggressive timeline.

• All the data must be searched per query.

• Limited selection of tools available.

• On Solr 3.x line

6


• Prototyping



7

Boy meets Girl Story

Metadata

Content Files

IngestPipeline

SolrSolrSolrSolr

8

Bash Rocks

9

Bash Rocks

• Remote Solr stop/start scripts

• Remote Indexer stop/start scripts

• Performance Monitoring

• Content Extraction scripts (+Java)

• Ingestor Scripts (+Java)

• Artifact Deployment (CM)

10

Lesson: Don’t get

captured by your

environment

11

Make it easy to change approach

12

Make it easy to change sharding

IndexStrategy indexStrategy = (IndexStrategy) Class.forName( "com.o19s.solr.ModShardIndexStrategy").newInstance(); indexStrategy.configure(options); for (SolrInputDocument doc:docs){ indexStrategy.addDocument(doc); }

Lesson: Sharpen

your axe

13

Go Wide Quickly

14

shard1shard1shard1shard1 :8983



search1.o19s.com



search1.o19s.com


search2.o19s.com


search3.o19s.com

Lesson: Hardw

are

is cheap/devs

expensive

15

Why so many pipelines?

16

Simple Pipeline

• Simple pipeline

• mv is atomic

Lesson: Simple

Works

17

Don’t Move Files

• SCP across machines is slow/error prone

• NFS share, single point of failure.

• Clustered file system like GFS (Global File System) can have “fencing” issues

• HDFS shines here.

• ZooKeeper shines here.

• Map/Reduce

18

Can you test your changes?

19

JVM tuning is black art-verbose:gc-XX:+PrintGCDetails-server-Xmx8G-Xms8G-XX:MaxPermSize=256m-XX:PermSize=256m-XX:+AggressiveHeap-XX:+DisableExplicitGC-XX:ParallelGCThreads=16-XX:+UseParallelOldGC

20

21

Run, don’t WalkLesson: Testing

needs to be easy

22


• Prototyping

•Application Development


23

Using Solr as key/value store

Metadata

Content Files

IngestPipeline

SolrSolrSolrSolr

Solr Key/Value Cache

24

• thousands of queries per second without real time get.

• how fast with real time get?

http://localhost:8983/solr/select?q=id:DOC45242&fl=entities,html

http://localhost:8983/solr/get?id=DOC45242&fl=entities,html

Using Solr as key/value store

Lesson: Use w

hat

you have at hand

25

http://localhost:8983/solr/us_patent_grant/update/csv?skipLines=1&fieldnames=abstract,main-classification,invention-title,date-publ,description,id,further-classification,date-produced,claim-text&commit=true&optimize=true'








Don’t do expensive things in Solr

• Tika content extraction aka Solr Cell

• UpdateRequestProcessorChain

• Don’t duplicate work

26

Tika as a pipeline?

• Auto detects content type

• Metadata structure has all the key/value needed for Solr

• Allows us to scale up with Behemoth project.

• Ingest multiple XML formats as well as CSV and EDI

27


• Prototyping


•Maintaining Your Big Search Indexes

28

Indexing is Easy and Quick

29

CHEAP AND CHEERFUL

><

30

NRT versus BigData

31

The tension between scale and update rate

10 million 100’s of millionsBad Place

32

Grim Reaper33

Grim Reaper “Death of Mice”

Especially if you are on cloud platform. They implement their servers on the cheapest commodity hardware

Lesson: Embrace

failure, don’t fear

it

34

Provisioning

• Chef/Puppet

• ZooKeeper

• Have you versioned everything to build an index over again?

Lesson: Autom

ate

Everything!

35

TRADITIONAL ENVIRONMENT

36

POOLED ENVIRONMENTLesson: T

hink

Cloud

37

Building a Patents Index

0

75

150

225

300

5 days 3 days 30 Minutes

1 5

300

Mac

hine

Cou

nt

What happens when we want to index 2 million patents in 30 minutes?

38

Amazon AWS is Good but...

•EC2 is costly for your “base” load• Issues of access to internal data•Firewall and security

39

Do I need Failover?

• Can I build quickly?

• Do I have a reliable cluster of servers?

• Am I spread across data centers?

• Is sooo 90’s....

• Think shared nothing cluster!

Lesson: No!

40


• Prototyping



41

One more thought...

42

Measuring the impact of our algorithms

changes is just getting harder with Big Data.

43

www.solrpa.nl

Project SolrPanlProject SolrPanl

We need a

motivated beta

tester!

44

http://www.solrpa.nl

http://www.solrpa.nl

Thank you!

Questions?

• [email protected]

• @dep4b

• www.opensourceconnections.com

• slideshare.com/o19s

Nervous about speaking up? Ask

me later!

45



http://www.opensourceconnections.com




Technology

Big Search 4 Big Data War Stories