Scaling Solr with SolrCloud
Rafał Kuć – Sematext Group, Inc. @kucrafal @sematext sematext.com
Tá mé… (Irish: "I am…")
Sematext consultant & engineer, Solr.pl co-founder, father and husband
Solr History
Y. Seeley creates Solr (2004)
Solr donated to ASF (2006)
Incubator graduation (2007)
Solr 1.3 released (2008)
Solr 1.4 released (2009)
Lucene / Solr merge (2010)
Solr 4.0 released (2012)
Solr 4.1 released (2013), and counting
The Past
Master – Slave Deployment
[diagram: the application indexes to a single Solr master, which replicates the index to multiple Solr slaves that serve queries]
Master as SPOF
[diagram: when the master fails, the slaves keep serving queries but indexing stops - the master is a single point of failure]
Replication Time
[diagram: the indexing application writes to the master; the querying application reads from the slaves, which only see new documents after the replication interval]
Too Much for a Single Shard
[diagram: data is split manually across multiple master-slave trees, one master with its slaves per shard]
Too Much for a Single Shard
[diagram: the application must route each document to the right master (shard1, shard2, shard3) itself, and must merge the responses from all masters itself]
Querying in Multi Master Deployment
[diagram: the application queries the slaves of shard 1, shard 2 and shard 3 separately and merges the results itself]
SolrCloud Comes Into Play
Basic Glossary
https://cwiki.apache.org/confluence/display/solr/SolrCloud+Glossary
Cluster, Node, Collection, Shard, Leader & Replica, Overseer
Apache ZooKeeper
A quorum (a majority of the ensemble) is required.
Sample configuration (zoo.cfg):
clientPort=2181
dataDir=/usr/share/zookeeper/data
tickTime=2000
initLimit=10
syncLimit=5
server.1=192.168.1.1:2888:3888
server.2=192.168.1.2:2888:3888
server.3=192.168.1.3:2888:3888
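Besides zoo.cfg, every ensemble member needs a `myid` file in its `dataDir` whose content matches its `server.N` line. A minimal sketch, using a `/tmp` path purely for illustration (on the servers above it would be `/usr/share/zookeeper/data/myid`):

```shell
# Write this node's id into myid inside dataDir; the number must match
# the server.N entry for this host in zoo.cfg.
ZK_DATA=/tmp/zookeeper-data          # illustrative path, not the real dataDir
mkdir -p "$ZK_DATA"
echo 1 > "$ZK_DATA/myid"             # use "2" on 192.168.1.2, "3" on 192.168.1.3
cat "$ZK_DATA/myid"
```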
Solr Instances
[diagram: four Solr servers connect to a three-node ZooKeeper ensemble]
Each Solr instance is pointed at the ensemble on startup:
-DzkHost=192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181
Collection Creation
[diagram: the configuration is uploaded to ZooKeeper, then the collection is created through any Solr server]
$ cloud-scripts/zkcli.sh -cmd upconfig -zkhost 192.168.1.2:2181 -confdir /usr/share/config/revolution/conf -confname revolution
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=revolution&numShards=2&replicationFactor=1'
Single Collection Deployment
[diagram: the collection's Shard1 and Shard2 live on different Solr servers; the application can send requests to any of them]
Collection with Replica
[diagram: the collection is created through any Solr server, coordinated via ZooKeeper]
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=revolution&numShards=2&replicationFactor=2'
Collection with Replicas
[diagram: Shard1 and Shard2 each have a leader and a replica, spread across four Solr servers; the application can talk to any server]
Querying
[diagram: phase one - the server that receives the query scatters it to one replica of each shard (Shard1, Shard2) and collects document ids and scores]
Querying
[diagram: phase two - the server fetches the full documents for the top-scoring ids from Shard1 and Shard2 and returns the merged results to the application]
Shard and Replica Number
Consider: how your data looks, expected data growth, target performance, target node number
Max number of nodes = number of shards * (number of replicas + 1)
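The formula above can be worked through directly; a quick sanity check in shell arithmetic, using the earlier example of 2 shards with 1 extra replica per shard:

```shell
# Max nodes that can each hold exactly one shard copy:
# number of shards * (number of replicas + 1), where "replicas" counts
# the extra copies besides the leader.
shards=2
replicas=1
max_nodes=$(( shards * (replicas + 1) ))
echo "$max_nodes"    # → 4
```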
What should I go for?
More data? Add shards.
More queries? Add replicas.
Custom Routing
Default, hash-based routing (used when numShards is present; pre 4.5)
Implicit routing (used when numShards is not present; pre 4.5)
Custom Routing Example
[diagram: documents id=userA!1, id=userA!2 and id=userB!3 are routed by the prefix before '!' - the userA documents land on one shard, userB on another]
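The router hashes the part of the id before `!` to pick the shard, so all documents sharing a prefix co-locate. A minimal sketch of splitting such a composite id with shell parameter expansion (the variable names are illustrative):

```shell
# Split a composite document id "<routeKey>!<docId>": the route key
# (everything before the first '!') is what drives shard selection.
id="userA!1"
route_key=${id%%!*}   # "userA" - hashed to choose the shard
doc_part=${id#*!}     # "1"    - the per-user document id
echo "$route_key $doc_part"
```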
Querying Solr – Default Routing
[diagram: with default routing a query is sent to all eight shards of the collection]
Querying Solr – Custom Routing
q=revolution&_route_=userA!
[diagram: with _route_ the query is sent only to the shard(s) holding userA's documents]
Collection Manipulation Commands: Create, Delete, Reload, Split, Create Alias, Delete Alias, Shard Creation/Deletion
http://wiki.apache.org/solr/SolrCloud
Collection Creation
Parameters: name, numShards, replicationFactor, maxShardsPerNode, createNodeSet, collection.configName
Collection Split Example
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=2&replicationFactor=1'
$ curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1'
Collection Aliasing
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATEALIAS&name=weekly&collections=20131107,20131108,20131109,20131110,20131111,20131112,20131113'
$ curl 'http://solr1:8983/solr/admin/collections?action=DELETEALIAS&name=weekly'
$ curl 'http://solr1:8983/solr/weekly/select?q=revolution'
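The comma-separated list for a rolling "weekly" alias like the one above can be generated rather than typed. A sketch, assuming daily collection names follow the YYYYMMDD scheme and GNU `date` is available:

```shell
# Build a comma-separated list of the last 7 daily collection names,
# oldest first, suitable for the collections= parameter of CREATEALIAS.
collections=""
for i in 6 5 4 3 2 1 0; do
  collections="$collections,$(date -d "-$i day" +%Y%m%d)"
done
collections=${collections#,}   # drop the leading comma
echo "$collections"
```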
Caches
Solr Cache
Refreshed with each new IndexSearcher; configurable; different purposes; different implementations
Filter Cache
q=*:*&fq={!cache=false}city:Dublin
q=*:*&fq={!frange l=0 u=10 cache=false cost=200}sum(price,pro)
q=lucene+revolution&fq=city:Dublin   (the city:Dublin filter is cached and reused)
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128" />
q=lucene+revolution+city:Dublin   (the filter is folded into the main query - nothing is cached)
Document Cache
<documentCache class="solr.LRUCache" size="512" initialSize="512" />
Query Result Cache
q=lucene+revolution&fq=city:Dublin&sort=date+desc&start=0&rows=10
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
q=lucene+revolution+city:Dublin&sort=date+desc&start=0&rows=10   (a different query string - the cached result for the fq variant is not reused)
<queryResultWindowSize>20</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
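With `queryResultWindowSize` set to 20, a request for the first 10 rows fetches and caches a whole window of 20 result positions, so the next page is served from the cache. A sketch of that arithmetic (variable names are illustrative):

```shell
# With a result window of 20, a request for start=10&rows=10 still falls
# inside the window cached by the earlier start=0&rows=10 request.
window=20
start=10
rows=10
end=$(( start + rows ))
if [ "$end" -le "$window" ]; then
  echo "page served from the cached window"
fi
```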
Warming
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">date desc</str></lst>
    <lst><str name="q">keywords:* OR tags:*</str></lst>
    <lst><str name="q">*:*</str><str name="fq">active:*</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">date desc</str></lst>
    <lst><str name="q">keywords:* OR tags:*</str></lst>
    <lst><str name="q">*:*</str><str name="fq">active:*</str></lst>
  </arr>
</listener>
<useColdSearcher>false</useColdSearcher>
The Right Directory
[diagram: index segment files on disk: _0.fdt _0.fdx _0.fnm _0.nvd, _1.fdt _1.fdx _1.fnm _1.nvd, ...]
StandardDirectory, SimpleFSDirectory, NIOFSDirectory, MMapDirectory, NRTCachingDirectory, RAMDirectory
<directoryFactory name="DirectoryFactory" class="solr.NRTCachingDirectoryFactory" />
Column oriented fields - DocValues
<field name="categories" type="string" indexed="false" stored="false" multiValued="true" docValues="true"/>
<field name="categories" type="string" indexed="false" stored="false" multiValued="true" docValues="true" docValuesFormat="Disk"/>
NRT compatible; better compression than the field cache; can store data outside of the JVM heap; can improve things for dynamic indices
Segment Merge
[diagram: small level-0 segments (a, b, c, d, e) are merged into larger level-1 segments]
Segment Merge Under Control
Merge policy, merge scheduler, merge factor, merge policy configuration
Configuring Segment Merge
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicy>
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
<mergeFactor>10</mergeFactor>
<mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>
Indexing Throughput Tuning
Maximum indexing threads, RAM buffer size, maximum buffered documents, bulks (and more bulks), CloudSolrServer, autocommit, cutting off unnecessary stuff
TransactionLog
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>
Update durability; recovering-peer replay; performant Realtime Get
<requestHandler name="/get" class="solr.RealTimeGetHandler"> </requestHandler>
Autocommit or Not?
<autoCommit>
  <maxTime>15000</maxTime>
  <maxDocs>1000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
Automatic data flush; automatic index view refresh

Autocommit & openSearcher=true
<autoCommit>
  <maxDocs>10</maxDocs>
  <openSearcher>true</openSearcher>
</autoCommit>

AutoSoftCommit & openSearcher=false
<autoCommit>
  <maxDocs>1000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxDocs>10</maxDocs>
</autoSoftCommit>
Postings Formats to the Rescue
Lucene >= 4.0: flexible indexing; postings == docs, positions, payloads; different postings formats available
<codecFactory class="solr.SchemaCodecFactory" />
<field name="id" type="string_pulsing" indexed="true" stored="true" />
<fieldType name="string_pulsing" class="solr.StrField" postingsFormat="Pulsing41" />
Available formats include: Bloom, Pulsing, SimpleText, Direct, Memory
Monitoring
Cluster state, node utilization, memory usage, cache utilization, query response time, warmup times, garbage collector work
JMX and Solr
Administration Panel
Monitoring with SPM
Other Monitoring Tools
Ganglia http://ganglia.sourceforge.net/
New Relic http://www.newrelic.com/
Opsview http://www.opsview.com
We Are Hiring !
Dig Search? Dig Analytics? Dig Big Data? Dig Performance? Dig working with and in open-source? We're hiring world-wide! http://sematext.com/about/jobs.html
Rafał Kuć @kucrafal [email protected] Sematext @sematext http://sematext.com http://blog.sematext.com SPM discount code: LR2013SPM20
Thank You !
@ Sematext booth ;)