Scaling Solr with SolrCloud
Rafał Kuć – Sematext Group, Inc.
@kucrafal @sematext sematext.com
I am…
Sematext consultant & engineer
Solr.pl co-founder
Father and husband
Solr History
2004: Y. Seeley creates Solr
2006: Solr donated to ASF
2007: Incubator graduation
2008: Solr 1.3 released
2009: Solr 1.4 released
2010: Lucene / Solr merge
2012: Solr 4.0 released
2013: Solr 4.1 and counting
The Past
Master – Slave Deployment
[Diagram: one Application, one Solr Master, four Solr Slaves]
Master as SPOF
[Diagram: one Application, four Solr Slaves, a single Solr Master]
Replication Time
[Diagram: Indexing App, Solr Master, Solr Slaves, Querying App]
Too Much for a Single Shard
[Diagram: one Application, several Solr Masters, each with its own Solr Slaves]
Querying in Multi Master Deployment
[Diagram: Application querying Solr Slaves for Shard 1, Shard 2, and Shard 3; request labels: shard1, shard2, shard3; other labels: Doc, Response]
SolrCloud Comes Into Play
Basic Glossary
https://cwiki.apache.org/confluence/display/solr/SolrCloud+Glossary
Cluster
Node
Collection
Shard
Leader & Replica
Overseer
Apache ZooKeeper
Quorum is required
Sample configuration
clientPort=2181
dataDir=/usr/share/zookeeper/data
tickTime=2000
initLimit=10
syncLimit=5
server.1=192.168.1.1:2888:3888
server.2=192.168.1.2:2888:3888
server.3=192.168.1.3:2888:3888
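The quorum requirement comes down to simple arithmetic. A minimal sketch, assuming the usual majority rule for a ZooKeeper ensemble; the function names are illustrative and not part of any ZooKeeper API:

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble must contain at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be lost while a majority still remains."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

Under this rule the three-server ensemble in the sample configuration survives the loss of one server. A fourth server would not raise that number, which is why odd ensemble sizes are the common choice.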
[Diagram: three ZooKeeper instances]
Solr Instances
[Diagram: three ZooKeeper instances and four Solr Servers, each Solr Server started with one of the following settings]
-DzkHost=192.168.1.2:2181,192.168.1.1:2181,192.168.1.3:2181
-DzkHost=192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181
-DzkHost=192.168.1.3:2181,192.168.1.1:2181,192.168.1.2:2181
-DzkHost=192.168.1.3:2181,192.168.1.1:2181,192.168.1.2:2181
Collection Creation
[Diagram: three ZooKeeper instances and four Solr Servers]
$ cloud-scripts/zkcli.sh -cmd upconfig -zkhost 192.168.1.2:2181 -confdir /usr/share/config/revolution/conf -confname revolution
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=revolution&numShards=2&replicationFactor=1'
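When collections are created from a script rather than by hand, the CREATE call can be assembled programmatically. A minimal sketch that only builds the URL and sends nothing; the helper name `create_collection_url` is hypothetical, and the parameters are the ones used in the curl call above:

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor):
    """Build a Collections API CREATE URL (no request is sent)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


if __name__ == "__main__":
    print(create_collection_url("http://solr1:8983/solr", "revolution", 2, 1))
```

The resulting string can be passed to curl or to any HTTP client.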
Single Collection Deployment
[Diagram: Application and Solr Servers hosting Shard1 and Shard2]
Collection with Replica
[Diagram: three ZooKeeper instances and four Solr Servers]
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=revolution&numShards=2&replicationFactor=2'
Collection with Replicas
[Diagram: Application and Solr Servers hosting Shard1, Shard2, Shard1 Replica, and Shard2 Replica]
Querying
[Diagram, step 1: the Application sends QUERY to a Solr Server, which requests fl=id,score from Shard1 and Shard2; each shard returns id,score]
[Diagram, step 2: the Solr Server gets docs from Shard1 and Shard2 and returns Results to the Application]
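The two querying steps (collect id and score from every shard, then get docs only for the winners) can be sketched as follows. This is an illustration of the idea under simplified assumptions, not Solr's implementation; the data structures and function names are hypothetical:

```python
import heapq


def top_ids(shard_hits, rows):
    """Step 1: merge per-shard (id, score) lists and keep the best `rows`."""
    merged = [
        (score, doc_id, shard)
        for shard, hits in shard_hits.items()
        for doc_id, score in hits
    ]
    best = heapq.nlargest(rows, merged, key=lambda hit: hit[0])
    return [(doc_id, shard) for _score, doc_id, shard in best]


def fetch_docs(winners, shard_store):
    """Step 2: ask only the owning shard for each winning document."""
    return [shard_store[shard][doc_id] for doc_id, shard in winners]


if __name__ == "__main__":
    hits = {
        "shard1": [("a", 3.0), ("b", 1.0)],
        "shard2": [("c", 2.5), ("d", 0.5)],
    }
    store = {
        "shard1": {"a": {"id": "a"}, "b": {"id": "b"}},
        "shard2": {"c": {"id": "c"}, "d": {"id": "d"}},
    }
    print(fetch_docs(top_ids(hits, 2), store))
```

Only small (id, score) pairs travel in the first step; full documents are fetched for the final page alone.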
Shard and Replica Number
How your data looks
Expected data growth
Target performance
Target node number
Max number of nodes = number of shards * (number of replicas + 1)
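The formula can be checked quickly. A minimal sketch; the function names are illustrative. It assumes that "number of replicas" counts copies in addition to the leader, whereas the Collections API `replicationFactor` counts every copy of a shard, leader included:

```python
def max_nodes(num_shards: int, replicas_per_shard: int) -> int:
    """Nodes needed to host one core per node: each shard has a leader plus replicas."""
    return num_shards * (replicas_per_shard + 1)


def max_nodes_from_replication_factor(num_shards: int, replication_factor: int) -> int:
    """Same count, expressed with a replicationFactor that includes the leader."""
    return num_shards * replication_factor


if __name__ == "__main__":
    print(max_nodes(2, 1))
    print(max_nodes_from_replication_factor(2, 2))
```

Both calls describe the same layout: two shards, each with a leader and one replica, spread over at most four nodes.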
[Diagram: Shards and their Replicas]
What should I go for?
More data? More shards.
More queries? More replicas.
Custom Routing
Default (numShards present, pre 4.5)
Implicit (numShards not present, pre 4.5)
Custom Routing Example
[Diagram: two Solr Servers hosting Shard1 and Shard2; documents id=userA!1, id=userA!2, id=userB!3]
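The idea behind ids such as userA!1 is that the part before "!" decides where a document lives, so all documents of one user end up together. A simplified sketch of that idea; CRC32 is a stand-in chosen for illustration, since Solr uses its own hashing scheme, and the function names are hypothetical:

```python
import zlib


def route_key(doc_id: str) -> str:
    """Return the prefix before '!', or the whole id when there is no prefix."""
    prefix, sep, _rest = doc_id.partition("!")
    return prefix if sep else doc_id


def pick_shard(doc_id: str, num_shards: int) -> int:
    """Map a document to a shard index using only its route key."""
    # Stand-in hash for illustration only; this is not the hash Solr uses.
    return zlib.crc32(route_key(doc_id).encode("utf-8")) % num_shards


if __name__ == "__main__":
    for doc_id in ("userA!1", "userA!2", "userB!3"):
        print(doc_id, route_key(doc_id), pick_shard(doc_id, 2))
```

Documents sharing a prefix always get the same shard index here; documents with different prefixes may or may not share a shard.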
Querying Solr – Default Routing
[Diagram: Application querying a Solr Collection of eight shards, Shard 1 to Shard 8]
Querying Solr – Custom Routing
[Diagram: Application querying a Solr Collection of eight shards, Shard 1 to Shard 8]
q=revolution&_route_=userA!
Collection Manipulation Commands
Create
Delete
Reload
Split
Create Alias
Delete Alias
Shard Creation/Deletion
http://wiki.apache.org/solr/SolrCloud
Collection Creation
name
numShards
replicationFactor
maxShardsPerNode
createNodeSet
collection.configName
Collection Split Example
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=2&replicationFactor=1'
$ curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1'
Collection Aliasing
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATEALIAS&name=weekly&collections=20131107,20131108,20131109,20131110,20131111,20131112,20131113'
$ curl 'http://solr1:8983/solr/admin/collections?action=DELETEALIAS&name=weekly'
$ curl 'http://solr1:8983/solr/weekly/select?q=revolution'
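With daily collections, the list behind a weekly alias changes every day, so it is convenient to generate it. A minimal sketch that builds the CREATEALIAS URL without sending it; the helper names are hypothetical and the yyyymmdd naming follows the example above:

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def daily_collections(last_day: date, days: int = 7):
    """Names of the daily collections ending on `last_day`, oldest first."""
    first = last_day - timedelta(days=days - 1)
    return [(first + timedelta(days=i)).strftime("%Y%m%d") for i in range(days)]


def create_alias_url(base_url, alias, collections):
    """Build a Collections API CREATEALIAS URL (no request is sent)."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    # Keep commas readable instead of percent-encoding them.
    return f"{base_url}/admin/collections?{urlencode(params, safe=',')}"


if __name__ == "__main__":
    names = daily_collections(date(2013, 11, 13))
    print(create_alias_url("http://solr1:8983/solr", "weekly", names))
```

Running this daily with the current date and issuing the resulting call would keep the alias pointing at the latest seven collections, assuming those collections exist.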
Caches
q=lucene+revolution
fq=city:Dublin
Solr Cache
Refreshed with IndexSearcher
Configurable
Different purposes
Different implementations
Filter Cache
q=*:*&fq={!cache=false}city:Dublin
q=*:*&fq={!frange l=0 u=10 cache=false cost=200}sum(price,pro)
q=lucene+revolution&fq=city:Dublin
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128" />
q=lucene+revolution+city:Dublin
Document Cache
<documentCache class="solr.LRUCache" size="512" initialSize="512" />
Query Result Cache
q=lucene+revolution&fq=city:Dublin&sort=date+desc&start=0&rows=10
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
q=lucene+revolution+city:Dublin&sort=date+desc&start=0&rows=10
<queryResultWindowSize>20</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
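The window size controls how many top documents are collected for one cache entry, so that neighbouring pages of the same query can be served from it. A sketch of the rounding, assuming the requested range is rounded up to a multiple of queryResultWindowSize; the function name is illustrative:

```python
def cached_window(start: int, rows: int, window_size: int) -> int:
    """Number of top documents collected so nearby pages share one cache entry."""
    requested = start + rows
    if requested < window_size:
        return window_size
    # Round up to the next multiple of the window size.
    return ((requested - 1) // window_size + 1) * window_size


if __name__ == "__main__":
    for start in (0, 10, 20):
        print(start, cached_window(start, 10, 20))
```

With a window of 20, the first and second pages of ten results fall into the same window, while the third page needs a larger one.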
Warming
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">date desc</str></lst>
    <lst><str name="q">keywords:* OR tags:*</str></lst>
    <lst><str name="q">*:*</str><str name="fq">active:*</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">date desc</str></lst>
    <lst><str name="q">keywords:* OR tags:*</str></lst>
    <lst><str name="q">*:*</str><str name="fq">active:*</str></lst>
  </arr>
</listener>
<useColdSearcher>false</useColdSearcher>
The Right Directory
_0.fdt _0.fdx _0.fnm _0.nvd
_1.fdt _1.fdx _1.fnm _1.nvd
StandardDirectory
SimpleFSDirectory
NIOFSDirectory
MMapDirectory
NRTCachingDirectory
RAMDirectory
<directoryFactory name="DirectoryFactory" class="solr.NRTCachingDirectoryFactory" />
Column oriented fields - DocValues
<field name="categories" type="string" indexed="false" stored="false" multiValued="true" docValues="true"/>
<field name="categories" type="string" indexed="false" stored="false" multiValued="true" docValues="true" docValuesFormat="Disk"/>
NRT compatible
Better compression than field cache
Can store data outside of JVM heap
Can improve things for dynamic indices
Segment Merge
[Diagram: segments a to g arranged across Level 0 and Level 1]
Segment Merge Under Control
Merge policy
Merge scheduler
Merge factor
Merge policy configuration
Configuring Segment Merge
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicy>
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
<mergeFactor>10</mergeFactor>
<mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>
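To get a feel for what a merge factor of 10 means, the toy model below cascades segments through levels: whenever a level collects merge-factor segments, they are merged into one segment on the next level. This is a deliberately simplified, log-style model assumed for illustration; TieredMergePolicy selects merges by segment size and behaves differently:

```python
def segments_after_flushes(flushes: int, merge_factor: int = 10):
    """Return segment counts per level after `flushes` level-0 segments are written."""
    levels = []  # levels[i] = number of segments currently at level i
    for _ in range(flushes):
        level = 0
        while True:
            if level == len(levels):
                levels.append(0)
            levels[level] += 1
            if levels[level] < merge_factor:
                break
            # merge_factor segments at this level collapse into one at the next level.
            levels[level] = 0
            level += 1
    return levels


if __name__ == "__main__":
    for n in (9, 10, 25, 100):
        print(n, segments_after_flushes(n))
```

In this model a lower merge factor means fewer segments to search but more merge work during indexing, and a higher one the opposite.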
Indexing Throughput Tuning
Maximum indexing threads
RAM buffer size
Maximum buffered documents
Bulk, bulks and bulks
CloudSolrServer
Autocommit
Cutting off unnecessary stuff
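Sending documents in bulks rather than one by one is usually the cheapest of these optimizations to apply. A minimal, client-agnostic sketch of the batching itself; the function name and the batch size are illustrative, and how each batch is sent to Solr is left out:

```python
from itertools import islice


def batches(docs, batch_size):
    """Yield lists of at most `batch_size` documents from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    docs = ({"id": str(i)} for i in range(5))
    for batch in batches(docs, 2):
        print(len(batch), batch)
```

Because the input is consumed lazily, the same helper works for a list in memory or a generator reading from a file or database.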
Transaction Log
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>
Updates durability
Recovering peer replay
Performant Realtime Get
<requestHandler name="/get" class="solr.RealTimeGetHandler"></requestHandler>
Autocommit or Not?
<autoCommit>
  <maxTime>15000</maxTime>
  <maxDocs>1000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
Automatic data flush
Automatic index view refresh
Autocommit & openSearcher=true
<autoCommit>
  <maxDocs>10</maxDocs>
  <openSearcher>true</openSearcher>
</autoCommit>
AutoSoftCommit & openSearcher=false
<autoCommit>
  <maxDocs>1000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxDocs>10</maxDocs>
</autoSoftCommit>
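The difference between the two setups shows up when counting how often data is flushed and how often the index view is refreshed. An idealized count, assuming each maxDocs threshold fires exactly every N documents and ignoring maxTime and everything else a real server does; the function name is hypothetical:

```python
def commit_profile(docs, hard_max_docs, hard_opens_searcher, soft_max_docs=None):
    """Count flushes (hard commits) and index view refreshes for `docs` indexed documents."""
    hard = docs // hard_max_docs
    soft = docs // soft_max_docs if soft_max_docs else 0
    reopens = soft + (hard if hard_opens_searcher else 0)
    return {"flushes": hard, "searcher_reopens": reopens}


if __name__ == "__main__":
    # Autocommit every 10 documents with openSearcher=true.
    print(commit_profile(5000, 10, True))
    # Autocommit every 1000 documents, soft commit every 10 documents.
    print(commit_profile(5000, 1000, False, soft_max_docs=10))
```

In this model both setups refresh the view equally often, but the second one flushes to disk a hundred times less frequently.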
Postings Formats to the Rescue
Lucene >= 4.0: flexible indexing
Postings == docs, positions, payloads
Different postings formats available
<codecFactory class="solr.SchemaCodecFactory" />
<field name="id" type="string_pulsing" indexed="true" stored="true" />
<fieldType name="string_pulsing" class="solr.StrField" postingsFormat="Pulsing41" />
Bloom
Pulsing
Simple text
Direct
Memory
Monitoring
Cluster state
Node utilization
Memory usage
Cache utilization
Query response time
Warmup times
Garbage collector work
JMX and Solr
Administration Panel
Monitoring with SPM
Other Monitoring Tools
Ganglia http://ganglia.sourceforge.net/
New Relic http://www.newrelic.com/
Opsview http://www.opsview.com
We Are Hiring!
Dig Search?
Dig Analytics?
Dig Big Data?
Dig Performance?
Dig working with and in open source?
We’re hiring worldwide!
http://sematext.com/about/jobs.html
Rafał Kuć @kucrafal
Sematext @sematext http://sematext.com http://blog.sematext.com
SPM discount code: LR2013SPM20
Thank You!
@ Sematext booth ;)