Scaling Solr with SolrCloud
Rafał Kuć – Sematext Group, Inc. @kucrafal @sematext sematext.com
Tá mé… (Irish: "I am…")
Sematext consultant & engineer, Solr.pl co-founder, father and husband
Solr History
Y. Seeley creates Solr (2004)
Solr donated to ASF (2006)
Incubator graduation (2007)
Solr 1.3 released (2008)
Solr 1.4 released (2009)
Lucene / Solr merge (2010)
Solr 4.0 released (2012)
Solr 4.1 released (2013), and counting
The Past
Master – Slave Deployment
[diagram: the application indexes to a single Solr master, which replicates the index to multiple Solr slaves that serve queries]
Master as SPOF
[diagram: when the master fails, the slaves keep serving queries but indexing stops - the master is a single point of failure]
Replication Time
[diagram: the indexing application writes to the master; the querying application reads from the slaves, which only see new documents after the replication interval]
Too Much for a Single Shard
[diagram: data is split manually across multiple master-slave trees, one master with its slaves per shard]
Too Much for a Single Shard
[diagram: the application must route each document to the right master (shard1, shard2, shard3) itself, and must merge the responses from all masters itself]
Querying in Multi Master Deployment
[diagram: the application queries the slaves of shard 1, shard 2 and shard 3 separately and merges the results itself]
SolrCloud Comes Into Play
Basic Glossary
https://cwiki.apache.org/confluence/display/solr/SolrCloud+Glossary
Cluster, Node, Collection, Shard, Leader & Replica, Overseer
Apache ZooKeeper
A quorum (a majority of the ensemble) is required.
Sample configuration (zoo.cfg):
clientPort=2181
dataDir=/usr/share/zookeeper/data
tickTime=2000
initLimit=10
syncLimit=5
server.1=192.168.1.1:2888:3888
server.2=192.168.1.2:2888:3888
server.3=192.168.1.3:2888:3888
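Besides zoo.cfg, every ensemble member needs a `myid` file in its `dataDir` whose content matches its `server.N` line. A minimal sketch, using a `/tmp` path purely for illustration (on the servers above it would be `/usr/share/zookeeper/data/myid`):

```shell
# Write this node's id into myid inside dataDir; the number must match
# the server.N entry for this host in zoo.cfg.
ZK_DATA=/tmp/zookeeper-data          # illustrative path, not the real dataDir
mkdir -p "$ZK_DATA"
echo 1 > "$ZK_DATA/myid"             # use "2" on 192.168.1.2, "3" on 192.168.1.3
cat "$ZK_DATA/myid"
```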
Solr Instances
[diagram: four Solr servers connect to a three-node ZooKeeper ensemble]
Each Solr instance is pointed at the ensemble on startup:
-DzkHost=192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181
Collection Creation
[diagram: the configuration is uploaded to ZooKeeper, then the collection is created through any Solr server]
$ cloud-scripts/zkcli.sh -cmd upconfig -zkhost 192.168.1.2:2181 -confdir /usr/share/config/revolution/conf -confname revolution
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=revolution&numShards=2&replicationFactor=1'
Single Collection Deployment
[diagram: the collection's Shard1 and Shard2 live on different Solr servers; the application can send requests to any of them]
Collection with Replica
[diagram: the collection is created through any Solr server, coordinated via ZooKeeper]
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=revolution&numShards=2&replicationFactor=2'
Collection with Replicas
[diagram: Shard1 and Shard2 each have a leader and a replica, spread across four Solr servers; the application can talk to any server]
Querying
[diagram: phase one - the server that receives the query scatters it to one replica of each shard (Shard1, Shard2) and collects document ids and scores]
Querying
[diagram: phase two - the server fetches the full documents for the top-scoring ids from Shard1 and Shard2 and returns the merged results to the application]
Shard and Replica Number
Consider: how your data looks, expected data growth, target performance, target node number
Max number of nodes = number of shards * (number of replicas + 1)
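The formula above can be worked through directly; a quick sanity check in shell arithmetic, using the earlier example of 2 shards with 1 extra replica per shard:

```shell
# Max nodes that can each hold exactly one shard copy:
# number of shards * (number of replicas + 1), where "replicas" counts
# the extra copies besides the leader.
shards=2
replicas=1
max_nodes=$(( shards * (replicas + 1) ))
echo "$max_nodes"    # → 4
```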
What should I go for?
More data? Add shards.
More queries? Add replicas.
Custom Routing
Default, hash-based routing (used when numShards is present; pre 4.5)
Implicit routing (used when numShards is not present; pre 4.5)
Custom Routing Example
[diagram: documents id=userA!1, id=userA!2 and id=userB!3 are routed by the prefix before '!' - the userA documents land on one shard, userB on another]
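The router hashes the part of the id before `!` to pick the shard, so all documents sharing a prefix co-locate. A minimal sketch of splitting such a composite id with shell parameter expansion (the variable names are illustrative):

```shell
# Split a composite document id "<routeKey>!<docId>": the route key
# (everything before the first '!') is what drives shard selection.
id="userA!1"
route_key=${id%%!*}   # "userA" - hashed to choose the shard
doc_part=${id#*!}     # "1"    - the per-user document id
echo "$route_key $doc_part"
```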
Querying Solr – Default Routing
[diagram: with default routing a query is sent to all eight shards of the collection]
Querying Solr – Custom Routing
q=revolution&_route_=userA!
[diagram: with _route_ the query is sent only to the shard(s) holding userA's documents]
Collection Manipulation Commands: Create, Delete, Reload, Split, Create Alias, Delete Alias, Shard Creation/Deletion
http://wiki.apache.org/solr/SolrCloud
Collection Creation
Parameters: name, numShards, replicationFactor, maxShardsPerNode, createNodeSet, collection.configName
Collection Split Example
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=2&replicationFactor=1'
$ curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1'
Collection Aliasing
$ curl 'http://solr1:8983/solr/admin/collections?action=CREATEALIAS&name=weekly&collections=20131107,20131108,20131109,20131110,20131111,20131112,20131113'
$ curl 'http://solr1:8983/solr/admin/collections?action=DELETEALIAS&name=weekly'
$ curl 'http://solr1:8983/solr/weekly/select?q=revolution'
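The comma-separated list for a rolling "weekly" alias like the one above can be generated rather than typed. A sketch, assuming daily collection names follow the YYYYMMDD scheme and GNU `date` is available:

```shell
# Build a comma-separated list of the last 7 daily collection names,
# oldest first, suitable for the collections= parameter of CREATEALIAS.
collections=""
for i in 6 5 4 3 2 1 0; do
  collections="$collections,$(date -d "-$i day" +%Y%m%d)"
done
collections=${collections#,}   # drop the leading comma
echo "$collections"
```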
Caches
Solr Cache
Refreshed with each new IndexSearcher; configurable; different purposes; different implementations
Filter Cache
q=*:*&fq={!cache=false}city:Dublin
q=*:*&fq={!frange l=0 u=10 cache=false cost=200}sum(price,pro)
q=lucene+revolution&fq=city:Dublin   (the city:Dublin filter is cached and reused)
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128" />
q=lucene+revolution+city:Dublin   (the filter is folded into the main query - nothing is cached)
Document Cache
<documentCache class="solr.LRUCache" size="512" initialSize="512" />
Query Result Cache
q=lucene+revolution&fq=city:Dublin&sort=date+desc&start=0&rows=10
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
q=lucene+revolution+city:Dublin&sort=date+desc&start=0&rows=10   (a different query string - the cached result for the fq variant is not reused)
<queryResultWindowSize>20</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
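With `queryResultWindowSize` set to 20, a request for the first 10 rows fetches and caches a whole window of 20 result positions, so the next page is served from the cache. A sketch of that arithmetic (variable names are illustrative):

```shell
# With a result window of 20, a request for start=10&rows=10 still falls
# inside the window cached by the earlier start=0&rows=10 request.
window=20
start=10
rows=10
end=$(( start + rows ))
if [ "$end" -le "$window" ]; then
  echo "page served from the cached window"
fi
```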
Warming
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">date desc</str></lst>
    <lst><str name="q">keywords:* OR tags:*</str></lst>
    <lst><str name="q">*:*</str><str name="fq">active:*</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">date desc</str></lst>
    <lst><str name="q">keywords:* OR tags:*</str></lst>
    <lst><str name="q">*:*</str><str name="fq">active:*</str></lst>
  </arr>
</listener>
<useColdSearcher>false</useColdSearcher>
The Right Directory
[diagram: index segment files on disk: _0.fdt _0.fdx _0.fnm _0.nvd, _1.fdt _1.fdx _1.fnm _1.nvd, ...]
StandardDirectory, SimpleFSDirectory, NIOFSDirectory, MMapDirectory, NRTCachingDirectory, RAMDirectory
<directoryFactory name="DirectoryFactory" class="solr.NRTCachingDirectoryFactory" />
Column oriented fields - DocValues
<field name="categories" type="string" indexed="false" stored="false" multiValued="true" docValues="true"/>
<field name="categories" type="string" indexed="false" stored="false" multiValued="true" docValues="true" docValuesFormat="Disk"/>
NRT compatible; better compression than the field cache; can store data outside of the JVM heap; can improve things for dynamic indices
Segment Merge
[diagram: small level-0 segments (a, b, c, d, e) are merged into larger level-1 segments]
Segment Merge Under Control
Merge policy, merge scheduler, merge factor, merge policy configuration
Configuring Segment Merge
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicy>
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
<mergeFactor>10</mergeFactor>
<mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>
Indexing Throughput Tuning
Maximum indexing threads, RAM buffer size, maximum buffered documents, bulks (and more bulks), CloudSolrServer, autocommit, cutting off unnecessary stuff
TransactionLog
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>
Update durability; recovering-peer replay; performant Realtime Get
<requestHandler name="/get" class="solr.RealTimeGetHandler"> </requestHandler>
Autocommit or Not?
<autoCommit>
  <maxTime>15000</maxTime>
  <maxDocs>1000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
Automatic data flush; automatic index view refresh

Autocommit & openSearcher=true
<autoCommit>
  <maxDocs>10</maxDocs>
  <openSearcher>true</openSearcher>
</autoCommit>

AutoSoftCommit & openSearcher=false
<autoCommit>
  <maxDocs>1000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxDocs>10</maxDocs>
</autoSoftCommit>
Postings Formats to the Rescue
Lucene >= 4.0: flexible indexing; postings == docs, positions, payloads; different postings formats available
<codecFactory class="solr.SchemaCodecFactory" />
<field name="id" type="string_pulsing" indexed="true" stored="true" />
<fieldType name="string_pulsing" class="solr.StrField" postingsFormat="Pulsing41" />
Available formats include: Bloom, Pulsing, SimpleText, Direct, Memory
Monitoring
Cluster state, node utilization, memory usage, cache utilization, query response time, warmup times, garbage collector work
JMX and Solr
Administration Panel
Monitoring with SPM
Other Monitoring Tools
Ganglia http://ganglia.sourceforge.net/
New Relic http://www.newrelic.com/
Opsview http://www.opsview.com
We Are Hiring !
Dig Search? Dig Analytics? Dig Big Data? Dig Performance? Dig working with and in open-source? We're hiring world-wide! http://sematext.com/about/jobs.html
Rafał Kuć @kucrafal [email protected] Sematext @sematext http://sematext.com http://blog.sematext.com SPM discount code: LR2013SPM20
Thank You !
@ Sematext booth ;)