71

Solr Anti - patterns

Embed Size (px)

Citation preview

Solr Anti - patterns Rafał Kuć, Sematext Group, Inc.

@kucrafal

@sematext

http://sematext.com

About me

Sematext consultant & engineer

Solr.pl co-founder

Father & husband

The (not so) perfect migration

http://en.wikipedia.org/wiki/Bird_migration

http://www.likesbooks.com/aarafterhours/?p=750

From 3.1 to 4.10 (and hopefully not back)

March 2011 September 2014

The lonely solrconfig.xml

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" /> <requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler" /> <requestHandler name="/update/csv" class="solr.CSVRequestHandler" /> <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler" />

<luceneMatchVersion>LUCENE_31</luceneMatchVersion>

<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

DOC

DOC

DOC

And faulty indexing

EXCEPTIONS :)

And faulty indexing

<?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <int name="status">400</int> <int name="QTime">0</int> </lst> <lst name="error"> <str name="msg">missing content stream</str> <int name="code">400</int> </lst> </response>

109173 [qtp1223685984-20] ERROR org.apache.solr.core.SolrCore ľ org.apache.solr.common.SolrException: missing content stream at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:647) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Unknown Source)

Let’s make that right

<requestHandler name="/update" class="solr.UpdateRequestHandler" /> <requestHandler name="/update/json" class="solr.UpdateRequestHandler"> <lst name="defaults"> <str name="stream.contentType">application/json</str> </lst> </requestHandler>

<luceneMatchVersion>LUCENE_4.10.0</luceneMatchVersion>

<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>

<nrtMode>true</nrtMode>

<updateLog> <str name="dir"> ${solr.ulog.dir:} </str> </updateLog>

The old schema.xml

<fieldType name="int" class="solr.IntField" omitNorms="true"/> <fieldType name="long" class="solr.LongField" omitNorms="true"/> <fieldType name="float" class="solr.FloatField" omitNorms="true"/> <fieldType name="double" class="solr.DoubleField" omitNorms="true"/> <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>

<fieldType name="int" class="solr.IntField" omitNorms="true"/> <fieldType name="long" class="solr.LongField" omitNorms="true"/> <fieldType name="float" class="solr.FloatField" omitNorms="true"/> <fieldType name="double" class="solr.DoubleField" omitNorms="true"/> <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/> <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>

The old schema.xml

The new schema.xml

<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/> <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/> <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/> <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/> <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/> <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0"/> <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" positionIncrementGap="0"/> <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0"/> <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/> <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>

Threads? What threads?

<Set name="ThreadPool"> <New class="org.eclipse.jetty.util.thread.QueuedThreadPool"> <Set name="minThreads">10</Set> <Set name="maxThreads">200</Set> <Set name="detailedDump">false</Set> </New> </Set>

I see deadlocks

Threads? What threads?

<Set name="ThreadPool"> <New class="org.eclipse.jetty.util.thread.QueuedThreadPool"> <Set name="minThreads">10</Set> <Set name="maxThreads">200</Set> <Set name="detailedDump">false</Set> </New> </Set>

OK, so now we can actually run queries

<Set name="ThreadPool"> <New class="org.eclipse.jetty.util.thread.QueuedThreadPool"> <Set name="minThreads">10</Set> <Set name="maxThreads">10000</Set> <Set name="detailedDump">false</Set> </New> </Set>

The ZooKeeper

The ZooKeeper

The ZooKeeper

The ZooKeeper

The ZooKeeper

The ZooKeeper – production

The ZooKeeper – production

-DzkHost=zk1:2181,zk2:2181,zk3:2181

The ZooKeeper – production

-DzkHost=zk1:2181,zk2:2181,zk3:2181

The ZooKeeper – production

-DzkHost=zk1:2181,zk2:2181,zk3:2181

The ZooKeeper – production

-DzkHost=zk1:2181,zk2:2181,zk3:2181

Let’s cache everything

<filterCache class="solr.LRUCache" size="1048576" initialSize="1048576" autowarmCount="524288"/>

<queryResultCache class="solr.LRUCache" size="1048576" initialSize="1048576" autowarmCount="524288"/> <documentCache class="solr.LRUCache"

size="1048576" initialSize="1048576" autowarmCount="0"/>

And now let’s look at the warmup times

And now let’s look at the warmup times

OK, show us the way „Mr. Consultant”

<filterCache class="solr.FastLRUCache" size="1024" initialSize="1024" autowarmCount="512"/>

<queryResultCache class="solr.LRUCache" size="16000" initialSize="16000" autowarmCount="8000"/> <documentCache class="solr.LRUCache"

size="16384" initialSize="16384" autowarmCount="0"/>

Let’s look at the warmup times again

Let’s look at the warmup times again

Bulks are for noobs

Application Application Application

Doc Doc Doc

Bulks are for noobs

Application Application Application

Doc Doc Doc

But let’s use bulks, just in case

But let’s use bulks, just in case

We need to refresh and hard commit

<autoCommit> <maxTime>1000</maxTime> <openSearcher>true</openSearcher> </autoCommit> <autoSoftCommit> <maxTime>1000</maxTime> </autoSoftCommit>

Maybe we should only refresh?

<autoCommit> <maxTime>60000</maxTime> <openSearcher>false</openSearcher> </autoCommit> <autoSoftCommit> <maxTime>1000</maxTime> </autoSoftCommit>

OK, let’s go easy with refreshing

<autoCommit> <maxTime>60000</maxTime> <openSearcher>false</openSearcher> </autoCommit> <autoSoftCommit> <maxTime>30000</maxTime> </autoSoftCommit>

But I really need all that data

curl -XGET 'localhost:8983/solr/select?q=*:*&start=3000000&rows=100'

<?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">9418</int> <lst name="params"> <str name="start">3000000</str> <str name="q">*:*</str> <str name="rows">100</str> </lst> </lst> <result name="response" numFound="3284000" start="3000000"> . . . </result> </response>

But I really need all that data

curl -XGET 'localhost:8983/solr/select?q=*:*&start=3000000&rows=100'

<?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">9418</int> <lst name="params"> <str name="start">3000000</str> <str name="q">*:*</str> <str name="rows">5</str> </lst> </lst> <result name="response" numFound="3284000" start="3000000"> . . . </result> </response>

But I really need all that data

<?xml version="1.0" encoding="UTF-8"?> <response> <lst name="error"> <str name="msg">java.lang.OutOfMemoryError: Java heap space</str> <str name="trace">java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:796) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:448) . . . Caused by: java.lang.OutOfMemoryError: Java heap space . . . </str> <int name="code">500</int> </lst> </response>

curl -XGET 'localhost:8983/solr/select?q=*:*&start=3000000&rows=100'

But I really need all that data

Query

But I really need all that data

But I really need all that data

But I really need all that data

Response

Use the scroll Luke

curl -XGET 'localhost:8983/solr/select?q=*:*&cursorMark=*&sort=score+desc,id+desc'

Use the scroll Luke

curl -XGET 'localhost:8983/solr/select?q=*:*&cursorMark=*&sort=score+desc,id+desc'

<?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">189</int> <lst name="params"> <str name="sort">score desc,id desc</str> <str name="q">*:*</str> <str name="cursorMark">*</str> </lst> </lst> <result name="response" numFound="3284000" start="0"> <doc> ... </doc> . . . </result> <str name="nextCursorMark">AoIIP4AAACY5OTk5OTA=</str> </response>

Use the scroll Luke

curl -XGET 'localhost:8983/solr/select?q=*:*&sort=score+desc,id+desc &cursorMark=AoIIP4AAACY5OTk5OTA='

Use the scroll Luke

curl -XGET 'localhost:8983/solr/select?q=*:*&sort=score+desc,id+desc &cursorMark=AoIIP4AAACY5OTk5OTA='

<?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">184</int> <lst name="params"> <str name="sort">score desc,id desc</str> <str name="q">*:*</str> <str name="cursorMark">AoIIP4AAACY5OTk5OTA=</str> </lst> </lst> <result name="response" numFound="3284000" start="0"> <doc> ... </doc> . . . </result> <str name="nextCursorMark">AoIIP4AAACY5OTk5ODE=</str> </response>

Limiting faceting, why bother?

curl -XGET 'localhost:8983/solr/select?q=*:*&facet=true&facet.field=tag&… facet.limit=-1&facet.mincount=0'

Limiting faceting, why bother?

curl -XGET 'localhost:8983/solr/select?q=*:*&facet=true&facet.field=tag&… facet.limit=-1&facet.mincount=0'

<?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">9967</int> <lst name="params"> ... </lst> </lst> <result name="response" numFound="3284000" start="0"> . . . </result> <lst name="facet_counts"> <lst name="facet_fields"> <lst name="tag"> ... </lst> </lst> </lst> </response>

Limiting faceting, why bother?

curl -XGET 'localhost:8983/solr/select?q=*:*&facet=true&facet.field=tag&… facet.limit=-1&facet.mincount=0'

<?xml version="1.0" encoding="UTF-8"?> <response> . . . <lst name="error"> <str name="msg">Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space</str> <str name="trace">org.apache.solr.common.SolrException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space . . . Caused by: java.lang.OutOfMemoryError: Java heap space at org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:685) . . . </str> <int name="code">500</int> </lst> </response>

Now let’s look at performance

Now let’s look at performance

Now let’s look at performance

Now let’s look at performance

Now let’s look at performance

Magic happens with small changes

curl -XGET 'localhost:8983/solr/select?q=*:*&facet=true&facet.field=tag&… facet.limit=100&facet.mincount=1'

Magic happens with small changes

Magic happens with small changes

Magic happens with small changes

Magic happens with small changes

Magic happens with small changes

Magic happens with small changes

Magic happens with small changes

Monitoring in production

http://sematext.com/spm/index.html

And remember…

<luceneMatchVersion> 3.1

</luceneMatchVersion>

Quick summary

http://www.soothetube.com/2013/12/29/thats-all-folks/

We are hiring!

Dig Search?

Dig Analytics?

Dig Big Data?

Dig Performance?

Dig Logging?

Dig working with and in open – source?

We’re hiring world – wide!

http://sematext.com/about/jobs.html