
Tuning Solr for Logs: Presented by Radu Gheorghe, Sematext




Tuning Solr for Logs

Radu Gheorghe

@radu0gheorghe @sematext

Page 4

Tuning. Is it worth it?

                    baseline   last run
# of logs           10M        310M
EC2 bill/month ($)  700        450

Page 5

What to optimize for?


capacity: how many logs the same hardware can keep while still providing decent performance

Page 6

What's decent performance? “It depends”

Assumptions

indexing: enough to keep up with generated logs*

search concurrency

search latency: 2s for debug queries, 5s for charts

*account for spikes!

Page 7

Enough theory, let's start testing!

Solr instance

m3.2xlarge (8CPU, 30GB RAM, 2x80GB SSD)

Solr 4.10.1

Feeder instance

c3.2xlarge (8CPU, 15GB RAM, 2x80GB SSD)

apache access logs

python script to parse and feed them

Page 8

Baseline test

15GB heap

debug query

status:404 in the last hour

charts query

all time status counters

all time top IPs

user agent word cloud

http://blog.sematext.com/2013/12/19/getting-started-with-logstash/
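Against Solr's HTTP API, the two test queries above might be built like this. A sketch only: the base URL and the field names (`status`, `timestamp`, `ip`, `user_agent`) are assumptions, not from the slides.

```python
from urllib.parse import urlencode

# Assumed Solr endpoint for this illustration
base = "http://localhost:8983/solr/collection1/select"

# Debug query: status:404 in the last hour (assumed timestamp field)
debug = urlencode({
    "q": "status:404 AND timestamp:[NOW-1HOUR TO NOW]",
    "rows": 20,
})

# Charts query: all-time facets for status counters, top IPs, user agent words
charts = urlencode({
    "q": "*:*",
    "rows": 0,
    "facet": "true",
}) + "&facet.field=status&facet.field=ip&facet.field=user_agent"

print(base + "?" + debug)
print(base + "?" + charts)
```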

Page 9

Baseline result

[Chart: indexing throughput (EPS) and debug/charts query latency vs. index size, 100K–10M logs]

Page 10

Baseline result

[Same chart as Page 9, with the capacity point annotated]

Page 11

Baseline result

[Same chart, annotated: capacity; bottleneck: facets eat CPU]

Page 12

Baseline result

[Same chart, annotated: capacity; bottleneck: facets eat CPU; on average, CPU is OK]

Page 13

Baseline result

[Same chart, annotated: capacity; bottleneck: facets eat CPU; on average, CPU is OK; indexing limited because the Python script eats feeder CPU]

Page 14

Indexing throughput: is it enough?

“it depends”

how long do you keep your logs?

1M logs/day × 10 days vs. 0.3M logs/day × 30 days: both need ~10M capacity

1M logs/day * 30 days? Needs 3 servers, each getting 0.3M logs/day

Baseline run: 10M index fills up in <1/2h at 7K EPS

Page 15

Indexing throughput: is it enough?

how big are your spikes? (assumption: 10x regular load)

7K EPS is enough for 10M capacity if you keep logs >5h
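The spike rule of thumb above can be sanity-checked with a little arithmetic. The 10x spike factor and the 7K EPS ceiling come from the slides; the rest is just dividing capacity by retention:

```python
def peak_eps_needed(capacity_logs: int, retention_hours: float,
                    spike_factor: int = 10) -> float:
    """Sustained EPS is capacity spread over the retention window;
    peak EPS is spike_factor times that (assumption: spikes are 10x regular load)."""
    sustained = capacity_logs / (retention_hours * 3600)
    return sustained * spike_factor

# 10M logs kept for 5 hours, with 10x spikes: ~5,556 EPS peak, under the 7K ceiling
print(round(peak_eps_needed(10_000_000, 5)))
```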

Page 16

[Chart: charts/debug latency and EPS vs. index size, 1.5M–11M logs]

Rare commits

10% above baseline

auto soft commits every 5 seconds

auto hard commits every 30 minutes

ramBufferSizeMB=200; maxBufferedDocs=10M
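In solrconfig.xml, the commit and buffer settings above would look roughly like this (a sketch following stock Solr 4.x config; `ramBufferSizeMB` and `maxBufferedDocs` live in `<indexConfig>`, the commit settings in `<updateHandler>`):

```xml
<indexConfig>
  <!-- flush the in-memory indexing buffer less often -->
  <ramBufferSizeMB>200</ramBufferSizeMB>
  <maxBufferedDocs>10000000</maxBufferedDocs>
</indexConfig>

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit every 30 minutes, without reopening searchers -->
  <autoCommit>
    <maxTime>1800000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit every 5 seconds for near-realtime visibility -->
  <autoSoftCommit>
    <maxTime>5000</maxTime>
  </autoSoftCommit>
</updateHandler>
```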

Page 17

Same results with

even rarer commits (auto-soft every 30s, 500MB buffer)

omitNorms + omitTermFreqAndPositions

larger caches

cache autowarming

THP disabled

mergeFactor 5

mergeFactor 20

but indexing was cheaper

manually ran queries, too

Page 18

[Chart: charts/debug latency and EPS vs. index size, 1.5M–12M logs]

DocValues on IP and status code

20% above baseline
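Enabling DocValues is a per-field change in schema.xml; a sketch (field and type names are assumptions, not from the slides):

```xml
<!-- docValues="true" stores a column-oriented copy of the field,
     so faceting no longer builds an uninverted index on the heap -->
<field name="status" type="int"    indexed="true" stored="true" docValues="true"/>
<field name="ip"     type="string" indexed="true" stored="true" docValues="true"/>
```

Fields with docValues must be reindexed for the change to take effect.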

Page 19

[Chart: charts/debug latency and EPS vs. index size, 3M–36M logs]

Detour: what if user agent was string?

3.6x baseline

Page 20

[Chart: charts/debug latency and EPS vs. index size, 8M–70.5M logs]

… and if user agent used DocValues?

6.7x baseline

reducing indexing adds 5% capacity

Page 21

[Chart: charts/debug latency and EPS vs. index size, 3M–28M logs]

Time based collections (1 minute)

2.7x baseline

OOM (150 collections)

Page 22

[Chart: charts/debug latency and EPS vs. index size, 10M–213M logs]

Time based collections (10 minutes)

21x baseline

still OOM (~100 collections)
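The core trick behind time-based collections is routing each log to a collection named after its time bucket, so whole collections can be dropped cheaply once they age out. A minimal sketch of the 10-minute bucketing (the `logs_YYYYMMDD_HHMM` naming scheme is an assumption):

```python
from datetime import datetime, timezone

def collection_for(ts: datetime, bucket_minutes: int = 10) -> str:
    """Round the timestamp down to its bucket and derive the collection name."""
    minute = (ts.minute // bucket_minutes) * bucket_minutes
    bucket = ts.replace(minute=minute, second=0, microsecond=0)
    return bucket.strftime("logs_%Y%m%d_%H%M")

ts = datetime(2014, 11, 7, 14, 37, 12, tzinfo=timezone.utc)
print(collection_for(ts))  # logs_20141107_1430
```

Dropping old data is then a collection delete per bucket instead of a delete-by-query over one huge index.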

Page 23

[Chart: charts/debug latency and EPS vs. index size, 50M–340M logs]

10min collections: 20GB heap; optimize old

31x baseline, 5 days projected retention with 10x spikes

no more OOM, just slower queries

34x baseline, 10 days projected retention (10x spikes)

Page 24

Software optimizations recap

Definitely worth it:
- time-based collections
- DocValues
- rare soft commits

Nice to have:
- noop I/O scheduler
- omit norms, term frequencies and positions
- optimize “old” collections

I wouldn't bother:
- merge policy tuning
- autowarming
- super-rare soft commits
- disabling THP

Page 25

[Chart: charts/debug latency and EPS vs. index size, 20M–372M logs]

r3.2xlarge: +30GB RAM, +$0.14/h, 1x160GB SSD

37x baseline, 9 days projected retention with 10x spikes

less indexing throughput than m3.2xlarge

Page 26

[Chart: charts/debug latency and EPS vs. index size, 20M–177M logs]

c3.2xlarge: -15GB RAM, -$0.14/h

17x baseline, 5 days projected retention with 10x spikes

Page 27

Monthly EC2 cost per 1M logs*

m3.2xlarge: $1.30
r3.2xlarge: $1.33
c3.2xlarge: $1.78

TODO (a.k.a. truth always messes with simplicity):
- more/expensive facets => more CPU => c3 looks better
- less/cheap facets => not enough instance storage => EBS (magnetic/SSD/provisioned IOPS)? => storage-optimized i2? => old-gen instances with magnetic instance storage?
- use different instance types for “hot” and “cold” collections?

*on-demand pricing at 2014-11-07
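The cost-per-1M-logs figures above are just the monthly instance bill divided by capacity. A sketch of the arithmetic (the $0.56/hour on-demand price for m3.2xlarge is an assumption about 2014 pricing, not from the slides):

```python
def monthly_cost_per_million(hourly_price: float, capacity_millions: float) -> float:
    """EC2 bill for a 30-day month, divided by capacity in millions of logs."""
    return hourly_price * 24 * 30 / capacity_millions

# m3.2xlarge at an assumed $0.56/hour, 310M logs capacity -> ~$1.30 per 1M logs
print(round(monthly_cost_per_million(0.56, 310), 2))
```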

Page 28

How NOT to build an indexing pipeline

custom script: reads apache logs from files

parses them using regex

takes 100% CPU and 100% RAM from a c3.2xlarge instance

maxes out at 7K EPS

Page 29

Enter Apache Flume*

*Or Logstash. Or rsyslog. Or syslog-ng. Or any other specialized event processing tool

agent.sources = spoolSrc
agent.sources.spoolSrc.type = spooldir
agent.sources.spoolSrc.spoolDir = /var/log
agent.sources.spoolSrc.channels = solrChannel

agent.channels = solrChannel
agent.channels.solrChannel.type = file

agent.sinks = solrSink
agent.sinks.solrSink.channel = solrChannel
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.morphlineFile = conf/morphline.conf
agent.sinks.solrSink.morphlineId = 1

put Solr and Morphline jars in lib/


Page 30

morphline.conf (think Unix pipes)

morphlines : [
  {
    id : 1
    commands : [
      { readLine { charset : UTF-8 } }
      { grok {
          dictionaryFiles : [conf/grok-patterns]
          expressions : { message : """%{COMBINEDAPACHELOG}""" }
        } }
      { generateUUID { field : id } }
      { loadSolr {
          solrLocator : {
            collection : collection1
            solrUrl : "http://10.233.54.118:8983/solr/"
          }
        } }
    ]
  }
]

same ID as in the flume.conf sink definition

process one line at a time (there's also readMultiLine)

https://github.com/cloudera/search/blob/master/samples/solr-nrt/grok-dictionaries/grok-patterns

parses each property (e.g. IP, status code) into its own field

Solr can do it, too*

use zkHost for SolrCloud

*http://solr.pl/en/2013/07/08/automatically-generate-document-identifiers-solr-4-x/

Page 31

Result: 2.4K EPS, feeder machine almost idle

Page 32

2.4K EPS is typically enough for this

[Diagram: each application server runs its own Flume agent, shipping straight to Solr]

scales nicely with # of servers, but all buffering and processing is done on the application servers

Page 33

but not for this

[Diagram: application servers (each with a Flume agent) ship to a central tier of Flume agents that does all buffering and processing]

Page 34

or this

[Diagram: application servers ship to one Flume tier that buffers, then a separate Flume tier that processes]

Page 35

Increase throughput: batch sizes; memory channel

agent.sources = spoolSrc
agent.sources.spoolSrc.type = spooldir
agent.sources.spoolSrc.spoolDir = /var/log
agent.sources.spoolSrc.batchSize = 5000
agent.sources.spoolSrc.channels = solrChannel

agent.channels = solrChannel
agent.channels.solrChannel.type = memory    # was: file
agent.channels.solrChannel.capacity = 1000000
agent.channels.solrChannel.transactionCapacity = 5000

agent.sinks = solrSink
agent.sinks.solrSink.channel = solrChannel
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.morphlineFile = conf/morphline.conf
agent.sinks.solrSink.morphlineId = 1
agent.sinks.solrSink.batchSize = 5000

solrLocator : {
  collection : collection1
  solrUrl : "http://10.233.54.118:8983/solr/"
  batchSize : 5000
}

make sure you have enough heap
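A memory channel keeps up to `capacity` events on the heap, so a rough sizing check is capacity times average event size (the ~500-byte average log line here is an assumption):

```python
def channel_heap_mb(capacity_events: int, avg_event_bytes: int = 500) -> float:
    """Approximate heap needed just to hold a full memory channel, in MB."""
    return capacity_events * avg_event_bytes / 1024 / 1024

# A 1M-event memory channel with ~500-byte log lines needs ~477 MB of heap,
# on top of what the agent needs for batches in flight
print(round(channel_heap_mb(1_000_000)))
```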

Page 36

Result: 10K EPS, 6% CPU usage (2x baseline)

Page 37

More throughput? Parallelize

Depends* on the bottleneck

*last time I use this word, I promise

source: more threads (if applicable); more sources
channel: multiplexing channel selector
sink: load balancing sink processor; more threads (if applicable)

[Diagrams: one source per channel; two sources into one channel; one source multiplexed to two channels; one channel load-balanced across two sinks]
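The load balancing sink processor mentioned above maps to a Flume sink group; a sketch in flume.conf syntax (the second sink and its settings mirror the existing `solrSink` and are assumptions):

```
agent.sinks = solrSink1 solrSink2

agent.sinkgroups = solrGroup
agent.sinkgroups.solrGroup.sinks = solrSink1 solrSink2
agent.sinkgroups.solrGroup.processor.type = load_balance
agent.sinkgroups.solrGroup.processor.selector = round_robin
```

Both sinks drain the same channel, so morphline processing runs in parallel across them.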

Page 38

Result: default Solr install maxed out at 24K EPS

Page 39

TODO: log in JSON where you can

Then, in morphline.conf, replace the grok command with the much lighter: readJson {}

Easy with apache logs, maybe not for other apps:

LogFormat "{ \
  \"@timestamp\": \"%{%Y-%m-%dT%H:%M:%S%z}t\", \
  \"message\": \"%h %l %u %t \\\"%r\\\" %>s %b\", \
  ... \
  \"method\": \"%m\", \
  \"referer\": \"%{Referer}i\", \
  \"useragent\": \"%{User-agent}i\" \
}" ls_apache_json
CustomLog /var/log/apache2/logstash_test.ls_json ls_apache_json

More details at: http://untergeek.com/2013/09/11/getting-apache-to-output-json-for-logstash-1-2-x/
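With JSON logs, the morphline from Page 30 gets simpler; a sketch swapping grok for readJson (the extracted paths depend on your LogFormat and are assumptions):

```
morphlines : [
  {
    id : 1
    commands : [
      { readJson {} }
      { extractJsonPaths {
          flatten : true
          paths : { message : /message, method : /method, useragent : /useragent }
        } }
      { generateUUID { field : id } }
      { loadSolr {
          solrLocator : {
            collection : collection1
            solrUrl : "http://10.233.54.118:8983/solr/"
          }
        } }
    ]
  }
]
```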

Page 40

Conclusions

Use time-based collections and DocValues

Rare soft & hard commits are good. Pushing them too far is probably not worth it.

Hardware: test and see what works for you. A balanced, SSD-backed machine (like m3) is a good start.

Use specialized event processing tools. Apache Flume is a fine example.

Processing and buffering on the application server side scales better.
Buffer before [heavy] processing.
Mind your batch sizes, buffer types and parallelization.
Log in JSON where you can.

Page 41

Thank you!

Feel free to poke me @radu0gheorghe

Check us out at the booth, sematext.com and @sematext

We're hiring, too!