
Page 1: Toronto High Scalability meetup - Scaling ELK

Scaling Logging and Monitoring

JP Parkin ([email protected])

Page 2: Toronto High Scalability meetup - Scaling ELK

User Scenarios
• Internal IBM Infrastructure
  – Relatively small number of groups that generate a ton of logs (groups that want to generate 3-5 TB/day)
  – Logs are produced on VMs running various cloud services operated by IBM
• External Bluemix Log Producers
  – Relatively large number of groups (Bluemix organizations) that generate a variety of log data but in relatively smaller quantities – total logs measured anywhere from kilobytes to gigabytes/day
  – Only a handful of organizations are currently generating large volumes of log data

Page 3: Toronto High Scalability meetup - Scaling ELK

Bluemix Logging

Page 4: Toronto High Scalability meetup - Scaling ELK

Bluemix Metrics

Page 5: Toronto High Scalability meetup - Scaling ELK

Advanced View - Kibana

Page 6: Toronto High Scalability meetup - Scaling ELK

Advanced View - Grafana

Page 7: Toronto High Scalability meetup - Scaling ELK

Grafana – Build your own Dashboard

Page 8: Toronto High Scalability meetup - Scaling ELK

Service Architecture

Key facts
• OpenStack Heat automation (sketch below)
  – Multiple AutoScale Groups (ASGs)
  – Docker image per ASG
  – Ansible to configure software
• Currently deployed on OpenStack
  – Virtual machines host a single Docker container
  – Security groups for firewall rules
  – HAProxy for load balancing
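For illustration, one AutoScale Group in a Heat template might look roughly like this; the resource name, sizes, flavor, image and network are hypothetical, not taken from the actual templates:

    heat_template_version: 2014-10-16
    resources:
      logstash_asg:                        # one AutoScale Group per component (name is hypothetical)
        type: OS::Heat::AutoScalingGroup
        properties:
          min_size: 2
          max_size: 12
          resource:
            type: OS::Nova::Server         # each VM hosts a single Docker container
            properties:
              flavor: m1.large             # hypothetical flavor
              image: docker-host-image     # hypothetical image
              networks:
                - network: private-net     # hypothetical network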

Page 9: Toronto High Scalability meetup - Scaling ELK

Deployment and Automation
• OpenStack deployment using Heat templates
  – Provides scale-up/down capabilities to add capacity when needed
• Ansible configuration automation integrated with the Heat deployment to configure the nodes (sketch below)
• Docker images are used as our standard deployment artifact (configured by Ansible)
• Jenkins jobs for building and testing the Docker images
• UCD automation for deployment and upgrade processing – provides operational management for tracking what is deployed to each of the environments
• Mixture of Jenkins and UCD jobs to manage the daily operations, including items such as data expiration, index pre-creation and various health check scripts
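A minimal sketch of the Ansible side, assuming a simple play that renders a component's configuration and restarts its Docker container; the inventory group, paths and container name are illustrative:

    - hosts: logstash_indexers                        # hypothetical inventory group
      become: true
      tasks:
        - name: render the indexer configuration
          template:
            src: indexer.conf.j2                      # illustrative template and path
            dest: /etc/logstash/indexer.conf
        - name: restart the indexer container
          shell: docker restart logstash-indexer      # hypothetical container name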

Page 10: Toronto High Scalability meetup - Scaling ELK

Node Configurations

System                     | CPU | Memory | Java Heap | Local Disk
Lumberjack                 | 4   | 8 GB   | 5 GB      | 25 GB
Logstash                   | 4   | 8 GB   | 5 GB      | 25 GB
Kafka                      | 4   | 8 GB   | 3 GB      | 25 GB + 5 TB volume
Elasticsearch Master Node  | 10  | 32 GB  | 16 GB     | 25 GB
Elasticsearch Http Node    | 10  | 32 GB  | 16 GB     | 25 GB
Elasticsearch Data Node    | 20  | 64 GB  | 30 GB     | 18 TB spinning local RAID disk

Page 11: Toronto High Scalability meetup - Scaling ELK

Multi-tenant Logstash Forwarder
• Took the logstash forwarder and added multi-tenancy capabilities
• Similar changes to the logstash input lumberjack plugin
• Fixed log rotation handling in the MT-LSF – it was triggering disk-full problems on clients because it held locks on files for up to 24 hours before timing out
• Found that increasing the spool size gave some performance improvement up to a point: 512 was the sweet spot, and larger values ended up performing worse

Page 12: Toronto High Scalability meetup - Scaling ELK

Multi-tenant Lumberjack Server
• The lumberjack server had issues with long-lasting connections and file descriptor leaks that required frequent restarts under load
• Terminated connections on the client to get better server utilization (forcing a load balancer switch), but this didn't resolve the underlying issue
• The public Logstash 1.5.2 lumberjack plugin solved the connection problems with a fix to the JRuby OpenSSL library, which had been leaking file descriptors under load
• Switching the kafka output plugin to run async gave some performance improvement (10-15%)
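A minimal sketch of that server pipeline, assuming the stock lumberjack input and the Logstash 1.5-era kafka output; the port, certificate paths, brokers and topic are placeholders:

    input {
      lumberjack {
        port            => 5043
        ssl_certificate => "/etc/pki/lumberjack.crt"   # placeholder path
        ssl_key         => "/etc/pki/lumberjack.key"   # placeholder path
      }
    }
    output {
      kafka {
        broker_list   => "kafka1:9092,kafka2:9092"     # placeholder brokers
        topic_id      => "logs"                        # placeholder topic
        producer_type => "async"                       # async send gave ~10-15% more throughput
      }
    }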

Page 13: Toronto High Scalability meetup - Scaling ELK

Logstash Lumberjack Performance
• The great thing about Logstash is that it's a Swiss Army knife for solving data transformation problems
• 12 lumberjack servers in a cluster can process about 50 MB/s ≈ 4.3 TB/day, which is pretty good for most logging applications
• If you are only using the basic input/output functionality, then building a specific task-based solution can give better performance
• We are prototyping a replacement for the logstash server that handles the mt-lumberjack processing, and initial results are very good – in the area of a 12x throughput improvement on the same hardware
• The queuing mechanism that makes Logstash very flexible turns out to also be one of the bottlenecks when stressing the platform

Page 14: Toronto High Scalability meetup - Scaling ELK

Kafka

• Distributed messaging system for buffering log and metric data

• We keep 3 days' worth of data so we can absorb input spikes and buffer logs when Elasticsearch or the Logstash indexers are not performing well

• Kafka's own logs can become quite large when errors occur, so getting the right logging settings is important
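The 3-day buffer is just broker-level retention; the relevant server.properties line looks like this (the segment size shown is the Kafka default, for context):

    log.retention.hours=72          # keep ~3 days of buffered log and metric data
    log.segment.bytes=1073741824    # 1 GB segments (default)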

Page 15: Toronto High Scalability meetup - Scaling ELK

Logstash Indexers
• Logstash indexers are responsible for processing the log entries and pushing the data to Elasticsearch
• Stability of the Logstash 1.4.2 plugins for ES was not good
  – Tried all 3 protocols (node, transport, http)
  – Node was fast but had issues when large metadata was transferred on ES node failures (frequent OOMs)
  – Transport had reasonable performance and stability but did not have multi-node support
  – Http had the best performance after tuning to use a larger batch size, but did not have multi-node support
• The Logstash 1.5.2 ES plugins all have multi-node support
• Settled on the 1.5.2 http protocol version running against dedicated http client nodes in the cluster
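Roughly what that output looks like; the host names and batch size are placeholders, not the production values:

    output {
      elasticsearch {
        protocol   => "http"                                   # Logstash 1.5.2 http protocol
        host       => ["es-http-1", "es-http-2", "es-http-3"]  # dedicated http client nodes (placeholder names)
        flush_size => 5000                                     # larger batch size; illustrative value
      }
    }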

Page 16: Toronto High Scalability meetup - Scaling ELK

Logstash Indexers

• Even with Logstash 1.5.2, the indexers are somewhat gated by the amount of data a single node can process

• Expanded the number of Kafka partitions to allow growth beyond the initial 19 partitions we had allocated for the logging topic (see below)

• Logstash indexers can now be scaled beyond 19 nodes, to the point where we can stress the ES cluster
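Expanding partitions is an online operation; a Kafka 0.8.x-style example with a placeholder ZooKeeper address, topic name and new partition count:

    bin/kafka-topics.sh --zookeeper zk1:2181 --alter --topic logs --partitions 38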

Page 17: Toronto High Scalability meetup - Scaling ELK

Indexing Log Data

• Relying on your users to be well behaved is dangerous – some logs contain what appears to be a well-formed JSON document with a GUID as a key, and all of a sudden the field metadata explodes in ES

• You need to monitor which documents you run through the json filter in Logstash (sketch below)

• Adding filters to Logstash also slows down the indexing process, especially if you are attempting to use many of the cool plugins
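A minimal sketch of gating the json filter so only vetted documents get parsed; the field and condition are hypothetical:

    filter {
      # only parse messages explicitly marked as trusted JSON (condition is hypothetical)
      if [type] == "trusted_json" {
        json {
          source => "message"
        }
      }
    }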

Page 18: Toronto High Scalability meetup - Scaling ELK

Elasticsearch

• If your network is a problem, then ES is not going to be happy

• Elasticsearch 1.4.4 did not react well to network blips – indexes would start shuffling themselves around trying to proactively recover, which generally resulted in long recovery times with the default configuration

• The default recovery settings meant clusters remained red or yellow for extended periods, which impacted data ingestion

• Elasticsearch 1.7.1 has been much more stable for us
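These are the kinds of elasticsearch.yml knobs involved for flaky networks; the values below are illustrative, not necessarily what runs in production:

    discovery.zen.minimum_master_nodes: 2            # quorum of the 3 master nodes
    discovery.zen.fd.ping_timeout: 60s               # tolerate slower pings before declaring a node dead
    discovery.zen.fd.ping_retries: 6
    index.unassigned.node_left.delayed_timeout: 5m   # ES 1.7+: delay reallocation after a node drops out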

Page 19: Toronto High Scalability meetup - Scaling ELK

Sharding

• Pre-allocating the right number of shards for an index is hard if you don't know how much data you are going to get

• A target that seems to work well is about 25 GB per shard

• Problems with shard size are really highlighted when you need to recover a failed node

• How many shards can you put in an ES cluster?
  – We found 80k was too many -> we changed how we allocate shards, based on historical usage
  – We think that for our clusters the practical limit is about 40k
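Shard counts can be applied per index via templates; a hypothetical example for a tenant expected to need ~250 GB/day at the ~25 GB/shard target:

    curl -XPUT 'http://localhost:9200/_template/logs-bigtenant' -d '{
      "template": "logs-bigtenant-*",
      "settings": {
        "index.number_of_shards": 10,
        "index.number_of_replicas": 2
      }
    }'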

Page 20: Toronto High Scalability meetup - Scaling ELK

Elasticsearch Configurations

• 3 master nodes
• 10 data nodes per cluster
• 3 http nodes per cluster for queries
• 30 GB heap
• 2 data replicas to allow 2 node failures
• index.translog.flush_threshold_size: 1g
• indices.fielddata.cache.size: 50%
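Pulled together, the data-node side of that configuration corresponds roughly to the following (heap via the environment, the rest in elasticsearch.yml):

    # environment
    ES_HEAP_SIZE=30g

    # elasticsearch.yml (data node)
    node.master: false
    node.data: true
    index.number_of_replicas: 2
    index.translog.flush_threshold_size: 1g
    indices.fielddata.cache.size: 50%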

Page 21: Toronto High Scalability meetup - Scaling ELK

Elasticsearch Recovery

• Increase the rate at which an index can recover:
  indices.recovery.max_bytes_per_sec: 200mb

• Increase the number of concurrent recoveries supported:
  cluster.routing.allocation.node_concurrent_recoveries: 500

• Having the Kafka cluster caching data covers us for the windows during recovery when data is delayed getting into ES
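Both settings are dynamic and can be applied to a live cluster through the cluster settings API, e.g.:

    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": {
        "indices.recovery.max_bytes_per_sec": "200mb",
        "cluster.routing.allocation.node_concurrent_recoveries": 500
      }
    }'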

Page 22: Toronto High Scalability meetup - Scaling ELK

Elasticsearch Load Testing

• Run client drivers to simulate traffic into the external stack

• We have a number of sample workloads from real tenants that we use in our load tests

• There are lots of knobs to tune ES so having some consistent workloads to validate our theories has been invaluable

Page 23: Toronto High Scalability meetup - Scaling ELK

Performance

• Clusters running in production can support up to around 70k records/sec ( 30 MB/s ) based on our monitoring

• In our performance environments we are seeing consistent numbers beyond 40 MB/s

• For larger indexes, increasing the number of shards helped – 50 GB of logs spread across 10 shards was loaded about 50% faster than with 5 shards

Page 24: Toronto High Scalability meetup - Scaling ELK

Adjust throttling for loading large indices

[Bar chart: indexing throughput in MB/sec (axis 0-45) for a 10-shard index under three throttling settings – baseline 20 mb throttle, 100 mb throttle, and none.]
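The throttle being varied here is presumably the ES store (merge) throttle, which defaults to 20mb in 1.x; it is a dynamic cluster setting, e.g.:

    # raise the throttle to 100mb; setting the type to "none" disables throttling entirely
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": {
        "indices.store.throttle.type": "merge",
        "indices.store.throttle.max_bytes_per_sec": "100mb"
      }
    }'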

Page 25: Toronto High Scalability meetup - Scaling ELK

Scaling Elasticsearch
Multiple Elasticsearch Clusters

• Tenants get placed onto an ES cluster
• Tribe nodes to federate access across ES clusters
  – Enables massive tenants to span ES clusters
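A minimal tribe-node elasticsearch.yml sketch, assuming two hypothetical backing clusters:

    tribe:
      logging_a:
        cluster.name: es-logging-a   # hypothetical cluster names
      logging_b:
        cluster.name: es-logging-b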