ELK Wrestling (Leeds DevOps)


DESCRIPTION

Talk I did on log aggregation with the ELK stack at Leeds DevOps. Covers how we process over 800,000 logs per hour at LateRooms, and the cultural changes this has helped drive.

Citation preview

{
  title: ‘ELK Wrestling’,
  author: ‘Steve Elliott’,
  company: ‘LateRooms.com’,
  type: ‘DevOpsLeeds’,
  @timestamp: ‘2014-10-13T18:30Z’
}

Featuring Live Demo!

Please tweet! Include: “leedsdevops”

Home growing a metrics culture
● Needed visibility of live issues
● Had trialled off the shelf before (Splunk)
● Hadn’t gained traction
● Wanted the data still

Options...

Tried Splunk

...Bit pricey: pay for hardware and for the volume of data indexed

Looked at cloud-based options; they were also expensive

It started with Badger...

Logging and Monitoring Project

● Locate and implement the tools we needed
● Started with Cube for metrics (wouldn’t recommend)
● Moved on to Logging

Current tooling...

...Lacking

“But it works”

What can we log?
Pretty much anything with a timestamp:

● Error logs
● Web logs
● Proxy logs
● Releases?
● Tweets?
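To make that concrete, here's a hedged sketch of how a single web log line might look once it becomes a timestamped JSON document in Elasticsearch; the field names and values are invented for illustration, not taken from the LateRooms setup:

{
  "@timestamp": "2014-10-13T18:30:00Z",
  "type": "iis",
  "host": "web-01",
  "clientip": "203.0.113.7",
  "verb": "GET",
  "request": "/hotel/12345",
  "response": 200,
  "time_taken_ms": 87
}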

Logstash

ELK

High level architectural design

(Diagram: web servers → queue → Elasticsearch → dashboards, plus the rest of Badger)

Real-time search and analytics database

Who’s using it?

...Clever people

Certain other hotel website...

Working with Elasticsearch

● RESTful API
● JSON
● Many libraries to deal with it (new one: ElasticLinq for C#)

Sense Chrome Extension
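As a rough illustration of the RESTful/JSON side (the index name and field are placeholders, not the real LateRooms data), a Sense-style search request looks like:

POST /logstash-2014.10.13/_search
{
  "query": {
    "match": { "message": "error" }
  },
  "size": 10
}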

Clustering

● Excellent distributed features
● Easy to use
● Node self-discovery
● Different node types (Data, Master, Search, Client)

“Live” nodes: SSD

“Archive” nodes: HDD
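A minimal sketch of how node roles and the live/archive split can be wired up with standard Elasticsearch 1.x settings; the box_type attribute name and index name are illustrative assumptions, not the actual configuration from the talk:

# elasticsearch.yml on a data node in the "Live" (SSD) tier
node.master: false
node.data: true
node.box_type: live

# Keep a fresh index's shards on the live tier (move it to "archive" later)
PUT /logstash-2014.10.13/_settings
{
  "index.routing.allocation.require.box_type": "live"
}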

More in-depth architecture

(Diagram: IIS logs, errors and WMI data are gathered by a collector, e.g. on a live server, pushed by a queue forwarder into RabbitMQ, then filtered & forwarded on to Cube (/TSDB) and to search/analytics)

Logstash

● Inputs (e.g. HTTP logs, UDP, error logs, tweets)
● Filters (e.g. filter, grok, lookup IP, magic…)
● Outputs (e.g. UDP, elasticsearch, graphite, IRC)
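A hedged sketch of a Logstash (1.4-era) pipeline in that shape; the ports, paths and hostnames are placeholders rather than the actual LateRooms config:

input {
  udp  { port => 5140 type => "iis" }
  file { path => "/var/log/app/error.log" type => "error" }
}

filter {
  if [type] == "iis" {
    grok  { match => [ "message", "%{COMBINEDAPACHELOG}" ] }   # parse the raw line
    geoip { source => "clientip" }                             # the "lookup IP" step
  }
}

output {
  elasticsearch { host => "es.example.internal" }   # others: rabbitmq, graphite, irc...
}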

Why the Queue?

● Resiliency
● Single source of data for everyone
● Logstash used to recommend RabbitMQ, now they recommend Redis
● We still use RabbitMQ, works for us
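Roughly, the shippers publish into RabbitMQ and the central Logstash consumes from it. This is a sketch with made-up exchange and queue names, assuming the standard rabbitmq input/output plugins:

# Shipper / queue forwarder
output {
  rabbitmq {
    host => "rabbit.example.internal"
    exchange => "logstash"
    exchange_type => "direct"
    key => "logs"
  }
}

# Central Logstash (filter & forward)
input {
  rabbitmq {
    host => "rabbit.example.internal"
    queue => "logstash"
    key => "logs"
  }
}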

Kibana

● Easy to build dashboards
● Gateway drug to Elasticsearch queries
● Examples!
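For a sense of what the "gateway drug" leads to: a hits-over-time panel boils down to roughly this kind of Elasticsearch request (Kibana 4 moves to aggregations like the one below; the index and type names are placeholders):

POST /logstash-2014.10.13/_search
{
  "size": 0,
  "query": { "match": { "type": "iis" } },
  "aggs": {
    "hits_per_minute": {
      "date_histogram": { "field": "@timestamp", "interval": "1m" }
    }
  }
}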

But...

Demo

Mistake: Dashboard Fatigue

Too many dashboards to watch!
Need to do more on alerting

Mistake: Using elasticsearch as a TSDB

Lots of graphs only cared about top-level values; a TSDB (such as Graphite) is a better fit for those

Elasticsearch’s use case is more in-depth data analysis
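For those top-level values a TSDB needs almost nothing per data point. Graphite's plaintext protocol, for example, is just "metric-path value unix-timestamp" sent to port 2003; the metric name and numbers below are placeholders:

web.requests.count 42 1413224400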

Mistake: Trying to keep too much data

● Nodes running out of memory or disk space is bad
● Long GC can cause nodes to drop
● Can lead to split brain
● More shards = more memory usage, watch your scaling
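One way to keep shard count (and hence memory) under control is an index template for the daily indices; this is a sketch with placeholder numbers, assuming the standard Elasticsearch 1.x template API:

PUT /_template/logstash
{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}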

Scaling

Hit two bottlenecks:
- Ingestion (solved with SSDs)
- Search (solved by scaling horizontally)
1.4.0 brings stability improvements, should handle OOM better

Other Mistakes
● Should have automated sooner (good Chef/Puppet support)
● Should have used “normal” Logstash more

More node

More awesome??

What went right?

● Free and easy access to data
● Doesn’t need to be in Elasticsearch, but the tooling makes it easy
● Give people access and they’ll seek out the data to drive decisions - start the feedback loop

● Dev/Test instance

ELK in the wild

Data Driven QA

Data Driven...Managering

But wait, there’s more!

Curator, Kibana 4 (Woo - aggregations), alerting, linking logs together…

Too much to cover here!

Thanks for Listening!

More: elasticsearch.org, logstash.net
Blog: www.tegud.net
Twitter: @tegud
GitHub: www.github.com/tegud

Come say hi!
