Log Analytics with Amazon Elasticsearch Service

Christoph Schmitter (csc@amazon.de)

What we'll cover

• Understanding Elasticsearch capabilities

• Elasticsearch, the technology

• Aggregations; ad-hoc analysis

• Amazon Elasticsearch Service is a drop-in replacement for self-managed Elasticsearch

• Q&A

Understanding Elasticsearch capabilities

Scenario: Log data analytics

• Application monitoring and event diagnosis

• You need to monitor the performance of your application, web servers, and hardware

• You need easy to use, yet powerful data visualization tools to detect issues in near real-time

• You want the ability to dig into your logs in an intuitive, fine-grained way

• Kibana provides fast, easy visualization

Scenario: Batch data analytics

• Reporting and Analysis

• You are a mobile app developer

• You have to monitor/manage users across multiple app versions

• You want to analyze and report on usage and migration between app versions

• Use Kibana for dashboarding. Use the query API for deeper analysis

Scenario: Full-text search

• Traditional search

• Your application or website provides search capabilities over diverse documents

• You are tasked with making this knowledge base searchable and accessible

• You need key search features including text matching, faceting, filtering, fuzzy search, auto complete, and highlighting

• Use the query API to support application search

CloudTrail delivers API calls to you

• AWS API call monitoring

• You need to understand the changing landscape of your AWS resources

• You need to do security analysis and compliance auditing

• You want the ability to dig into your logs in an intuitive, fine-grained way

How Elasticsearch can help

• Combined with Kibana, Elasticsearch provides a tool for search, real-time analytics, and data visualization

Demo Architecture

[Architecture diagram. Components: AWS resources (log lines), CloudTrail (logs), Amazon CloudWatch Logs, Amazon Elasticsearch Service]

Demo: Log Analytics

Elasticsearch, the technology

Elasticsearch is like a database

Search      Database
Value       Value
Field       Column
Document    Row
Index       Table
Cluster     Database
Queries     SQL

Documents are the core entity

[Diagram: a document with an ID and field values (F1 Value, F2 Value)]

{
  "eventVersion": "1.03",
  "eventTime": "2016-06-01T00:16:19Z",
  "eventSource": "dynamodb.amazonaws.com",
  "eventName": "DescribeStream",
  "awsRegion": "eu-west-1",
  "sourceIPAddress": "52.51.24.XX",
  "userAgent": "leb-kcl-580935a6-5f94-4ce0-ac69-cdeb609ba16a,amazon-kinesis-client-library-java-lambda_1.2.1, aws-internal/3",
  "requestParameters": {
    "streamArn": "arn:aws:dynamodb:eu-west-1:17816119XXXX:table/restaurant/stream/2016-04-08T18:07:53.837"
  },
  "responseElements": null,
  "requestID": "KC608PH8POAF2I184E2SL1PS2FVV4KQNSO5AEMVJF66Q9ASUAAJG",
  "eventID": "49b56379-903b-4f04-8ce5-d21bbfcf8ab3",
  "eventType": "AwsApiCall",
  "apiVersion": "2012-08-10",
  "recipientAccountId": "17816119XXXX",
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "AROAJBQVRM7LN25CAHX7Y:awslambda_338_20160531233813522",
    "arn": "arn:aws:sts::178161197791:assumed-role/geospatial-rec-engine-ApplicationExecutionRole-9LPKB77QMR97/awslambda_338_20160531233813522",
    ...

Lucene provides text analysis and indexing

Term ID   Term    Postings
0         quick   1,3,5
1         brown   2,3,4,6
2         fox     1,7,9
3         lazy    2,8
4         dog     24

[Diagram labels: IndexWriter, IndexSearcher, Segment]
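
The term/postings table above can be illustrated with a small sketch. This is a toy inverted index in plain Python, not Lucene's implementation: the sample documents and the whitespace-and-lowercase "analysis" step are assumptions made for illustration only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Toy analysis step (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "The quick fox",
    2: "The lazy brown dog",
    3: "Quick brown fox",
}

index = build_inverted_index(docs)
print(index["quick"])  # [1, 3]
print(index["brown"])  # [2, 3]
```

Looking up a term is then a single dictionary access that returns its postings list, which is what makes term lookups fast compared with scanning every document.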

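Elasticsearch query processing

[Diagram: query processing flow, in order]

• Query terms: quick, brown, fox, lazy (other indexed terms: lorem, ipsum, dolor, sit)

• Index lookup

• Matches: id: 216, id: 305, id: 486, id: 713

• Query logic and post-filtering; scoring, aggs

• Sorted matches (results): id: 713, id: 305, id: 486, id: 216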
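
The flow on this slide (index lookup, post-filtering, scoring, sorting) can be sketched in a few lines. This is a toy model under stated assumptions: the postings below are hypothetical, and the score is simply the number of matching query terms, which is far simpler than Lucene's actual relevance scoring.

```python
def search(index, query_terms, keep=lambda doc_id: True):
    """Toy pipeline: index lookup, post-filter, score, then sort by score."""
    scores = {}
    for term in query_terms:
        # Index lookup: fetch the postings list for each query term.
        for doc_id in index.get(term, []):
            # Toy scoring (assumption): one point per matching term.
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Post-filtering: drop matches rejected by the filter.
    matches = [doc_id for doc_id in scores if keep(doc_id)]
    # Sort by descending score; break ties by ascending ID.
    return sorted(matches, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical postings chosen to reuse the document IDs from the slide.
index = {
    "quick": [216, 305, 486, 713],
    "brown": [305, 486, 713],
    "fox": [305, 713],
    "lazy": [713],
}

results = search(index, ["quick", "brown", "fox", "lazy"])
print(results)  # [713, 305, 486, 216]
```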

Aggregations; ad-hoc analysis

Faceting: basic aggregation

• Query: shirt

Facets:

• Carhartt (1092)

• Russell Athletic (1087)

• Dickies (954)

• RALPH LAUREN (823)

• Wrangler (701)

• Doublju (259)

• Levi's (12)

[Diagram: a document with an ID and field values (F1 Value, F2 Value)]

Elasticsearch Aggregations

• Buckets – a collection of documents meeting some criterion

• Metrics – calculations on the content of buckets.

Bucket: time

Metric: count
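
To make "bucket: time, metric: count" concrete, here is a plain-Python sketch of what such an aggregation computes. It does not call Elasticsearch; the hourly bucket size and the second and third sample events are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime


def count_per_hour(events):
    """Bucket events by hour of eventTime and count the events per bucket."""
    buckets = Counter()
    for event in events:
        ts = datetime.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ")
        # Bucket key: the timestamp truncated to the start of its hour.
        buckets[ts.strftime("%Y-%m-%dT%H:00:00Z")] += 1
    return dict(sorted(buckets.items()))


# The first event reuses fields from the CloudTrail sample; the rest are hypothetical.
events = [
    {"eventTime": "2016-06-01T00:16:19Z", "eventName": "DescribeStream"},
    {"eventTime": "2016-06-01T00:48:02Z", "eventName": "GetRecords"},
    {"eventTime": "2016-06-01T01:05:40Z", "eventName": "DescribeStream"},
]

print(count_per_hour(events))
# {'2016-06-01T00:00:00Z': 2, '2016-06-01T01:00:00Z': 1}
```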

A more complicated aggregation

Bucket: ARN

Bucket: Region

Bucket: eventName

Metric: Count
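
A nested bucket aggregation like this one could be expressed as a search request body along the following lines. This is a sketch, not the request used in the demo: the field names are taken from the CloudTrail sample event earlier in the deck, the `by_...` aggregation names are hypothetical, and the count metric is assumed to come from the `doc_count` that each `terms` bucket reports.

```python
import json


def nested_terms_agg(fields):
    """Build nested `terms` aggregations, outermost field first."""
    body = {}
    level = body
    for field in fields:
        name = "by_" + field.replace(".", "_")  # hypothetical aggregation name
        level["aggs"] = {name: {"terms": {"field": field}}}
        # Descend so the next field nests inside this bucket.
        level = level["aggs"][name]
    return body


# "size": 0 asks for aggregation results only, without the matching documents.
request_body = {
    "size": 0,
    **nested_terms_agg(["userIdentity.arn", "awsRegion", "eventName"]),
}
print(json.dumps(request_body, indent=2))
```

The resulting JSON would be sent as the body of a `_search` request against the index that holds the CloudTrail events.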

More kinds of aggregations

Buckets

• Date histogram

• Histogram

• Range

• Terms

• Filters

• Significant terms

Metrics

• Count

• Average

• Sum

• Min

• Max

• Std. Dev

• Unique Count

• Percentiles

Setting up your cluster

Shards: independent collections of documents

[Diagram: the documents (Id, Id, Id, ...) of an Index/Type are split across Shard 1, Shard 2, Shard 3, and Shard 4]
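
How a document ends up in one particular shard can be sketched as a hash of its routing value (the document ID by default) taken modulo the number of primary shards. The sketch below uses `zlib.crc32` purely as a stand-in hash; Elasticsearch uses its own hash function, so the shard numbers printed here will not match a real cluster.

```python
import zlib


def shard_for(doc_id, number_of_primary_shards):
    """Pick a shard (0-based) for a document ID: hash modulo shard count."""
    # zlib.crc32 is a stand-in (assumption), not the hash Elasticsearch uses.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


for doc_id in ["216", "305", "486", "713"]:
    print(doc_id, "-> shard", shard_for(doc_id, 4))
```

Because the shard is derived from the shard count, changing the number of primary shards would send existing IDs to different shards, which is one reason the shard count is chosen up front.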

Deployment of indices to a cluster

• Index 1
  – Shard 1
  – Shard 2
  – Shard 3

• Index 2
  – Shard 1
  – Shard 2
  – Shard 3

[Diagram: Amazon ES cluster with three instances, each holding a mix of primary and replica shards. Instance 1 (master): shards 1, 3, 3, 1. Instance 2: shards 2, 1, 1, 2. Instance 3: shards 3, 2, 2, 3.]

Determining storage

• Data:Index ratio is typically close to 1:1

• Add a replica, double the storage

• Figure out data node count based on storage
  – Current limits: 10 TB EBS, 32 TB instance store
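
The sizing rules above reduce to simple arithmetic. The sketch below assumes a 1:1 data-to-index ratio and one replica, as on the slide; the 2,000 GB of source data and the 512 GB of storage per node are hypothetical inputs, and no headroom for growth or operating overhead is included.

```python
import math


def required_storage_gb(source_gb, replicas=1, index_ratio=1.0):
    """Index size is source size times the ratio; each replica adds a full copy."""
    return source_gb * index_ratio * (1 + replicas)


def data_node_count(total_gb, per_node_gb):
    """Smallest number of data nodes whose combined storage holds the total."""
    return math.ceil(total_gb / per_node_gb)


total = required_storage_gb(source_gb=2000, replicas=1)
print(total)                                    # 4000.0
print(data_node_count(total, per_node_gb=512))  # 8
```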

Determining instance type

• Instance type is workload-dependent

• T2: dev, test, QA

• M3: solid performance

• R3: heavier queries, aggs

• I2: largest storage option

Best practices

• Use the minimum number of shards that keeps each shard at or below 50 GB of data

• Number of replicas = 1

• For all prod workloads: use 3 dedicated masters

• Use the _bulk API. Some ingest mechanisms do this automatically

• Increase index.refresh_interval for higher throughput

Indexing strategy

Indexing strategy for streaming data

• Use an index per time period, typically index-per-day; high-volume workloads can go to index-per-hour

• Shard the index according to data size; use 50GB as a soft limit per shard

• Master nodes increase cluster stability
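
The index-per-day and 50 GB-per-shard guidance can be sketched as two small helpers. The `cwl` prefix echoes the `cwl-*` wildcard used later in the deck; the `YYYY.MM.DD` suffix and the daily volumes are assumptions for illustration.

```python
import math
from datetime import date


def daily_index_name(prefix, day):
    """Name of the index that holds one day of data, e.g. cwl-2016.06.01."""
    return f"{prefix}-{day:%Y.%m.%d}"


def shard_count(expected_gb, max_shard_gb=50):
    """Minimum number of shards that keeps each shard at or below the soft limit."""
    return max(1, math.ceil(expected_gb / max_shard_gb))


print(daily_index_name("cwl", date(2016, 6, 1)))  # cwl-2016.06.01
print(shard_count(120))  # 3
print(shard_count(10))   # 1
```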

Index settings control sharding and more

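# Note: number_of_shards can only be set when the index is created;
# number_of_replicas and refresh_interval can be updated later.
curl -XPUT <endpoint>/<index>/_settings -d '{
  "number_of_shards" : 5,
  "number_of_replicas" : 1,
  "refresh_interval" : "5s"
}'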

Mappings control how data is indexed

curl -XPUT <endpoint>/<index> -d '{
  "mappings" : {
    "<type>" : {
      "properties" : {
        "eventName" : { "type" : "string", "index" : "not_analyzed" }
      }
    }
  }
}'

Index templates simplify mapping creation

curl -XPUT <endpoint>/_template/<name> -d '{
  "template" : "<wildcard e.g. cwl-*>",
  "settings" : { "number_of_shards" : 2 },
  "mappings" : {
    "<type, e.g. _default_>" : {
      "dynamic_templates" : [
        { "<name>" : {
            "match_mapping_type" : "string",
            "mapping" : { "type" : "string", "index" : "not_analyzed" }
        } }
      ],
      "properties" : {
        "@timestamp" : { "type" : "date" }
      }
    }
  }
}'

Don't forget the query API!

Direct access to the Elasticsearch API

$ curl -XPUT https://<endpoint>/blog -d '{
    "settings" : { "number_of_shards" : 3, "number_of_replicas" : 1 }
  }'

$ curl -XPOST http://<endpoint>/blog/post/1 -d '{
    "author" : "jon handler",
    "title" : "Amazon ES Launch"
  }'

$ curl -XPOST https://<endpoint>/blog/post/_bulk -d '{ "index" : { "_index" : "blog", "_type" : "post", "_id" : "2" } }
{ "title" : "Amazon ES for search", "author" : "carl meadows" }
{ "index" : { "_index" : "blog", "_type" : "post", "_id" : "3" } }
{ "title" : "Analytics too", "author" : "vivek sriram" }
'

$ curl -XGET 'http://<endpoint>/_search?q=ES'
{
  "took" : 16,
  "timed_out" : false,
  "_shards" : { "total" : 3, "successful" : 3, "failed" : 0 },
  "hits" : {
    "total" : 2,
    "max_score" : 0.13424811,
    "hits" : [
      { "_index" : "blog", "_type" : "post", "_id" : "1", "_score" : 0.13424811,
        "_source" : { "author" : "jon handler", "title" : "Amazon ES Launch" } },
      { "_index" : "blog", "_type" : "post", "_id" : "2", "_score" : 0.11506981,
        "_source" : { "title" : "Amazon ES for search", "author" : "carl meadows" } }
    ]
  }
}
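
The _bulk request body is newline-delimited JSON: one action line followed by one source line per document, with a trailing newline at the end. A minimal sketch of building such a body in Python follows; the helper name `bulk_body` is hypothetical, and the `_type` field mirrors the slides, which reflect the Elasticsearch versions current at the time.

```python
import json


def bulk_body(index, doc_type, docs):
    """Build a _bulk body: an action line, then a source line, per document."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body("blog", "post", [
    ("2", {"title": "Amazon ES for search", "author": "carl meadows"}),
    ("3", {"title": "Analytics too", "author": "vivek sriram"}),
])
print(body, end="")
```

Each line is a complete JSON object on its own, so there are no commas between lines and no enclosing array.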

Elasticsearch is a full-featured search engine

• Built on Lucene, the popular, open-source library

• Search structured and unstructured data with complex, boolean queries

• Supports common search features: geo search, aggregations, highlighting, search suggestions, and more

Challenges with self-managed Elasticsearch

• Easy to get started, challenging to scale

• Scaling ingest pipelines is difficult

• Undifferentiated heavy lifting

Amazon Elasticsearch Service

Amazon ES overview

[Diagram labels: Amazon Route 53, Elastic Load Balancing, IAM, CloudWatch, Elasticsearch API, CloudTrail]

Easy cluster configuration and reconfiguration

• Elasticsearch Version

• Data nodes, count and type

• Master nodes, count and type

• Storage option – EBS/instance

• HA option

• Advanced options

High availability with Zone Awareness

[Diagram: Amazon ES cluster with four instances (Instance 1 to Instance 4) split across Availability Zone 1 and Availability Zone 2; copies of shards 1, 2, and 3 are distributed across the instances in both zones]

Monitor with CloudWatch metrics

• FreeStorageSpace – monitor and alarm before the cluster runs out of space

• CPUUtilization – alarm at 80% CPU to signal the need to scale up

• ClusterStatus.yellow – check whether replication requires additional nodes

• JVMMemoryPressure – check instance type and count for sufficient resources

• MasterCPUUtilization – monitoring for master nodes is separated from data nodes
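
As one example of the alarms suggested above, the CPU alarm at 80% could be described by a set of CloudWatch alarm parameters. The sketch below only builds the parameter dictionary; it assumes the `AWS/ES` namespace with `DomainName` and `ClientId` dimensions, and the alarm name, period, evaluation periods, domain name, and account ID are hypothetical choices.

```python
def cpu_alarm_params(domain_name, account_id, threshold_percent=80.0):
    """Parameters for a CloudWatch alarm on a domain's CPUUtilization metric."""
    return {
        "AlarmName": f"{domain_name}-cpu-high",  # hypothetical naming scheme
        "Namespace": "AWS/ES",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": "Average",
        "Period": 300,           # seconds per data point (assumption)
        "EvaluationPeriods": 3,  # consecutive periods over threshold (assumption)
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


params = cpu_alarm_params("logs-domain", "123456789012")
print(params["MetricName"], params["Threshold"])  # CPUUtilization 80.0
```

With an AWS SDK such as boto3 (not imported here), a dictionary like this could be passed to the CloudWatch client's `put_metric_alarm` call as keyword arguments.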

Integration with the AWS ecosystem

[Diagram: data sources and ingest paths into Amazon Elasticsearch Service. Labels: Logstash, REST, CWL Agent, EC2 Instances, Amazon Kinesis, Amazon RDS, Amazon DynamoDB, Amazon SQS Queue, Logstash Cluster, Amazon CloudWatch, AWS Lambda, AWS CloudTrail, Access Logs, Amazon VPC Flow Logs, Amazon S3 bucket, AWS IoT, Amazon Kinesis Firehose, Amazon ECS]

Security with IAM

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:user/susan" },
      "Action": [
        "es:ESHttpGet", "es:ESHttpPut", "es:ESHttpPost",
        "es:CreateElasticsearchDomain", "es:ListDomainNames"
      ],
      "Resource": "arn:aws:es:us-east-1:###:domain/logs-domain/<index>/*"
    }
  ]
}

Pay for compute and storage you use

• With Amazon Elasticsearch Service, you pay only for the compute and storage resources you use. The AWS Free Tier is available to qualifying customers.

Wrap up

• Combined with Kibana, Elasticsearch provides search and visualization for streaming data and full-text use cases.

• Elasticsearch is based on Lucene, which reads and writes search indices

• Aggregations allow you to analyze your data, splitting into Buckets and computing Metrics

• Amazon Elasticsearch Service makes it easy to set up and manage your Elasticsearch cluster on AWS

• Amazon ES is a great way to get started with Elasticsearch!

Q&A

• Christoph Schmitter, Solutions Architect: csc@amazon.de

• https://run.qwiklab.com/searches/elasticsearch

Demo Screenshots
