© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. ARC 306: Lumberjacking on AWS Cutting Through Logs to Find What Matters Guy Ernest, Solutions Architecture November 15, 2013

Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

DESCRIPTION

AWS offers services that revolutionize the scale and cost for customers to extract information from large data sets, commonly called Big Data. This session analyzes Amazon CloudFront logs combined with additional structured data as a scenario for correlating log and transactional data. Successfully implementing this type of solution requires architects and developers to assemble a set of services with multiple decision points. The session provides a design and example of architecting and implementing the scenario using Amazon S3, AWS Data Pipeline, Amazon Elastic MapReduce, and Amazon Redshift. It explores loading, query performance, security, incremental updates, and design trade-off decisions.

Page 1: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Page 2: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 3: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 4: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

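Progress Is Not Evenly Distributed

| Metric | 1980 | Today | Change |
|---|---|---|---|
| Cost | $14,000,000/TB | $30/TB | 450,000 ÷ |
| Capacity | 100 MB | 3 TB | 30,000 X |
| Transfer rate | 4 MB/s | 200 MB/s | 50 X |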
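
The figures above can be checked with a little arithmetic. The sketch below recomputes the three ratios and adds one derived number that is not on the slide: how long a full sequential read of a single disk takes. Decimal units (1 TB = 1,000,000 MB) and all variable names are assumptions made for illustration.

```python
# Slide figures for a single disk, 1980 vs. "today" (2013).
MB_PER_TB = 1_000_000  # assumption: decimal units

disk_1980 = {"cost_per_tb": 14_000_000, "capacity_mb": 100, "mb_per_s": 4}
disk_today = {"cost_per_tb": 30, "capacity_mb": 3 * MB_PER_TB, "mb_per_s": 200}

# Ratios between the two columns of the slide.
cost_drop = disk_1980["cost_per_tb"] / disk_today["cost_per_tb"]
capacity_gain = disk_today["capacity_mb"] / disk_1980["capacity_mb"]
speed_gain = disk_today["mb_per_s"] / disk_1980["mb_per_s"]


def full_scan_seconds(disk):
    """Time to read the whole disk sequentially at its transfer rate."""
    return disk["capacity_mb"] / disk["mb_per_s"]


print(f"cost per TB fell about {cost_drop:,.0f}x (the slide rounds to 450,000)")
print(f"capacity grew {capacity_gain:,.0f}x, transfer rate only {speed_gain:,.0f}x")
print(f"full scan, 1980 disk:  {full_scan_seconds(disk_1980):,.0f} s")
print(f"full scan, today disk: {full_scan_seconds(disk_today) / 3600:.1f} h")
```

Capacity grew 600 times faster than transfer rate, so under these assumptions reading one full disk went from seconds to hours. That gap is the motivation for spreading data over more spindles.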

Page 5: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Solution: More Spindles

(Photo by Kheel Center, Cornell University)

Page 6: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Case Study – Foursquare

Page 7: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

The Challenge

“…Foursquare streams hundreds of millions of application logs each day. The company relies on analytics to report on its daily usage, evaluate new offerings, and perform long-term trend analysis—and with millions of new check-ins each day, the workload is only growing…”

Page 8: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

“Real” Project Requirements Example

Cost Analysis
  Data transfer
    • By date/time
    • By edge location
    • By date/time within an edge location
    • By top X URLs
    • By HTTP vs. HTTPS

Marketing
  Top URLs
    • As-is count
    • By content type
    • By edge location
    • By edge location and content type
  Requests served
    • By edge location
  Revenue
    • By edge location
  Top games
    • By age
    • By income
    • By gender

Operations
  Error rates
    • By top X URLs
    • By edge location
    • By edge location and content type

Revenue
  Top games
    • By revenue
    • By edge location and revenue
  Top ads
    • That lead to a game purchase

Page 9: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Viable Business

[Chart: Operation Costs and Revenues ($ Money) plotted against # Users]

Page 10: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Available Data Sources

| Metric | Sources |
|---|---|
| Data transfer by date/time | CloudFront logs |
| Data transfer by edge location | CloudFront logs |
| Data transfer by date/time within an edge location | CloudFront logs |
| Data transfer by top X URLs | CloudFront logs, web server logs |
| Data transfer by HTTP vs. HTTPS | CloudFront logs |
| Top URLs | CloudFront logs, web server logs |
| Top URLs by content type | CloudFront logs |
| Top URLs by edge location | CloudFront logs |
| Top URLs by edge location and content type | CloudFront logs |
| Error rates by top X URLs | CloudFront logs, web server logs |
| Error rate by edge location | CloudFront logs |
| Error rate by edge location and content type | CloudFront logs |
| Requests served by edge location | CloudFront logs |
| Revenue by edge location | CloudFront logs, OrdersDB, app server logs |
| Top games segmented by age | CloudFront logs, user profile |
| Top games segmented by income | CloudFront logs, user profile |
| Top games segmented by gender | CloudFront logs, user profile |
| Top games by revenue | CloudFront logs, OrdersDB |
| Top games by edge location and revenue | CloudFront logs, OrdersDB |
| Top game revenue segmented by age | CloudFront logs, OrdersDB, user profile |

Page 11: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

CloudFront Access Log Format

#Version: 1.0

#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query

2012-05-25 22:01:30 AMS1 4448 94.212.249.78 GET d1234567890213.cloudfront.net /YT0KthT/F5SOWdDPqNqQF07tiTOXqJMpfDdlb3LMwv3/jP3/CINm/yDSy0MsRcWJN/Simutrans.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625181

2012-05-25 22:01:30 AMS1 4952 94.212.249.78 GET d1234567890213.cloudfront.net /66IG584/CPCxY0P44BGb5ZOd3qSUrauL050LOvFwaMj/eH/caw/Blob Wars-Blob And Conquer.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625184

2012-05-25 22:01:30 AMS1 4556 78.8.5.135 GET d1234567890213.cloudfront.net /SwlufjC/xEjH3BRbXMXwmFWqzKt7od6tlWR3e13LhmH/V3eF/lo6g/AstroMenace.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625189

2012-05-25 22:01:30 AMS1 47172 78.8.5.135 GET d1234567890213.cloudfront.net /Di1cXoN/TskldkSHcgkvZXQEmv5vOVR25X5UTisFkRq/pQa/wCjUXZb/Z1HRuGlo/Kroz.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625206
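
As a rough illustration of how this format can be read programmatically, the sketch below turns log text into dictionaries keyed by the names on the `#Fields:` line. It assumes the files are tab-delimited, which is how the deck's own R and COPY examples treat them; the function name `parse_cloudfront_log` and the extra `query` key are made up for this example.

```python
from urllib.parse import parse_qsl, unquote


def parse_cloudfront_log(text):
    """Parse tab-delimited CloudFront access-log text into a list of dicts."""
    fields = None
    records = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # Column names come from the log's own header line.
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#") or not line.strip():
            continue  # "#Version" and blank lines carry no records
        else:
            if fields is None:
                raise ValueError("record found before the #Fields header")
            record = dict(zip(fields, line.split("\t")))
            record["sc-bytes"] = int(record["sc-bytes"])
            record["sc-status"] = int(record["sc-status"])
            # The user agent is percent-encoded in the log.
            record["cs(User-Agent)"] = unquote(record["cs(User-Agent)"])
            # Hypothetical convenience key: query string as a dict.
            record["query"] = dict(parse_qsl(record["cs-uri-query"]))
            records.append(record)
    return records


# First record from the slide, re-joined with tabs (assumed delimiter).
SAMPLE = "\n".join([
    "#Version: 1.0",
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) "
    "cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query",
    "\t".join([
        "2012-05-25", "22:01:30", "AMS1", "4448", "94.212.249.78", "GET",
        "d1234567890213.cloudfront.net",
        "/YT0KthT/F5SOWdDPqNqQF07tiTOXqJMpfDdlb3LMwv3/jP3/CINm/"
        "yDSy0MsRcWJN/Simutrans.exe",
        "200",
        "http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2",
        "Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;"
        "%20WOW64;%20Trident/5.0)",
        "uid=100&oid=108625181",
    ]),
])

for rec in parse_cloudfront_log(SAMPLE):
    print(rec["x-edge-location"], rec["sc-bytes"], rec["sc-status"], rec["query"])
```

Splitting on tabs rather than whitespace matters because object names such as "Blob Wars-Blob And Conquer.exe" contain spaces.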

Page 12: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Sample Your Data with R

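> library(ggplot2)  # provides ggplot() and geom_bar()
> # The log is tab-delimited; its two "#" header lines are read in as rows
> sample_data <- read.delim("SampleFiles/E123ABCDEF.2012-05-25-22.NEfbhLN3", header=F)
> sample_data <- sample_data[-1:-2,]  # drop the "#Version" and "#Fields" rows
> View(sample_data)
> # V9 is the sc-status column; geom_bar() counts a discrete variable
> # (recent ggplot2 releases reject geom_histogram() on a factor)
> m <- ggplot(sample_data, aes(x = factor(V9)))
> m + geom_bar() + scale_y_log10() + xlab('Error Codes') + ylab('log(Frequency)')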

Page 13: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Need a Lot of Memory?

Page 14: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

OpenRefine Running on an EC2 Instance

Page 15: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

[Diagram: data warehouse architecture. Labels: Web, CRM, DB, Logs, OLTP, OLTP, ETL, OLAP, Data Warehouse, Analyst]

Page 16: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 17: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 18: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 19: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Log Shipping

(Swedish public domain photo taken in 1918)

Page 20: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

“Poor Man’s Log Shipping”

Page 21: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 22: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Embedding Poor-man Invisible Pixel

http://www.poor-man-analytics.com/__track.gif?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban.com&utmcs=UTF-8&utmsr=1440x900&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.3%20r181&utmdt=%E8%B1%86%E7%93%A3&utmhid=571356425&utmr=-&utmp=%2F&utmac=UA-7019765-1&utmcc=__utma%3D30149280.1785629903.1314674330.1315290610.1315452707.10%3B%2B__utmz%3D30149280.1315452707.10.7.utmcsr%3Dbiaodianfu.com%7Cutmccn%3D(referral)%7Cutmcmd%3Dreferral%7Cutmcct%3D%2Fpoor-man-analytics-architecture.html%3B%2B__utmv%3D30149280.162%3B&utmu=qBM~
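
The pixel works because every parameter in the request URL ends up in the access log. As a minimal sketch of how those parameters could be recovered later, the snippet below decodes an abbreviated copy of the URL above with Python's standard library (the long `utmcc` cookie value and the trailing `utmu` are left out for readability). The helper name `pixel_params` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Abbreviated copy of the slide's URL (utmcc and utmu omitted).
PIXEL_URL = (
    "http://www.poor-man-analytics.com/__track.gif"
    "?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban.com&utmcs=UTF-8"
    "&utmsr=1440x900&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.3%20r181"
    "&utmdt=%E8%B1%86%E7%93%A3&utmhid=571356425&utmr=-&utmp=%2F"
    "&utmac=UA-7019765-1"
)


def pixel_params(url):
    """Return the tracking parameters of a pixel request as a flat dict."""
    query = urlsplit(url).query
    # parse_qs percent-decodes values and returns a list per key;
    # each key appears once here, so keep the first value.
    return {key: values[0] for key, values in parse_qs(query).items()}


params = pixel_params(PIXEL_URL)
print(params["utmhn"], params["utmsr"], params["utmp"], params["utmdt"])
```

The same decoding can be applied to the `cs-uri-query` field of a CloudFront log line when the pixel is served through a distribution.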

Page 23: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 24: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 25: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Open Source Frameworks

• Fluentd
• Flume
• Scribe
• Chukwa

Fluentd ASCII Diagrams

      Input                              Output
    +--------------------------------------------+
    |                                            |
    |  Web Apps  ---+                 +-->  File |
    |               |                 |          |
    |               +-->           ---+          |
    |  /var/log  ------>  Fluentd  ------>  Mail |
    |               +-->           ---+          |
    |               |                 |          |
    |  Apache    ---+                 +-->  S3   |
    |                                            |
    +--------------------------------------------+

    Web Server
    +---------+
    | Fluentd -------+
    +---------+      |
                     |
    Proxy Server     |
    +---------+      +--> +---------+
    | Fluentd ----------> | Fluentd |
    +---------+      +--> +---------+
                     |
    Database Server  |
    +---------+      |
    | Fluentd -------+
    +---------+

Page 26: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Use Amazon Kinesis to Ship Your Logs

New

Page 27: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 28: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Aggregation with S3DistCp

• Aggregated
• Even-size
• Compressed

Page 29: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

S3DistCp on EMR Job Sample

./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args \
'--src,s3://myawsbucket/cf,\
--dest,s3://myoutputbucket/aggregate,\
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,\
--targetSize,128,\
--outputCodec,lzo,\
--deleteOnSuccess'
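
To see what the `--groupBy` expression selects, the sketch below applies the same regular expression to a few object keys in Python. The key names are invented for illustration, and the code only mimics the grouping idea (files whose capture group is equal belong together); it does not reproduce S3DistCp's own behavior for sizing, compression, or non-matching files.

```python
import re
from collections import defaultdict

# Same pattern as the --groupBy argument above.
GROUP_BY = re.compile(r".*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*")


def group_files(keys):
    """Group object keys by the text captured by the groupBy pattern."""
    groups = defaultdict(list)
    for key in keys:
        match = GROUP_BY.fullmatch(key)
        if match:  # in this sketch, non-matching keys are simply ignored
            groups[match.group(1)].append(key)
    return dict(groups)


# Hypothetical CloudFront log objects: distribution ID, date-hour, unique suffix.
KEYS = [
    "cf/XABCD12345678.2012-05-25-22.NEfbhLN3.gz",
    "cf/XABCD12345678.2012-05-25-22.a1B2c3D4.gz",
    "cf/XABCD12345678.2012-05-25-23.Zx9Yw8Vu.gz",
    "cf/README.txt",
]

for hour, members in group_files(KEYS).items():
    print(hour, len(members))
```

With this pattern the capture group is the `YYYY-MM-DD-HH` part of the file name, so many small per-hour log files collapse into one group per hour.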

Page 30: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 31: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 32: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Pig for Access Logs Analysis

-- Load and Filter (cat / grep)
RAW_LOG = LOAD 's3://myoutputbucket/aggregate/' AS (ts:chararray, url:chararray…);
LOGS_BASE_F = FILTER RAW_LOG BY url MATCHES '^GET /__track.*$';

-- Parse (awk)
LOGS_BASE_F_W_PARAM = FOREACH LOGS_BASE_F GENERATE
    url,
    DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z') AS dt,
    SUBSTRING(DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z'), 0, 10) AS day,
    status,
    REGEX_EXTRACT(url, '^GET /([^\\?]+)', 1) AS action: chararray,
    REGEX_EXTRACT(url, 'idt=([^&]+)', 1) AS idt: chararray,
    REGEX_EXTRACT(url, 'idc=([^&]+)', 1) AS idc: chararray;
I1 = FILTER LOGS_BASE_F_W_PARAM BY action == 'clic' OR action == 'display';
LOGS_SHORT = FOREACH I1 GENERATE uuid, action, dt, day, ida, idas, act, idp, idcmp, idc;
G1 = GROUP LOGS_SHORT BY (uuid, idc);

-- Store (>)
STORE G1 INTO 's3://mybucket/sessions/';
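
For readers who do not know Pig, the filter, parse, and group steps can be sketched in plain Python on a handful of lines. This is only an analogy under stated assumptions: the request strings are invented, only the `idt` and `idc` fields are extracted, grouping is by `idc` alone, and `regex_extract` is a rough stand-in written for this example, not Pig's implementation.

```python
import re
from collections import defaultdict


def regex_extract(text, pattern, index):
    """Rough stand-in for REGEX_EXTRACT: the captured group, or None."""
    match = re.search(pattern, text)
    return match.group(index) if match else None


def parse_tracking_requests(requests):
    """Keep tracking-pixel requests and pull out their idt/idc parameters."""
    parsed = []
    for url in requests:
        # Like FILTER ... MATCHES: the pattern must cover the whole string.
        if not re.fullmatch(r"^GET /__track.*$", url):
            continue
        parsed.append({
            "url": url,
            "idt": regex_extract(url, r"idt=([^&]+)", 1),
            "idc": regex_extract(url, r"idc=([^&]+)", 1),
        })
    return parsed


def group_by_idc(records):
    """Like GROUP ... BY: collect records that share the same idc value."""
    groups = defaultdict(list)
    for record in records:
        groups[record["idc"]].append(record)
    return dict(groups)


# Hypothetical request strings in the shape the filter expects.
REQUESTS = [
    "GET /__track.gif?idt=5.1.5&idc=5&utmn=1532897343",
    "GET /index.html",
    "GET /__track.gif?idt=5.1.5&idc=7",
    "GET /__track.gif?idt=5.1.6&idc=5",
]

groups = group_by_idc(parse_tracking_requests(REQUESTS))
for idc, members in sorted(groups.items()):
    print(idc, [m["idt"] for m in members])
```

The Pig version expresses the same pipeline declaratively, which lets EMR run each step in parallel over the aggregated files.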

Page 33: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Pig vs. Hive

• Pig is geared toward sequentially transforming data

– ETL

– Shell at scale (from local mode to any scale)

• Hive is for querying data

– Data analysis / HQL

– Some transformation, typically as a means to an end (e.g., temporary tables)

Page 34: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Monitoring Pig

https://github.com/netflix/lipstick

Page 35: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Another Monitoring Tool

https://github.com/twitter/ambrose

Page 36: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Optimize Your EMR Cluster

Page 37: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Monitor Your EMR Cluster

Page 38: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Bootstrap Actions

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia

Page 39: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Management Console

Page 40: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 41: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Customer Tools

Gathers information about EMR jobs from multiple sources and presents it in textual and graphical views

github.com/Hi-Media/EmrMonitoring

Page 42: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Completed Job View

Page 43: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 44: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 45: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Spot Bidding Strategies

• Most savings

• Not paying more

• Fewer interruptions

Page 46: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Jeff Bezos (early Amazon days)

Page 47: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Data Sources

Queries

Value

Page 48: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

More Trends to Consider

| Transactional Processing | Analytical Processing |
|---|---|
| Transactional context | Global context |
| Latency | Throughput |
| Indexed access | Full table scans |
| Random IO | Sequential IO |
| Disk seek times | Disk transfer rate |
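
The last two contrasts above (random vs. sequential IO, seek time vs. transfer rate) can be made concrete with a small calculation. The sketch below is a back-of-the-envelope estimate only: the 10 ms seek time and 4 KiB page size are assumptions chosen for illustration, while 200 MB/s is the "today" transfer rate quoted earlier in the deck.

```python
SEEK_SECONDS = 0.010                       # assumption: about 10 ms per random seek
PAGE_BYTES = 4 * 1024                      # assumption: 4 KiB fetched per random read
SEQUENTIAL_BYTES_PER_S = 200 * 1_000_000   # 200 MB/s, from the deck's "today" disk


def random_read_bytes_per_s(seek_seconds, page_bytes):
    """Throughput when every page costs one seek (transfer time ignored)."""
    return page_bytes / seek_seconds


random_rate = random_read_bytes_per_s(SEEK_SECONDS, PAGE_BYTES)
ratio = SEQUENTIAL_BYTES_PER_S / random_rate

print(f"random reads:     {random_rate / 1_000_000:.2f} MB/s")
print(f"sequential reads: {SEQUENTIAL_BYTES_PER_S / 1_000_000:.0f} MB/s")
print(f"sequential is roughly {ratio:.0f}x faster under these assumptions")
```

Under these assumed numbers a workload dominated by seeks moves well under 1 MB/s, which is why analytical systems favor full sequential scans over indexed lookups.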

Page 49: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 50: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 51: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

COPY into Amazon Redshift

create table cf_logs
( d date, t char(8), edge char(4), bytes int, cip varchar(15),
  verb varchar(10),  -- char(3) fits GET but would reject HEAD
  distro varchar(MAX), object varchar(MAX), status int,
  Referer varchar(MAX), agent varchar(MAX), qs varchar(MAX) );

copy cf_logs from 's3://big-data/logs/E123ABCDEF/'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
IGNOREHEADER 2
GZIP
DELIMITER '\t'
DATEFORMAT 'YYYY-MM-DD';
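
Once the rows are loaded, the metrics from the requirements slide become simple aggregate queries. The sketch below illustrates two of them (data transfer by edge location and by date/hour) on the four sample log records shown earlier. It uses an in-memory SQLite database as a stand-in so that it runs anywhere; that substitution, the trimmed column list, and the function name `load` are assumptions for illustration, and Redshift's SQL dialect differs in places.

```python
import sqlite3

# (d, t, edge, bytes, cip, verb, status) taken from the sample log records.
ROWS = [
    ("2012-05-25", "22:01:30", "AMS1", 4448, "94.212.249.78", "GET", 200),
    ("2012-05-25", "22:01:30", "AMS1", 4952, "94.212.249.78", "GET", 200),
    ("2012-05-25", "22:01:30", "AMS1", 4556, "78.8.5.135", "GET", 200),
    ("2012-05-25", "22:01:30", "AMS1", 47172, "78.8.5.135", "GET", 200),
]


def load(rows):
    """Create a trimmed-down cf_logs table in memory and insert the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table cf_logs (d text, t text, edge text, bytes int,"
        " cip text, verb text, status int)"
    )
    conn.executemany("insert into cf_logs values (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


conn = load(ROWS)

# Cost analysis: data transfer by edge location.
transfer_by_edge = conn.execute(
    "select edge, sum(bytes) from cf_logs group by edge order by edge"
).fetchall()

# Cost analysis: data transfer by date and hour.
transfer_by_hour = conn.execute(
    "select d, substr(t, 1, 2) as hour, sum(bytes) from cf_logs"
    " group by d, hour order by d, hour"
).fetchall()

print(transfer_by_edge)
print(transfer_by_hour)
```

Metrics that need other sources, such as revenue by edge location, would add joins against order and user-profile tables loaded the same way.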

Page 52: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

COPY into Amazon Redshift with AWS Data Pipeline

Page 53: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Time for Data Visualization

Charles Minard's flow map of Napoleon's March (1869)

Page 54: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 55: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Choose Your Favorite Visualization Tool

Tableau (Windows instance)

R

Jaspersoft

QlikView

MicroStrategy

SiSense

Page 56: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 57: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 58: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 59: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Snapshot before Delete

Page 60: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Unload Data from Amazon Redshift

unload ('select * from cf_logs where d between ''2013-11-03'' and ''2013-11-10''')
to 's3://mybucket/unload_cf_logs_week_46'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
delimiter as '\t'
GZIP;

Page 61: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Reference Architecture

Page 62: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Partner Services

Loggly

Splunk

Stratalux (Logstash)

Loggly AWS Marketplace Page

Page 63: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

What Else Can You Do with Log Analysis?

Page 64: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 65: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Finally, a Small Warning

Abraham Wald (1902-1950)

Page 66: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

A B C

Page 67: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
Page 68: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Would You Like to Know More?

Further reading

• http://aws.amazon.com/architecture

• http://aws.amazon.com/articles

• http://aws.typepad.com

re:Invent sessions

• DAT205 - Amazon Redshift in Action: Enterprise, Big Data, and SaaS

• DAT305 - Getting Maximum Performance from Amazon Redshift

• BDT301 - Scaling your Analytics with Amazon Elastic MapReduce

Page 69: Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

ARC306