
Big data on_aws in korea by abhishek sinha (lunch and learn)

Page 1: Big data on_aws in korea by abhishek sinha (lunch and learn)

Big Data Analytics

Abhishek Sinha

Business Development Manager,

AWS

@abysinha

[email protected]

Page 2: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 3: Big data on_aws in korea by abhishek sinha (lunch and learn)

An engineer’s definition

When your data sets become so large that you have to start innovating in how you collect, store, organize, analyze, and share them

Page 4: Big data on_aws in korea by abhishek sinha (lunch and learn)

What does big data look like ?

Page 5: Big data on_aws in korea by abhishek sinha (lunch and learn)

Volume

Velocity

Variety

3Vs

Page 6: Big data on_aws in korea by abhishek sinha (lunch and learn)

Where is this data coming from ?

Page 7: Big data on_aws in korea by abhishek sinha (lunch and learn)

Human generated

Machine generated

Tweet

Surf the internet

Buy and sell products

Upload images and videos

Play games

Check in at restaurants

Search for cafes

Find deals

Watch content online

Look for directions

Use social media

Page 8: Big data on_aws in korea by abhishek sinha (lunch and learn)

Human generated

Machine generated

Networks and security devices

Mobile phones

Cell phone towers

Smart grids

Smart meters

Telematics from cars

Sensors on machines

Videos from traffic and security cameras

Page 9: Big data on_aws in korea by abhishek sinha (lunch and learn)

What are people using this for ?

Page 10: Big data on_aws in korea by abhishek sinha (lunch and learn)

Big Data Verticals and Use cases

Media/Advertising

Targeted Advertising

Image and Video Processing

Oil & Gas

Seismic Analysis

Retail

Recommendations

Transactions Analysis

Life Sciences

Genome Analysis

Financial Services

Monte Carlo Simulations

Risk Analysis

Security

Anti-virus

Fraud Detection

Image Recognition

Social Network/Gaming

User Demographics

Usage analysis

In-game metrics

Page 11: Big data on_aws in korea by abhishek sinha (lunch and learn)

Why is big data hard ?

Page 12: Big data on_aws in korea by abhishek sinha (lunch and learn)

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Page 13: Big data on_aws in korea by abhishek sinha (lunch and learn)

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Lower cost,

higher throughput

Page 14: Big data on_aws in korea by abhishek sinha (lunch and learn)

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Highly

constrained

Lower cost,

higher throughput

Page 15: Big data on_aws in korea by abhishek sinha (lunch and learn)

Big gap in turning data into actionable information

Page 16: Big data on_aws in korea by abhishek sinha (lunch and learn)

Amazon Web Services helps remove constraints

Page 17: Big data on_aws in korea by abhishek sinha (lunch and learn)

Big Data + Cloud = Awesome Combination

Big data:

• Potentially massive datasets

• Iterative, experimental style of data manipulation and analysis

• Frequently not a steady-state workload; peaks and valleys

• Data is a combination of structured and unstructured data in many formats

AWS Cloud:

• Massive, virtually unlimited capacity

• Iterative, experimental style of infrastructure deployment/usage

• At its most efficient with highly variable workloads

• Tools for managing structured and unstructured data

Page 18: Big data on_aws in korea by abhishek sinha (lunch and learn)

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Page 19: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 20: Big data on_aws in korea by abhishek sinha (lunch and learn)

Data size

• Global reach

• Native app for almost every smartphone, SMS, web, mobile-web

• 10M+ users, 15M+ venues, ~1B check-ins

• Terabytes of log data

Page 21: Big data on_aws in korea by abhishek sinha (lunch and learn)

Stack

Application Stack

Scala/Liftweb: API Machines, WWW Machines, Batch Jobs

Scala Application code

Mongo/Postgres/Flat Files: Databases, Logs

Data Stack

Amazon S3: Database Dumps, Log Files

Hadoop: Elastic MapReduce

Hive/Ruby/Mahout: Analytics Dashboard, MapReduce Jobs

mongoexport, postgres dump, Flume

Page 22: Big data on_aws in korea by abhishek sinha (lunch and learn)

Stack – Front end Application

Application Stack

Scala/Liftweb: API Machines, WWW Machines, Batch Jobs

Scala Application code

Mongo/Postgres/Flat Files: Databases, Logs

Data Stack

Amazon S3: Database Dumps, Log Files

Hadoop: Elastic MapReduce

Hive/Ruby/Mahout: Analytics Dashboard, MapReduce Jobs

mongoexport, postgres dump, Flume

Page 23: Big data on_aws in korea by abhishek sinha (lunch and learn)

Stack – Collection and Storage

Application Stack

Scala/Liftweb: API Machines, WWW Machines, Batch Jobs

Scala Application code

Mongo/Postgres/Flat Files: Databases, Logs

Data Stack

Amazon S3: Database Dumps, Log Files

Hadoop: Elastic MapReduce

Hive/Ruby/Mahout: Analytics Dashboard, MapReduce Jobs

mongoexport, postgres dump, Flume

Page 24: Big data on_aws in korea by abhishek sinha (lunch and learn)

Stack – analysis and sharing

Application Stack

Scala/Liftweb: API Machines, WWW Machines, Batch Jobs

Scala Application code

Mongo/Postgres/Flat Files: Databases, Logs

Data Stack

Amazon S3: Database Dumps, Log Files

Hadoop: Elastic MapReduce

Hive/Ruby/Mahout: Analytics Dashboard, MapReduce Jobs

mongoexport, postgres dump, Flume

Page 25: Big data on_aws in korea by abhishek sinha (lunch and learn)

Users Over Time

Page 26: Big data on_aws in korea by abhishek sinha (lunch and learn)

“Who is using our service?”

Page 27: Big data on_aws in korea by abhishek sinha (lunch and learn)

Identified early mobile usage

Invested heavily in mobile development

Finding signal in the noise of logs

Page 28: Big data on_aws in korea by abhishek sinha (lunch and learn)

9,432,061 unique mobile devices

used the Yelp mobile app.

4 million+ calls. 5 million+ directions.

In January 2013

Page 29: Big data on_aws in korea by abhishek sinha (lunch and learn)

Autocomplete Search

Recommendations

Automatic spelling

corrections

Page 30: Big data on_aws in korea by abhishek sinha (lunch and learn)

“What kind of movies do people like?”

Page 31: Big data on_aws in korea by abhishek sinha (lunch and learn)

More than 25 Million Streaming Members

50 Billion Events Per Day

30 Million plays every day

2 billion hours of video in 3 months

4 million ratings per day

3 million searches

Device location, time, day, week, etc.

Social data

Page 32: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 33: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 34: Big data on_aws in korea by abhishek sinha (lunch and learn)

10 TB of streaming data per day

Page 35: Big data on_aws in korea by abhishek sinha (lunch and learn)

Data consumed in multiple ways

S3

EMR

Prod Cluster (EMR)

Recommendation Engine

Ad-hoc Analysis

Personalization

Page 36: Big data on_aws in korea by abhishek sinha (lunch and learn)

AWS Import/Export

Corporate data center

Amazon Elastic MapReduce

Amazon Simple Storage Service (S3)

BI Users

Clickstream data from 500+ websites and VoD platform

Page 37: Big data on_aws in korea by abhishek sinha (lunch and learn)

“Who buys video games?”

Page 38: Big data on_aws in korea by abhishek sinha (lunch and learn)

Who is Razorfish

• Full service Digital Agency

• Developed an Ad-Serving Platform compatible with most browsers

• Clickstream analysis of data, current and historical trends, and segmentation of users

• Segmentation is used to serve ads and cross sell

• 45TB of Log data

• Problems at scale

– Giant Datasets

– Building Infrastructure requires large continuous investment

– Build for peak holiday season

– Traditional Data stores are not scaling

Page 39: Big data on_aws in korea by abhishek sinha (lunch and learn)

Per day:

3.5 billion records

13 TB of click stream logs

71 million unique cookies

Page 40: Big data on_aws in korea by abhishek sinha (lunch and learn)

Previously in 2009

Page 41: Big data on_aws in korea by abhishek sinha (lunch and learn)

Today

Page 42: Big data on_aws in korea by abhishek sinha (lunch and learn)

Today

Page 43: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 44: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 45: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 46: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 47: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 48: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 49: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 50: Big data on_aws in korea by abhishek sinha (lunch and learn)

This happens in 8 hours every day

Page 51: Big data on_aws in korea by abhishek sinha (lunch and learn)

Why AWS + EMR

• Perfect clarity of cost

• No upfront infrastructure investment

• No client processing contention

• Without EMR/Hadoop it takes 3 days; with EMR, 8 hours

– Scalability: 1 node x 100 hours = 100 nodes x 1 hour

• Meet SLA

Page 52: Big data on_aws in korea by abhishek sinha (lunch and learn)

Playfish improves in-game experience for its users through data mining

Challenge: Must understand player usage trends across 50M monthly users, multiple platforms, and tens of games, in the face of rapid growth. This drives in-game improvements and defines which games to target next.

Solution: EMR gives Playfish the flexibility to experiment and rapidly ask new questions. All usage data is stored in S3, and analysts run ad-hoc Hive queries that can slice the data by time, game, and user.

Page 53: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 54: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 55: Big data on_aws in korea by abhishek sinha (lunch and learn)

Data Driven Game Design

Data is being used to understand what gamers are doing inside the game (behavioral analysis)

- What features people like (rely on data instead of forum posts)

- What features are abandoned

- A/B testing

- Monetization – In Game Analytics

Page 56: Big data on_aws in korea by abhishek sinha (lunch and learn)

Building a big data architecture

Design Patterns

Page 57: Big data on_aws in korea by abhishek sinha (lunch and learn)

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Page 58: Big data on_aws in korea by abhishek sinha (lunch and learn)

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Page 59: Big data on_aws in korea by abhishek sinha (lunch and learn)

Getting your Data into AWS

Amazon S3

Corporate Data Center

• Console Upload

• FTP

• AWS Import Export

• S3 API

• Direct Connect

• Storage Gateway

• 3rd Party Commercial Apps

• Tsunami UDP

1

Page 60: Big data on_aws in korea by abhishek sinha (lunch and learn)

Write directly to a data source

Your application Amazon S3

DynamoDB

Any other data store

Amazon S3

Amazon EC2

2

Page 61: Big data on_aws in korea by abhishek sinha (lunch and learn)

Queue, pre-process and then write to data source

Amazon Simple Queue Service (SQS)

Amazon S3

DynamoDB

Any other data store

3
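The queue-then-write pattern on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `queue.Queue` stands in for Amazon SQS, a plain list stands in for the data store (S3, DynamoDB, or any other), and the event fields `user` and `action` are hypothetical.

```python
import json
import queue


def preprocess(raw_event):
    """Normalize one raw JSON event; return None for events with no user."""
    event = json.loads(raw_event)
    if "user" not in event:
        return None
    return {
        "user": event["user"].lower(),
        "action": event.get("action", "unknown"),
    }


def drain(event_queue, data_store):
    """Pull every queued event, pre-process it, and append it to the store."""
    written = 0
    while True:
        try:
            raw = event_queue.get_nowait()
        except queue.Empty:
            break
        record = preprocess(raw)
        if record is not None:
            data_store.append(record)
            written += 1
    return written


# In-memory stand-ins: queue.Queue plays the role of SQS, the list the data store.
events = queue.Queue()
for raw in (
    '{"user": "Alice", "action": "checkin"}',
    '{"action": "orphan"}',
    '{"user": "BOB"}',
):
    events.put(raw)

store = []
written = drain(events, store)
```

The queue decouples producers from the pre-processing workers, so a burst of events waits in the queue instead of overloading the data store.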

Page 62: Big data on_aws in korea by abhishek sinha (lunch and learn)

Agency Customer: Video Analytics on AWS

Elastic Load Balancer

Edge Servers on EC2

Workers on EC2

Logs

Reports

HDFS Cluster

Amazon Simple Queue Service (SQS)

Amazon Simple Storage Service (S3)

Amazon Elastic MapReduce

Page 63: Big data on_aws in korea by abhishek sinha (lunch and learn)

Aggregate and write to data source

Flume running

on EC2

Amazon S3

Any other data store

HDFS

4

Page 64: Big data on_aws in korea by abhishek sinha (lunch and learn)

What is Flume

• Collection and aggregation of streaming event data

– Typically used for log data, sensor data, GPS data, etc.

• Significant advantages over ad-hoc solutions

– Reliable, Scalable, Manageable, Customizable and High Performance

– Declarative, Dynamic Configuration

– Contextual Routing

– Feature rich

– Fully extensible

Page 65: Big data on_aws in korea by abhishek sinha (lunch and learn)

Typical Aggregation Flow

[Client]+ Agent [Agent]* Destination

Flume uses a multi-tier approach in which multiple agents can send data to another agent that acts as an aggregator. Each agent can receive data from either a client or another agent, and can send it on to another agent or a sink.
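The multi-tier flow described here can be simulated without Flume itself. The sketch below is not the Flume API: the `Agent` and `Sink` classes, the batch sizes, and the sample log lines are all hypothetical, chosen only to show edge agents forwarding batches to an aggregator that writes to a sink.

```python
class Sink:
    """Final destination; here just an in-memory list."""

    def __init__(self):
        self.events = []

    def receive(self, event):
        self.events.append(event)


class Agent:
    """Buffers events and forwards them in batches to the next hop."""

    def __init__(self, name, next_hop, batch_size=2):
        self.name = name
        self.next_hop = next_hop
        self.batch_size = batch_size
        self.buffer = []

    def receive(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Hand the buffered events to the next tier, then clear the buffer.
        pending, self.buffer = self.buffer, []
        for event in pending:
            self.next_hop.receive(event)


# Two edge agents feed one aggregator, which feeds the sink.
sink = Sink()
aggregator = Agent("aggregator", sink, batch_size=3)
edge_agents = [Agent("edge-%d" % i, aggregator) for i in range(2)]

log_lines = ["GET /a", "GET /b", "GET /c", "GET /d", "GET /e"]
for i, line in enumerate(log_lines):
    edge_agents[i % 2].receive(line)

# Flush whatever is still buffered, edge tier first.
for agent in edge_agents + [aggregator]:
    agent.flush()
```

Every event reaches the sink exactly once, but arrival order depends on when each tier flushes, which is typical of batched aggregation.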

Page 67: Big data on_aws in korea by abhishek sinha (lunch and learn)

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NO SQL Store

Log Aggregation tools

Choose depending upon design

Page 68: Big data on_aws in korea by abhishek sinha (lunch and learn)

Choice of storage systems (Structure and Volume)

[Chart: choice of storage system by structure (low to high) and data size (small to large). Systems shown: S3, RDS, DynamoDB, NoSQL, EBS]

1

Page 69: Big data on_aws in korea by abhishek sinha (lunch and learn)

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Page 70: Big data on_aws in korea by abhishek sinha (lunch and learn)

Hadoop based Analysis

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NO SQL Store

Log Aggregation tools

Amazon EMR

Page 71: Big data on_aws in korea by abhishek sinha (lunch and learn)

EMR is Hadoop in the Cloud

What is Amazon Elastic MapReduce (EMR)?

Page 72: Big data on_aws in korea by abhishek sinha (lunch and learn)

A framework

Splits data into pieces

Lets processing occur

Gathers the results

Page 73: Big data on_aws in korea by abhishek sinha (lunch and learn)

distributed computing

Page 74: Big data on_aws in korea by abhishek sinha (lunch and learn)

[Chart: Difficulty versus Number of Machines; x-axis starts at 1, with marked point 1]

Page 75: Big data on_aws in korea by abhishek sinha (lunch and learn)

[Chart: Difficulty versus Number of Machines; x-axis from 1 to 10^6, with marked points 1 and 2]

Page 76: Big data on_aws in korea by abhishek sinha (lunch and learn)

[Chart: Difficulty versus Number of Machines; x-axis from 1 to 10^6, with marked points 1 and 2]

Page 77: Big data on_aws in korea by abhishek sinha (lunch and learn)

distributed computing is hard

Page 78: Big data on_aws in korea by abhishek sinha (lunch and learn)

distributed computing requires god-like engineers

Page 79: Big data on_aws in korea by abhishek sinha (lunch and learn)

Innovation #1:

Page 80: Big data on_aws in korea by abhishek sinha (lunch and learn)

Hadoop is… The MapReduce computational paradigm

Page 81: Big data on_aws in korea by abhishek sinha (lunch and learn)

Hadoop is… The MapReduce computational paradigm … implemented as an Open-source, Scalable, Fault-tolerant, Distributed System

Page 82: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person   Start     End
Bob      00:44:48  00:45:11
Charlie  02:16:02  02:16:18
Charlie  11:16:59  11:17:17
Charlie  11:17:24  11:17:38
Bob      11:23:10  11:23:25
Alice    16:26:46  16:26:54
David    17:20:28  17:20:45
Alice    18:16:53  18:17:00
Charlie  19:33:44  19:33:59
Bob      21:13:32  21:13:43
David    22:36:22  22:36:34
Alice    23:42:01  23:42:11

Page 83: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person   Start     End       Duration
Bob      00:44:48  00:45:11
Charlie  02:16:02  02:16:18
Charlie  11:16:59  11:17:17
Charlie  11:17:24  11:17:38
Bob      11:23:10  11:23:25
Alice    16:26:46  16:26:54
David    17:20:28  17:20:45
Alice    18:16:53  18:17:00
Charlie  19:33:44  19:33:59
Bob      21:13:32  21:13:43
David    22:36:22  22:36:34
Alice    23:42:01  23:42:11

Page 84: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person   Start     End       Duration
Bob      00:44:48  00:45:11  23
Charlie  02:16:02  02:16:18
Charlie  11:16:59  11:17:17
Charlie  11:17:24  11:17:38
Bob      11:23:10  11:23:25
Alice    16:26:46  16:26:54
David    17:20:28  17:20:45
Alice    18:16:53  18:17:00
Charlie  19:33:44  19:33:59
Bob      21:13:32  21:13:43
David    22:36:22  22:36:34
Alice    23:42:01  23:42:11

Page 85: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person   Start     End       Duration
Bob      00:44:48  00:45:11  23
Charlie  02:16:02  02:16:18  16
Charlie  11:16:59  11:17:17
Charlie  11:17:24  11:17:38
Bob      11:23:10  11:23:25
Alice    16:26:46  16:26:54
David    17:20:28  17:20:45
Alice    18:16:53  18:17:00
Charlie  19:33:44  19:33:59
Bob      21:13:32  21:13:43
David    22:36:22  22:36:34
Alice    23:42:01  23:42:11

Page 86: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person   Start     End       Duration
Bob      00:44:48  00:45:11  23
Charlie  02:16:02  02:16:18  16
Charlie  11:16:59  11:17:17  18
Charlie  11:17:24  11:17:38  14
Bob      11:23:10  11:23:25  15
Alice    16:26:46  16:26:54  8
David    17:20:28  17:20:45  17
Alice    18:16:53  18:17:00  7
Charlie  19:33:44  19:33:59  15
Bob      21:13:32  21:13:43  11
David    22:36:22  22:36:34  12
Alice    23:42:01  23:42:11  10

Page 87: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person   Duration
Bob      23
Charlie  16
Charlie  18
Charlie  14
Bob      15
Alice    8
David    17
Alice    7
Charlie  15
Bob      11
David    12
Alice    10

Page 88: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person   Duration
Bob      23
Charlie  16
Charlie  18
Charlie  14
Bob      15
Alice    8
David    17
Alice    7
Charlie  15
Bob      11
David    12
Alice    10

Person   Start     End
Bob      00:44:48  00:45:11
Charlie  02:16:02  02:16:18
Charlie  11:16:59  11:17:17
Charlie  11:17:24  11:17:38
Bob      11:23:10  11:23:25
Alice    16:26:46  16:26:54
David    17:20:28  17:20:45
Alice    18:16:53  18:17:00
Charlie  19:33:44  19:33:59
Bob      21:13:32  21:13:43
David    22:36:22  22:36:34
Alice    23:42:01  23:42:11

map

Page 89: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person   Duration
Bob      23
Charlie  16
Charlie  18
Charlie  14
Bob      15
Alice    8
David    17
Alice    7
Charlie  15
Bob      11
David    12
Alice    10

Page 90: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person   Duration
Alice    8
Alice    7
Alice    10
Bob      23
Bob      15
Bob      11
Charlie  16
Charlie  18
Charlie  14
Charlie  15
David    12
David    17

Page 91: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person Total

Alice 25

Person   Duration
Alice    8
Alice    7
Alice    10
Bob      23
Bob      15
Bob      11
Charlie  16
Charlie  18
Charlie  14
Charlie  15
David    12
David    17

Page 92: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person   Duration
Alice    8
Alice    7
Alice    10
Bob      23
Bob      15
Bob      11
Charlie  16
Charlie  18
Charlie  14
Charlie  15
David    12
David    17

Person Total

Bob 49

Alice 25

Page 93: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person Total

Charlie 63

Bob 49

Alice 25

Person   Duration
Alice    8
Alice    7
Alice    10
Bob      23
Bob      15
Bob      11
Charlie  16
Charlie  18
Charlie  14
Charlie  15
David    12
David    17

Page 94: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person Total

David 29

Charlie 63

Bob 49

Alice 25

Person   Duration
Alice    8
Alice    7
Alice    10
Bob      23
Bob      15
Bob      11
Charlie  16
Charlie  18
Charlie  14
Charlie  15
David    12
David    17

Page 95: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person Total

David 29

Charlie 63

Bob 49

Alice 25

Page 96: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person   Total
Alice    25
Bob      49
Charlie  63
David    29

Person   Duration
Alice    8
Alice    7
Alice    10
Bob      23
Bob      15
Bob      11
Charlie  16
Charlie  18
Charlie  14
Charlie  15
David    12
David    17

reduce

Page 97: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person   Start     End
Bob      00:44:48  00:45:11
Charlie  02:16:02  02:16:18
Charlie  11:16:59  11:17:17
Charlie  11:17:24  11:17:38
Bob      11:23:10  11:23:25
Alice    16:26:46  16:26:54
David    17:20:28  17:20:45
Alice    18:16:53  18:17:00
Charlie  19:33:44  19:33:59
Bob      21:13:32  21:13:43
David    22:36:22  22:36:34
Alice    23:42:01  23:42:11

Page 98: Big data on_aws in korea by abhishek sinha (lunch and learn)

Person   Duration
Alice    8
Alice    7
Alice    10
Bob      23
Bob      15
Bob      11
Charlie  16
Charlie  18
Charlie  14
Charlie  15
David    12
David    17

Page 99: Big data on_aws in korea by abhishek sinha (lunch and learn)

map: Works on one record. In this case it does “end time minus start time”. Runs in parallel over all the records.

reduce: Groups together common records (e.g. “Alice”, “Bob”) and adds up all the results.
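The walkthrough on the preceding slides can be reproduced in plain Python. This is a single-process sketch of the map, group, and reduce steps using the call records from the slides; it is not Hadoop code, and the function names `map_call`, `shuffle`, and `reduce_durations` are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Call records from the slides: (person, start, end).
CALLS = [
    ("Bob", "00:44:48", "00:45:11"),
    ("Charlie", "02:16:02", "02:16:18"),
    ("Charlie", "11:16:59", "11:17:17"),
    ("Charlie", "11:17:24", "11:17:38"),
    ("Bob", "11:23:10", "11:23:25"),
    ("Alice", "16:26:46", "16:26:54"),
    ("David", "17:20:28", "17:20:45"),
    ("Alice", "18:16:53", "18:17:00"),
    ("Charlie", "19:33:44", "19:33:59"),
    ("Bob", "21:13:32", "21:13:43"),
    ("David", "22:36:22", "22:36:34"),
    ("Alice", "23:42:01", "23:42:11"),
]

TIME_FORMAT = "%H:%M:%S"


def map_call(record):
    """Map step: one record in, one (person, duration in seconds) pair out."""
    person, start, end = record
    elapsed = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return person, int(elapsed.total_seconds())


def shuffle(pairs):
    """Group mapped values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_durations(person, durations):
    """Reduce step: add up all durations for one person."""
    return person, sum(durations)


mapped = [map_call(record) for record in CALLS]
totals = dict(
    reduce_durations(person, durations)
    for person, durations in sorted(shuffle(mapped).items())
)
```

Because `map_call` looks at one record at a time and `reduce_durations` at one key at a time, both steps can run in parallel across machines, which is what Hadoop automates.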

Page 100: Big data on_aws in korea by abhishek sinha (lunch and learn)

Hadoop is… The MapReduce computational paradigm

Page 101: Big data on_aws in korea by abhishek sinha (lunch and learn)

Hadoop is… The MapReduce computational paradigm … implemented as an Open-source, Scalable, Fault-tolerant, Distributed System

Page 102: Big data on_aws in korea by abhishek sinha (lunch and learn)

distributed computing requires god-like engineers

Page 103: Big data on_aws in korea by abhishek sinha (lunch and learn)

distributed computing (with Hadoop) requires talented engineers (rather than god-like ones)

Page 104: Big data on_aws in korea by abhishek sinha (lunch and learn)

Launch a Hadoop cluster from the CLI (

elastic-mapreduce --create --alive \
  --instance-type m1.xlarge \
  --num-instances 5

Page 105: Big data on_aws in korea by abhishek sinha (lunch and learn)

The Hadoop Ecosystem

Page 106: Big data on_aws in korea by abhishek sinha (lunch and learn)

EMR makes it easy to use Hive and Pig

Pig:

• High-level programming language (Pig Latin)

• Supports UDFs

• Ideal for data flow/ETL

Hive:

• Data Warehouse for Hadoop

• SQL-like query language (HiveQL)

Page 107: Big data on_aws in korea by abhishek sinha (lunch and learn)

EMR makes it easy to use other tools and applications

R:

• Language and software environment for statistical computing and graphics

• Open source

Mahout:

• Machine learning library

• Supports recommendation mining, clustering, classification, and frequent itemset mining

Page 108: Big data on_aws in korea by abhishek sinha (lunch and learn)

Hive Schema on read

Page 109: Big data on_aws in korea by abhishek sinha (lunch and learn)

Launch a Hive cluster from the CLI (step 1/1)

./elastic-mapreduce --create --alive \
  --name "Test Hive" \
  --hadoop-version 0.20 \
  --num-instances 5 \
  --instance-type m1.large \
  --hive-interactive \
  --hive-versions 0.7.1

Page 110: Big data on_aws in korea by abhishek sinha (lunch and learn)

SQL Interface for working with data

Simple way to use Hadoop

Create Table statement references data location on S3

Language called HiveQL, similar to SQL

An example of a query could be: SELECT COUNT(1) FROM sometable;

Requires setting up a mapping to the input data

Uses SerDes to make different input formats queryable

Powerful data types (Array & Map..)

Page 111: Big data on_aws in korea by abhishek sinha (lunch and learn)

                        SQL                     HiveQL
Updates                 UPDATE, INSERT, DELETE  INSERT, OVERWRITE TABLE
Transactions            Supported               Not supported
Indexes                 Supported               Not supported
Latency                 Sub-second              Minutes
Functions               Hundreds                Dozens
Multi-table inserts     Not supported           Supported
Create table as select  Not valid SQL-92        Supported

Page 112: Big data on_aws in korea by abhishek sinha (lunch and learn)

./elastic-mapreduce --create \
  --name "Hive job flow" \
  --hive-script \
  --args s3://myawsbucket/myquery.q \
  --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output

HiveQL to execute

Page 113: Big data on_aws in korea by abhishek sinha (lunch and learn)

./elastic-mapreduce \
  --create \
  --alive \
  --name "Hive job flow" \
  --num-instances 5 --instance-type m1.large \
  --hive-interactive

Interactive hive session

Page 114: Big data on_aws in korea by abhishek sinha (lunch and learn)

{
  requestBeginTime: "19191901901",
  requestEndTime: "19089012890",
  browserCookie: "xFHJK21AS6HLASLHAS",
  userCookie: "ajhlasH6JASLHbas8",
  searchPhrase: "digital cameras",
  adId: "jalhdahu789asashja",
  impressionId: "hjakhlasuhiouasd897asdh",
  referrer: "http://cooking.com/recipe?id=10231",
  hostname: "ec2-12-12-12-12.ec2.amazonaws.com",
  modelId: "asdjhklasd7812hjkasdhl",
  processId: "12901", threadId: "112121",
  timers: { requestTime: "1910121", modelLookup: "1129101" },
  counters: { heapSpace: "1010120912012" }
}

Page 115: Big data on_aws in korea by abhishek sinha (lunch and learn)

{
  requestBeginTime: "19191901901",
  requestEndTime: "19089012890",
  browserCookie: "xFHJK21AS6HLASLHAS",
  userCookie: "ajhlasH6JASLHbas8",
  adId: "jalhdahu789asashja",
  impressionId: "hjakhlasuhiouasd897asdh",
  clickId: "ashda8ah8asdp1uahipsd",
  referrer: "http://recipes.com/",
  directedTo: "http://cooking.com/"
}

Page 116: Big data on_aws in korea by abhishek sinha (lunch and learn)

CREATE EXTERNAL TABLE impressions (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
  serde 'com.amazon.elasticmapreduce.JsonSerde'
  with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://mybucketsource/tables/impressions' ;

Page 117: Big data on_aws in korea by abhishek sinha (lunch and learn)

CREATE EXTERNAL TABLE impressions (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
  serde 'com.amazon.elasticmapreduce.JsonSerde'
  with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://mybucketsource/tables/impressions' ;

Table structure to create (happens fast, as it is just a mapping to the source)

Page 118: Big data on_aws in korea by abhishek sinha (lunch and learn)

CREATE EXTERNAL TABLE impressions (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
  serde 'com.amazon.elasticmapreduce.JsonSerde'
  with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://mybucketsource/tables/impressions' ;

Source data in S3
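What "schema on read" means for this table can be sketched in Python. This only illustrates the idea and is not how the Hive JSON SerDe is implemented: the column list is copied from the `paths` property above, the helper `read_row` is hypothetical, and the sample line is a strict-JSON version of the impression record shown earlier.

```python
import json

# Column order taken from the 'paths' SerDe property in the table definition.
COLUMNS = [
    "requestBeginTime",
    "adId",
    "impressionId",
    "referrer",
    "userAgent",
    "userCookie",
    "ip",
]


def read_row(json_line, columns=COLUMNS):
    """Project one raw JSON record onto the table's columns at read time.

    Fields missing from the record come back as None (NULL in Hive terms);
    fields not named in the schema are simply ignored.
    """
    record = json.loads(json_line)
    return tuple(record.get(name) for name in columns)


# The raw file is stored as-is; nothing is converted when the table is created.
raw_lines = [
    '{"requestBeginTime": "19191901901", "adId": "jalhdahu789asashja", '
    '"impressionId": "hjakhlasuhiouasd897asdh", '
    '"referrer": "http://cooking.com/recipe?id=10231", '
    '"userCookie": "ajhlasH6JASLHbas8", '
    '"hostname": "ec2-12-12-12-12.ec2.amazonaws.com"}',
]

rows = [read_row(line) for line in raw_lines]
```

Since the schema is applied only when a query reads the files, creating the table is fast and the same files can be exposed through different table definitions.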

Page 119: Big data on_aws in korea by abhishek sinha (lunch and learn)

Hadoop lowers the cost of developing a distributed system.

Page 120: Big data on_aws in korea by abhishek sinha (lunch and learn)

hive> select * from impressions limit 5;

Selecting from source data directly via Hadoop

Page 121: Big data on_aws in korea by abhishek sinha (lunch and learn)

What about the cost of operating a distributed system?

Page 122: Big data on_aws in korea by abhishek sinha (lunch and learn)

November traffic at amazon.com

Page 123: Big data on_aws in korea by abhishek sinha (lunch and learn)

November traffic at amazon.com

Page 124: Big data on_aws in korea by abhishek sinha (lunch and learn)

November traffic at amazon.com

76%

24%

Page 125: Big data on_aws in korea by abhishek sinha (lunch and learn)

Innovation #2:

Page 126: Big data on_aws in korea by abhishek sinha (lunch and learn)

EMR is Hadoop in the Cloud

What is Amazon Elastic MapReduce (EMR)?

Page 127: Big data on_aws in korea by abhishek sinha (lunch and learn)
Page 128: Big data on_aws in korea by abhishek sinha (lunch and learn)

1 instance x 100 hours = 100 instances x 1 hour
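The equality on this slide is plain arithmetic on instance-hours. The sketch below assumes a perfectly parallel job and uses a hypothetical hourly price; real jobs rarely scale this cleanly and real prices vary by instance type.

```python
def cluster_cost(instances, hours_per_instance, price_per_instance_hour):
    """Total cost when billing is per instance-hour."""
    return instances * hours_per_instance * price_per_instance_hour


def wall_clock_hours(total_instance_hours, instances):
    """Elapsed time, assuming the work divides evenly across instances."""
    return total_instance_hours / instances


PRICE = 0.50  # hypothetical price per instance-hour, not an AWS quote
WORK = 100    # instance-hours of work in the job

# (elapsed hours, total cost) for a 1-instance and a 100-instance cluster.
small = (wall_clock_hours(WORK, 1), cluster_cost(1, 100, PRICE))
large = (wall_clock_hours(WORK, 100), cluster_cost(100, 1, PRICE))
```

Both clusters cost the same under these assumptions, but the 100-instance cluster returns the answer in 1 hour instead of 100.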

Page 129: Big data on_aws in korea by abhishek sinha (lunch and learn)

How does EMR work?

Put the data into S3

Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.

Launch the cluster using the EMR console, CLI, SDK, or APIs

Get the output from S3

You can also store everything in HDFS

S3

EMR Cluster

Page 130: Big data on_aws in korea by abhishek sinha (lunch and learn)

S3

What can you run on EMR…

EMR Cluster

Page 131: Big data on_aws in korea by abhishek sinha (lunch and learn)

Resize Nodes

EMR Cluster

You can easily add and

remove nodes

Page 132: Big data on_aws in korea by abhishek sinha (lunch and learn)

On and Off Fast Growth

Predictable peaks Variable peaks

WASTE

Page 133: Big data on_aws in korea by abhishek sinha (lunch and learn)

Fast Growth On and Off

Predictable peaks Variable peaks

Page 134: Big data on_aws in korea by abhishek sinha (lunch and learn)

Your choice of tools on Hadoop/EMR

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NO SQL Store

Log Aggregation tools

Amazon EMR

Page 135: Big data on_aws in korea by abhishek sinha (lunch and learn)

SQL based processing

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NO SQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Pre-processing framework

Petabyte-scale columnar data warehouse

Page 136: Big data on_aws in korea by abhishek sinha (lunch and learn)

Massively Parallel Columnar Data Warehouses

• Columnar Data stores

• MPP

– Parallel Ingest

– Parallel Query

– Scale Out

– Parallel Backup

Page 137: Big data on_aws in korea by abhishek sinha (lunch and learn)

Columnar data stores

• Data alignment and block size in row stores vs. column stores

• Compression based on each column
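Why a column layout compresses well can be shown with a toy example. The sketch below pivots a few hypothetical rows into columns and run-length encodes each one; run-length encoding is just one simple per-column scheme chosen for illustration, not a statement about which encodings a given warehouse uses.

```python
def to_columns(rows, names):
    """Pivot row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


# Hypothetical clickstream rows: (date, country, clicks).
rows = [
    ("2013-01-01", "KR", 10),
    ("2013-01-01", "KR", 12),
    ("2013-01-01", "US", 7),
    ("2013-01-02", "US", 7),
]

columns = to_columns(rows, ["dt", "country", "clicks"])
encoded_dt = run_length_encode(columns["dt"])
encoded_country = run_length_encode(columns["country"])
```

Values within one column share a type and often repeat, so each column can get the encoding that suits it; in a row layout the same values are interleaved with unrelated fields and the repeats are not adjacent.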

Page 138: Big data on_aws in korea by abhishek sinha (lunch and learn)

MPP data warehouse parallelizes and distributes everything

• Query

• Load

• Backup

• Restore

• Resize

10 GigE

(HPC)

Ingestion

Backup

Restore

JDBC/ODBC

Page 139: Big data on_aws in korea by abhishek sinha (lunch and learn)

But Data-warehouses are

• Hard to manage

• Very expensive

• Difficult to scale

• Difficult to get performance

Page 140: Big data on_aws in korea by abhishek sinha (lunch and learn)

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Page 141: Big data on_aws in korea by abhishek sinha (lunch and learn)

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Parallelize and Distribute Everything

Dramatically Reduce I/O MPP

Load

Query

Resize

Backup

Restore

Page 142: Big data on_aws in korea by abhishek sinha (lunch and learn)

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Parallelize and Distribute Everything

Dramatically Reduce I/O MPP

Load

Query

Resize

Backup

Restore

Direct-attached storage

Large data block sizes

Column data store

Data compression

Zone maps
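The zone map idea listed above can be sketched generically: keep the minimum and maximum value of each storage block, and skip blocks whose range cannot contain the value being searched for. The block size, data, and function names below are hypothetical, and this is not a description of Redshift internals.

```python
def build_zone_map(values, block_size):
    """Record (start offset, min, max) for each fixed-size block of a column."""
    zones = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        zones.append((start, min(block), max(block)))
    return zones


def scan_equal(values, zones, block_size, target):
    """Find positions equal to target, reading only blocks that may contain it."""
    matches = []
    blocks_read = 0
    for start, low, high in zones:
        if target < low or target > high:
            continue  # the whole block is skipped without being read
        blocks_read += 1
        for offset, value in enumerate(values[start:start + block_size]):
            if value == target:
                matches.append(start + offset)
    return matches, blocks_read


# A roughly sorted column, such as timestamps, benefits the most.
timestamps = [1, 2, 3, 4, 10, 11, 12, 13, 20, 21, 22, 23]
zones = build_zone_map(timestamps, block_size=4)
matches, blocks_read = scan_equal(timestamps, zones, 4, 12)
```

Here only one of the three blocks is read, which is the sense in which zone maps reduce I/O.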

Page 143: Big data on_aws in korea by abhishek sinha (lunch and learn)

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Protect Operations

Simplify Provisioning

Redshift data is encrypted

Continuously backed up to S3

Automatic node recovery

Transparent disk failure

Page 144: Big data on_aws in korea by abhishek sinha (lunch and learn)

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Protect Operations

Simplify Provisioning

Redshift data is encrypted

Continuously backed up to S3

Automatic node recovery

Transparent disk failure

Create a cluster in minutes

Automatic OS and software patching

Scale up to 1.6PB with a few clicks and no downtime

Page 145: Big data on_aws in korea by abhishek sinha (lunch and learn)

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Start Small and Grow Big

Extra Large Node (XL)

3 spindles, 2TB, 15GiB RAM, 2 virtual cores, 10GigE

1 node (2TB) or 2-32 node cluster (64TB)

8 Extra Large Node (8XL)

24 spindles, 16TB, 120GiB RAM, 16 virtual cores, 10GigE

2-100 node cluster (1.6PB)

Page 146: Big data on_aws in korea by abhishek sinha (lunch and learn)

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Easy to provision and scale

No upfront costs, pay as you go

High performance at a low price

Open and flexible with support for popular BI tools

Page 147: Big data on_aws in korea by abhishek sinha (lunch and learn)

Amazon Redshift is priced to let you analyze all your data

                    Price Per Hour for   Effective Hourly  Effective Annual
                    HS1.XL Single Node   Price Per TB      Price per TB
On-Demand           $ 0.850              $ 0.425           $ 3,723
1 Year Reservation  $ 0.500              $ 0.250           $ 2,190
3 Year Reservation  $ 0.228              $ 0.114           $ 999

Simple Pricing

Number of Nodes x Cost per Hour

No charge for Leader Node

No upfront costs

Pay as you go
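The per-TB figures in the pricing table follow from the hourly node price and the 2TB capacity of an XL node stated earlier. The sketch below reproduces that arithmetic; the prices are the ones on the slide (which may no longer be current), and the 8,760-hour year and function names are assumptions made for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored
NODE_TB = 2                # storage per XL node, from the node-size slide

# Hourly price for a single HS1.XL node, as listed on the slide.
HOURLY_PRICE = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(hourly_node_price, node_tb=NODE_TB):
    """Return (hourly price per TB, annual price per TB) for one node."""
    per_tb_hour = hourly_node_price / node_tb
    return per_tb_hour, per_tb_hour * HOURS_PER_YEAR


def cluster_hourly_cost(nodes, hourly_node_price):
    """Simple pricing rule from the slide: number of nodes x cost per hour."""
    return nodes * hourly_node_price


table = {plan: effective_prices(price) for plan, price in HOURLY_PRICE.items()}
```

Rounding the annual figures to whole dollars gives the values in the table's last column.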

Page 148: Big data on_aws in korea by abhishek sinha (lunch and learn)

Your choice of BI Tools on the cloud

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NO SQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Pre-processing framework

Page 149: Big data on_aws in korea by abhishek sinha (lunch and learn)

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Page 150: Big data on_aws in korea by abhishek sinha (lunch and learn)

Collaboration and Sharing insights

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NO SQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Page 151: Big data on_aws in korea by abhishek sinha (lunch and learn)

Sharing results and visualizations

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NO SQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Web App Server

Visualization tools

Page 152: Big data on_aws in korea by abhishek sinha (lunch and learn)

Sharing results and visualizations and scale

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NO SQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Web App Server

Visualization tools

Page 153: Big data on_aws in korea by abhishek sinha (lunch and learn)

Sharing results and visualizations

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NO SQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Business Intelligence Tools

Business Intelligence Tools

Page 154: Big data on_aws in korea by abhishek sinha (lunch and learn)

Geospatial Visualizations

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NO SQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Business Intelligence Tools

Business Intelligence Tools

GIS tools on Hadoop

GIS tools

Visualization tools

Page 155: Big data on_aws in korea by abhishek sinha (lunch and learn)

Rinse and repeat every day or hour

Page 156: Big data on_aws in korea by abhishek sinha (lunch and learn)

Rinse and Repeat

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NO SQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Visualization tools

Business Intelligence Tools

Business Intelligence Tools

GIS tools on Hadoop

GIS tools

AWS Data Pipeline

Page 157: Big data on_aws in korea by abhishek sinha (lunch and learn)

The complete architecture

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NO SQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Visualization tools

Business Intelligence Tools

Business Intelligence Tools

GIS tools on Hadoop

GIS tools

AWS Data Pipeline

Page 158: Big data on_aws in korea by abhishek sinha (lunch and learn)

How do you start ?

Page 159: Big data on_aws in korea by abhishek sinha (lunch and learn)

Where do you start ?

• Where is your data ? (S3, SQL, NoSQL ?)

– Are you collecting all your data ?

– What is the format (structured or unstructured)

– How much is this data going to grow ?

• How do you want to process it ?

– SQL (HIVE), Scripts (Python/Ruby/Node.JS) On Hadoop ?

• How do you want to use this data

– Visualization tools

• Do it yourself or engage an AWS partner

• Write to me [email protected]

Page 160: Big data on_aws in korea by abhishek sinha (lunch and learn)

Thank You

[email protected]