Big Data on AWS in Korea by Abhishek Sinha (Lunch and Learn)

Big Data Analytics

Abhishek Sinha

Business Development Manager,

AWS

@abysinha

[email protected]

An engineer’s definition

When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share them

What does big data look like?

Volume

Velocity

Variety

3Vs

Where is this data coming from?

Human generated

Machine generated

Tweet

Surf the internet

Buy and sell products

Upload images and videos

Play games

Check in at restaurants

Search for cafes

Find deals

Watch content online

Look for directions

Use social media

Human generated

Machine generated

Networks and security devices

Mobile phones

Cell phone towers

Smart grids

Smart meters

Telematics from cars

Sensors on machines

Videos from traffic and security cameras

What are people using this for?

Big Data Verticals and Use cases

Media/Advertising

Targeted Advertising

Image and Video Processing

Oil & Gas

Seismic Analysis

Retail

Recommendations

Transactions Analysis

Life Sciences

Genome Analysis

Financial Services

Monte Carlo Simulations

Risk Analysis

Security

Anti-virus

Fraud Detection

Image Recognition

Social Network/Gaming

User Demographics

Usage analysis

In-game metrics

Why is big data hard?

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

[Figure: pipeline stages annotated "Lower cost, higher throughput" and "Highly constrained"]

Big gap in turning data into actionable information

Amazon Web Services helps remove constraints

Big Data + Cloud = Awesome Combination

Big data:
• Potentially massive datasets
• Iterative, experimental style of data manipulation and analysis
• Frequently not a steady-state workload; peaks and valleys
• Data is a combination of structured and unstructured data in many formats

AWS Cloud:
• Massive, virtually unlimited capacity
• Iterative, experimental style of infrastructure deployment/usage
• At its most efficient with highly variable workloads
• Tools for managing structured and unstructured data





Data size

• Global reach

• Native app for almost every smartphone, SMS, web, mobile-web

• 10M+ users, 15M+ venues, ~1B check-ins

• Terabytes of log data

Stack

Application Stack
Scala/Liftweb: API Machines, WWW Machines, Batch Jobs
Scala: Application code
Mongo/Postgres/Flat Files: Databases, Logs

Data Stack
mongoexport, postgres dump, Flume
Amazon S3: Database Dumps, Log Files
Hadoop: Elastic MapReduce
Hive/Ruby/Mahout: Analytics Dashboard, MapReduce Jobs

Stack – Front end Application

[Figure: application stack and data stack diagram, front-end application view]

Stack – Collection and Storage

[Figure: application stack and data stack diagram, collection and storage view]

Stack – analysis and sharing

[Figure: application stack and data stack diagram, analysis and sharing view]

Users over time

“Who is using our service?”

Identified early mobile usage

Invested heavily in mobile development

Finding signal in the noise of logs

In January 2013, 9,432,061 unique mobile devices used the Yelp mobile app.

4 million+ calls. 5 million+ directions.

Autocomplete Search

Recommendations

Automatic spelling corrections

“What kind of movies do people like?”

More than 25 Million Streaming Members

50 Billion Events Per Day

30 Million plays every day

2 billion hours of video in 3 months

4 million ratings per day

3 million searches

Device location, time, day, week, etc.

Social data

10 TB of streaming data per day

Data consumed in multiple ways

S3

EMR

Prod Cluster (EMR)

Recommendation Engine

Ad-hoc Analysis

Personalization

AWS Import/Export

Corporate data center

Amazon Elastic MapReduce

Amazon Simple Storage Service (S3)

BI Users

Clickstream data from 500+ websites and VoD platform

“Who buys video games?”

Who is Razorfish?

• Full service Digital Agency

• Developed an Ad-Serving Platform compatible with most browsers

• Clickstream analysis of data, current and historical trends, and segmentation of users

• Segmentation is used to serve ads and cross sell

• 45TB of Log data

• Problems at scale

– Giant Datasets

– Building Infrastructure requires large continuous investment

– Build for peak holiday season

– Traditional Data stores are not scaling

Per day:

3.5 billion records

13 TB of click stream logs

71 million unique cookies

Previously in 2009

Today

This happens in 8 hours every day

Why AWS + EMR

• Perfect clarity of cost

• No upfront infrastructure investment

• No client processing contention

• Without EMR/Hadoop it takes 3 days; with EMR, 8 hours

– Scalability: 1 node x 100 hours = 100 nodes x 1 hour

• Meet SLA
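The scalability line above is simple arithmetic. A minimal sketch, assuming a perfectly parallel job and a hypothetical price of $0.50 per node-hour (not an actual AWS price): the node-hours, and so the cost, stay the same while the wall-clock time shrinks.

```python
# Hypothetical price per node-hour, for illustration only (not an AWS price).
PRICE_PER_NODE_HOUR = 0.50


def cluster_run(total_node_hours, nodes):
    """Return (wall_clock_hours, cost) for a perfectly parallel job."""
    wall_clock_hours = total_node_hours / nodes
    cost = nodes * wall_clock_hours * PRICE_PER_NODE_HOUR
    return wall_clock_hours, cost


if __name__ == "__main__":
    for nodes in (1, 100):
        hours, cost = cluster_run(100, nodes)
        print(f"{nodes:>3} node(s): {hours:g} h wall clock, ${cost:.2f}")
```

Real jobs rarely parallelize perfectly, so treat the equality as an upper bound on the speed-up rather than a guarantee.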

Playfish improves in-game experience for its users through data mining

Challenge: Must understand player usage trends across 50M monthly users, multiple platforms, and tens of games, in the face of rapid growth. This drives both in-game improvements and defines what games to target next.

Solution: EMR provides Playfish the flexibility to experiment and rapidly ask new questions. All usage data is stored in S3, and analysts run ad-hoc Hive queries that can slice the data by time, game, and user.

Data Driven Game Design

Data is being used to understand what gamers are doing inside the game (behavioral analysis)

- What features people like (rely on data instead of forum posts)

- What features are abandoned

- A/B testing

- Monetization – In Game Analytics

Building a big data architecture

Design Patterns





Getting your Data into AWS

Amazon S3

Corporate Data Center

• Console Upload

• FTP

• AWS Import/Export

• S3 API

• Direct Connect

• Storage Gateway

• 3rd Party Commercial Apps

• Tsunami UDP

1

Write directly to a data source

Your application Amazon S3

DynamoDB

Any other data store

Amazon S3

Amazon EC2

2

Queue, pre-process and then write to data source

Amazon Simple Queue Service (SQS)

Amazon S3

DynamoDB
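A rough sketch of the queue, pre-process, write pattern. Everything here is an in-memory stand-in: Python's `queue.Queue` plays the role of Amazon SQS and a plain list plays the role of S3 or DynamoDB. The `user,action` event format is a made-up example.

```python
import json
import queue

# In-memory stand-ins (assumptions): a real system would use Amazon SQS
# for the queue and S3 or DynamoDB for the store.
events = queue.Queue()
data_store = []


def produce(raw_line):
    """Application side: enqueue the raw event and return immediately."""
    events.put(raw_line)


def preprocess(raw_line):
    """Worker side: parse and normalise one event (hypothetical format)."""
    user, action = raw_line.strip().split(",")
    return {"user": user.lower(), "action": action}


def drain():
    """Worker loop: take queued events, pre-process, write to the store."""
    written = 0
    while not events.empty():
        record = preprocess(events.get())
        data_store.append(json.dumps(record, sort_keys=True))
        written += 1
    return written
```

The queue decouples the application from the data store, so a burst of events only lengthens the queue instead of slowing down the application.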


3

Agency Customer: Video Analytics on AWS

Elastic Load Balancer

Edge Servers on EC2

Workers on EC2

Logs

Reports

HDFS Cluster

Amazon Simple Queue Service (SQS)

Amazon Simple Storage Service (S3)

Amazon Elastic MapReduce

Aggregate and write to data source

Flume running on EC2

Amazon S3


HDFS

4

What is Flume

• Collection and aggregation of streaming event data

– Typically used for log data, sensor data, GPS data, etc.

• Significant advantages over ad-hoc solutions

– Reliable, Scalable, Manageable, Customizable and High Performance

– Declarative, Dynamic Configuration

– Contextual Routing

– Feature rich

– Fully extensible

Typical Aggregation Flow

[Client]+ Agent [Agent]* Destination

Flume uses a multi-tier approach where multiple agents can send data to another agent which acts as an aggregator. For each agent, data can come from either an agent or a client, and can be sent to another agent or a sink.
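The multi-tier flow can be pictured with a toy model. This is not the Flume API: the `Agent` class, its batching rule and the agent names are all hypothetical, and a list stands in for the destination (S3 or HDFS).

```python
class Agent:
    """Toy agent: receives events and forwards them in batches to a sink."""

    def __init__(self, name, sink, batch_size=2):
        self.name = name
        self.sink = sink            # another agent's receive() or a final sink
        self.batch_size = batch_size
        self.buffer = []

    def receive(self, events):
        """Buffer incoming events; forward once a full batch is available."""
        self.buffer.extend(events)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send everything buffered so far to the sink."""
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.sink(batch)


destination = []                                    # stand-in for S3 or HDFS
aggregator = Agent("aggregator", destination.extend, batch_size=4)
web1 = Agent("web1", aggregator.receive)            # first-tier agents
web2 = Agent("web2", aggregator.receive)
```

Clients call `web1.receive([...])` or `web2.receive([...])`; the aggregator is the only component that writes to the destination, which matches the [Client]+ Agent [Agent]* Destination shape.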

Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html

S3 as a “single source of truth”

S3












Amazon SQS

Amazon S3

DynamoDB

Any SQL or NO SQL Store

Log Aggregation tools

Choose depending upon design

Choice of storage systems (Structure and Volume)

[Figure: storage options plotted by structure (low to high) and size (small to large): S3, RDS, DynamoDB, NoSQL, EBS]





Hadoop based Analysis

Amazon SQS

Amazon S3

DynamoDB



Amazon EMR

EMR is Hadoop in the Cloud

What is Amazon Elastic MapReduce (EMR)?

A framework that:
• Splits data into pieces
• Lets processing occur
• Gathers the results

distributed computing

[Figure: difficulty versus number of machines (1, 2, ... 10^6)]

distributed computing is hard

distributed computing requires god-like engineers

Innovation #1:


Hadoop is… The MapReduce computational paradigm … implemented as an Open-source, Scalable, Fault-tolerant, Distributed System

Person   Start     End       Duration (s)
Bob      00:44:48  00:45:11  23
Charlie  02:16:02  02:16:18  16
Charlie  11:16:59  11:17:17  18
Charlie  11:17:24  11:17:38  14
Bob      11:23:10  11:23:25  15
Alice    16:26:46  16:26:54  8
David    17:20:28  17:20:45  17
Alice    18:16:53  18:17:00  7
Charlie  19:33:44  19:33:59  15
Bob      21:13:32  21:13:43  11
David    22:36:22  22:36:34  12
Alice    23:42:01  23:42:11  10

map

Person   Duration (s)
Alice    8
Alice    7
Alice    10
Bob      23
Bob      15
Bob      11
Charlie  16
Charlie  18
Charlie  14
Charlie  15
David    12
David    17

Person   Total (s)
Alice    25
Bob      49
Charlie  63
David    29


reduce



map: Works on one record. In this case it computes “end time minus start time”, in parallel over all the records.

reduce: Groups together records with a common key (e.g. “Alice”, “Bob”) and adds up all the results.
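The same example as a single-process Python sketch. It only mirrors the shape of the computation (map, group by key, reduce); it is not Hadoop code, and the function names are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# The call records from the table above: (person, start, end).
CALLS = [
    ("Bob", "00:44:48", "00:45:11"), ("Charlie", "02:16:02", "02:16:18"),
    ("Charlie", "11:16:59", "11:17:17"), ("Charlie", "11:17:24", "11:17:38"),
    ("Bob", "11:23:10", "11:23:25"), ("Alice", "16:26:46", "16:26:54"),
    ("David", "17:20:28", "17:20:45"), ("Alice", "18:16:53", "18:17:00"),
    ("Charlie", "19:33:44", "19:33:59"), ("Bob", "21:13:32", "21:13:43"),
    ("David", "22:36:22", "22:36:34"), ("Alice", "23:42:01", "23:42:11"),
]
TIME_FORMAT = "%H:%M:%S"


def map_call(record):
    """Map: one record in, one (person, duration in seconds) pair out."""
    person, start, end = record
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return person, int(delta.total_seconds())


def group_by_key(pairs):
    """Shuffle: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_durations(person, durations):
    """Reduce: add up all durations for one person."""
    return person, sum(durations)


def run(records):
    """Run map, shuffle and reduce in sequence and return the totals."""
    mapped = [map_call(record) for record in records]
    grouped = group_by_key(mapped)
    return dict(reduce_durations(k, v) for k, v in sorted(grouped.items()))
```

Each `map_call` looks at one record only, which is why Hadoop can spread that step over many machines; only the grouping step needs to bring records for the same person together.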

distributed computing (with Hadoop) requires talented (not god-like) engineers

Launch a Hadoop cluster from the CLI

elastic-mapreduce --create --alive \
  --instance-type m1.xlarge \
  --num-instances 5

The Hadoop Ecosystem

EMR makes it easy to use Hive and Pig

Pig:
• High-level programming language (Pig Latin)
• Supports UDFs
• Ideal for data flow/ETL

Hive:
• Data Warehouse for Hadoop
• SQL-like query language (HiveQL)

EMR makes it easy to use other tools and applications

R:
• Language and software environment for statistical computing and graphics
• Open source

Mahout:
• Machine learning library
• Supports recommendation mining, clustering, classification, and frequent itemset mining

Hive Schema on read

Launch a Hive cluster from the CLI (step 1/1)

./elastic-mapreduce --create --alive \
  --name "Test Hive" \
  --hadoop-version 0.20 \
  --num-instances 5 \
  --instance-type m1.large \
  --hive-interactive \
  --hive-versions 0.7.1

SQL Interface for working with data

Simple way to use Hadoop

Create Table statement references data location on S3

Language called HiveQL, similar to SQL

An example of a query could be: SELECT COUNT(1) FROM sometable;

Requires setting up a mapping to the input data

Uses SerDes to make different input formats queryable

Powerful data types (Array & Map..)

Feature                 SQL                     HiveQL
Updates                 UPDATE, INSERT, DELETE  INSERT, OVERWRITE TABLE
Transactions            Supported               Not supported
Indexes                 Supported               Not supported
Latency                 Sub-second              Minutes
Functions               Hundreds                Dozens
Multi-table inserts     Not supported           Supported
Create table as select  Not valid SQL-92        Supported

./elastic-mapreduce --create \
  --name "Hive job flow" \
  --hive-script \
  --args s3://myawsbucket/myquery.q \
  --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output

HiveQL to execute

./elastic-mapreduce \
  --create \
  --alive \
  --name "Hive job flow" \
  --num-instances 5 --instance-type m1.large \
  --hive-interactive

Interactive hive session


{
  requestBeginTime: "19191901901",
  requestEndTime: "19089012890",
  browserCookie: "xFHJK21AS6HLASLHAS",
  userCookie: "ajhlasH6JASLHbas8",
  searchPhrase: "digital cameras",
  adId: "jalhdahu789asashja",
  impressionId: "hjakhlasuhiouasd897asdh",
  referrer: "http://cooking.com/recipe?id=10231",
  hostname: "ec2-12-12-12-12.ec2.amazonaws.com",
  modelId: "asdjhklasd7812hjkasdhl",
  processId: "12901",
  threadId: "112121",
  timers: { requestTime: "1910121", modelLookup: "1129101" },
  counters: { heapSpace: "1010120912012" }
}

http://cooking.com/recipe?id=10231




{
  requestBeginTime: "19191901901",
  requestEndTime: "19089012890",
  browserCookie: "xFHJK21AS6HLASLHAS",
  userCookie: "ajhlasH6JASLHbas8",
  adId: "jalhdahu789asashja",
  impressionId: "hjakhlasuhiouasd897asdh",
  clickId: "ashda8ah8asdp1uahipsd",
  referrer: "http://recipes.com/",
  directedTo: "http://cooking.com/"
}

http://recipes.com/

http://cooking.com/

CREATE EXTERNAL TABLE impressions (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
  serde 'com.amazon.elasticmapreduce.JsonSerde'
  with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://mybucketsource/tables/impressions' ;
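Schema on read means the column mapping is applied when a record is read, not when it is stored. A small Python sketch of that idea, using the column names from the 'paths' property above. This is not how the JsonSerde is implemented; returning None for a missing field is an assumption of this sketch, and the sample record is abbreviated from the impression log shown earlier.

```python
import json

# Column names taken from the 'paths' property of the table definition.
PATHS = ["requestBeginTime", "adId", "impressionId", "referrer",
         "userAgent", "userCookie", "ip"]


def read_row(json_line, paths=PATHS):
    """Project one raw JSON record onto the declared columns.

    Fields absent from the record come back as None (sketch assumption);
    fields not listed in `paths` are simply ignored.
    """
    record = json.loads(json_line)
    return tuple(record.get(name) for name in paths)


# Abbreviated sample record, written as strict JSON for the parser.
raw = json.dumps({
    "requestBeginTime": "19191901901",
    "adId": "jalhdahu789asashja",
    "referrer": "http://cooking.com/recipe?id=10231",
    "hostname": "ec2-12-12-12-12.ec2.amazonaws.com",
})
```

Because nothing is rewritten in S3, changing the list of paths changes what the table exposes without touching the source data.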



[Figure: the CREATE EXTERNAL TABLE statement with two callouts]

Table structure to create (happens fast, as it is just a mapping to the source)

Source data in S3

Hadoop lowers the cost of developing a distributed system.

hive> select * from impressions limit 5;

Selecting from source data directly via Hadoop

What about the cost of operating a distributed system?

November traffic at amazon.com



76%

24%

Innovation #2:

EMR is Hadoop in the Cloud

What is Amazon Elastic MapReduce (EMR)?

1 instance x 100 hours = 100 instances x 1 hour

How does EMR work?
• Put the data into S3
• Choose: Hadoop distribution, number of nodes, types of nodes, custom configs, Hive/Pig/etc.
• Launch the cluster using the EMR console, CLI, SDK, or APIs
• Get the output from S3
• You can also store everything in HDFS

What can you run on EMR…

EMR Cluster

Resize Nodes

EMR Cluster

You can easily add and remove nodes

On and Off Fast Growth

Predictable peaks Variable peaks

WASTE

Fast Growth On and Off

Predictable peaks Variable peaks

Your choice of tools on Hadoop/EMR

Amazon SQS

Amazon S3

DynamoDB



Amazon EMR

SQL based processing

Amazon SQS

Amazon S3

DynamoDB



Amazon EMR: pre-processing framework

Amazon Redshift: petabyte-scale columnar data warehouse

Massively Parallel Columnar Datawarehouses

• Columnar Data stores

• MPP

– Parallel Ingest

– Parallel Query

– Scale Out

– Parallel Backup

Columnar data stores

• Data alignment and block size in row stores vs. column stores

• Compression based on each column
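Why per-column compression works well can be shown with a toy example. The rows below are invented, and run-length encoding is only one illustrative encoding; the sketch makes no claim about which encodings a particular warehouse uses.

```python
from itertools import groupby

# Made-up rows: (date, country, clicks).
ROWS = [
    ("2013-01-01", "KR", 3),
    ("2013-01-01", "KR", 5),
    ("2013-01-01", "US", 2),
    ("2013-01-02", "US", 7),
]


def to_columns(rows):
    """Pivot row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]
```

Stored by column, similar values sit next to each other, so `run_length_encode(to_columns(ROWS)[1])` shrinks four country codes to two pairs. A query that only needs the clicks column also never has to read the other two.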

MPP data warehouse parallelizes and distributes everything
• Query
• Load
• Backup
• Restore
• Resize

10 GigE

(HPC)

Ingestion

Backup

Restore

JDBC/ODBC

But Data-warehouses are

• Hard to manage

• Very expensive

• Difficult to scale

• Difficult to get performance

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud


Parallelize and Distribute Everything (MPP)
• Load
• Query
• Resize
• Backup
• Restore

Dramatically Reduce I/O
• Direct-attached storage
• Large data block sizes
• Column data store
• Data compression
• Zone maps
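Zone maps reduce I/O by remembering the minimum and maximum value of each block, so whole blocks can be skipped. A minimal sketch of the idea with made-up function names and a tiny block size; real block sizes are far larger.

```python
def build_zone_map(values, block_size):
    """Split a column into blocks and record (min, max) for each block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    zone_map = [(min(block), max(block)) for block in blocks]
    return blocks, zone_map


def scan_greater_than(blocks, zone_map, threshold):
    """Return matching values and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block, (_, high) in zip(blocks, zone_map):
        if high <= threshold:
            continue                # no value in this block can match
        blocks_read += 1
        matches.extend(value for value in block if value > threshold)
    return matches, blocks_read
```

The benefit is largest when the column is sorted or nearly sorted, because then each block covers a narrow range of values.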


Protect Operations
• Redshift data is encrypted
• Continuously backed up to S3
• Automatic node recovery
• Transparent disk failure

Simplify Provisioning
• Create a cluster in minutes
• Automatic OS and software patching
• Scale up to 1.6PB with a few clicks and no downtime


Start Small and Grow Big

Extra Large Node (XL)

3 spindles, 2TB, 15GiB RAM

2 virtual cores, 10GigE

1 node (2TB) 2-32 node cluster (64TB)

8 Extra Large Node (8XL)

24 spindles, 16TB, 120GiB RAM

16 virtual cores, 10GigE

2-100 node cluster (1.6PB)


Easy to provision and scale

No upfront costs, pay as you go

High performance at a low price

Open and flexible with support for popular BI tools

Amazon Redshift is priced to let you analyze all your data

                    Price per hour for   Effective hourly   Effective annual
                    HS1.XL single node   price per TB       price per TB
On-Demand           $0.850               $0.425             $3,723
1 Year Reservation  $0.500               $0.250             $2,190
3 Year Reservation  $0.228               $0.114             $999
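The per-TB columns follow from the node price and the 2TB capacity of an XL node. A short sketch that reproduces them, using only the figures on these slides (they are the presentation's prices and may no longer be current); the function names are made up.

```python
HOURS_PER_YEAR = 24 * 365          # 8,760
NODE_STORAGE_TB = 2                # XL node capacity from the slide

HOURLY_PRICE = {                   # price per hour, HS1.XL single node
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def per_tb(hourly_node_price):
    """Return (effective hourly price per TB, effective annual price per TB)."""
    hourly = hourly_node_price / NODE_STORAGE_TB
    return hourly, hourly * HOURS_PER_YEAR


def cluster_cost(nodes, hourly_node_price, hours):
    """Simple pricing: number of nodes x cost per hour x hours run."""
    return nodes * hourly_node_price * hours


if __name__ == "__main__":
    for plan, price in HOURLY_PRICE.items():
        hourly, annual = per_tb(price)
        print(f"{plan}: ${hourly:.3f} per TB-hour, ${annual:,.0f} per TB-year")
```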

Simple Pricing

Number of Nodes x Cost per Hour

No charge for Leader Node

No upfront costs

Pay as you go

Your choice of BI Tools on the cloud

Amazon SQS

Amazon S3

DynamoDB



Amazon EMR: pre-processing framework

Amazon Redshift





Collaboration and Sharing insights

Amazon SQS

Amazon S3

DynamoDB



Amazon EMR

Amazon Redshift

Sharing results and visualizations

Amazon SQS

Amazon S3

DynamoDB



Amazon EMR

Amazon Redshift

Web App Server

Visualization tools

Sharing results and visualizations and scale

Amazon SQS

Amazon S3

DynamoDB



Amazon EMR

Amazon Redshift

Web App Server

Visualization tools

Sharing results and visualizations

Amazon SQS

Amazon S3

DynamoDB



Amazon EMR

Amazon Redshift

Business Intelligence Tools

Business Intelligence Tools

Geospatial Visualizations

Amazon SQS

Amazon S3

DynamoDB



Amazon EMR

Amazon Redshift

Business Intelligence Tools

Business Intelligence Tools

GIS tools on Hadoop

GIS tools

Visualization tools

Rinse Repeat every day or hour

Rinse and Repeat

Amazon SQS

Amazon S3

DynamoDB



Amazon EMR

Amazon Redshift

Visualization tools

Business Intelligence Tools

Business Intelligence Tools

GIS tools on Hadoop

GIS tools

AWS Data Pipeline

The complete architecture

Amazon SQS

Amazon S3

DynamoDB



Amazon EMR

Amazon Redshift

Visualization tools

Business Intelligence Tools

Business Intelligence Tools

GIS tools on Hadoop

GIS tools

AWS Data Pipeline

How do you start ?

Where do you start ?

• Where is your data? (S3, SQL, NoSQL?)

– Are you collecting all your data?

– What is the format (structured or unstructured)?

– How much is this data going to grow?

• How do you want to process it?

– SQL (Hive), scripts (Python/Ruby/Node.js) on Hadoop?

• How do you want to use this data?

– Visualization tools

• Do it yourself or engage an AWS partner

• Write to me [email protected]

Thank You

[email protected]
