Big Data and Hadoop in the Cloud

José Papo

Amazon Evangelist

@josepapo

HANDS-ON DEMOS AFTER THE BIG DATA SESSION

The cloud is the driver of new technology trends

Accelerating the startup boom

Optimizing the corporate world

#1 ●○○○○

We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.

We are constantly producing more data

From all types of industries

Collect, Store, Organize, Analyze &

27 TB per day Large Hadron Collider – CERN

The Role of Data is Changing

Until now, the questions you asked drove the data model

The new model is to collect as much data as possible: the “Data-First Philosophy”

Data is the new raw material for any business, on par with capital, people, and labor

Actionable Information

Generated

Available for analysis

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011

IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Data Strategist

1.1M peak requests/sec

lunch hours last year?

select productId, count(*) from page_hits where hour in (12,13) group by productId order by count(*) desc

cat *-(12|13) | cut -f3 | sort | uniq -c > out
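While the data still fits on one machine, the same lunch-hour question can be sketched in a few lines of Python. This is only an illustration: the row layout (hour, productId) and the sample rows are assumptions made for the example, not data from the talk.

```python
from collections import Counter

# Hypothetical sample of page_hits rows as (hour, productId) pairs.
page_hits = [
    (12, "p1"), (12, "p2"), (13, "p1"),
    (9, "p3"), (13, "p1"), (12, "p2"), (18, "p2"),
]

def top_lunch_products(rows, lunch_hours=(12, 13)):
    """Count hits per product during the lunch hours, most viewed first."""
    counts = Counter(product for hour, product in rows if hour in lunch_hours)
    return counts.most_common()

for product, hits in top_lunch_products(page_hits):
    print(product, hits)
```

The approach stops working for the same reason the shell pipeline does: everything passes through a single process on a single disk.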

Hit <enter>?

1 PB = 10^15 (1,000,000,000,000,000) bytes

Reading 1 PB at 50 MB/s takes about 231 days

Solution: Massively Parallel Processing
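The 231-day figure, and the effect of spreading the read across many machines, can be checked with a short calculation. The worker count of 1000 is an illustrative assumption, and the even split ignores coordination overhead.

```python
PETABYTE = 10**15          # bytes, as defined on the slide
THROUGHPUT = 50 * 10**6    # bytes per second (50 MB/s)
SECONDS_PER_DAY = 24 * 60 * 60

def days_to_read(total_bytes, bytes_per_second, workers=1):
    """Days needed to read total_bytes, assuming the work splits evenly
    across workers (idealized: no coordination overhead)."""
    return total_bytes / (bytes_per_second * workers) / SECONDS_PER_DAY

single = days_to_read(PETABYTE, THROUGHPUT)
parallel = days_to_read(PETABYTE, THROUGHPUT, workers=1000)  # 1000 is illustrative

print(f"1 worker: {single:.1f} days")
print(f"1000 workers: {parallel * 24:.1f} hours")
```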

#2 ○●○○○

HDFS: reliable storage

MapReduce: data analysis

Very large log (e.g. TBs), with lots of actions by John

Split into pieces

Process in a Hadoop cluster

Aggregate the results

John’s history
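The split, process, and aggregate steps can be sketched in plain Python. This is a single-process illustration, not Hadoop code: the log format ("user<TAB>action"), the sample lines, and the function names are all assumptions made for the example. In a real cluster each piece would be processed on a different node.

```python
from collections import defaultdict

# Hypothetical log lines in the form "user<TAB>action".
log = [
    "john\tsearch", "mary\tview", "john\tview",
    "john\tpurchase", "mary\tsearch", "john\tview",
]

def split(lines, pieces):
    """Split the log into roughly equal pieces, one per worker."""
    return [lines[i::pieces] for i in range(pieces)]

def map_piece(piece):
    """Map step: emit (user, action) pairs from one piece of the log."""
    return [tuple(line.split("\t")) for line in piece]

def reduce_pairs(mapped_pieces):
    """Reduce step: aggregate the emitted pairs into action counts per user."""
    history = defaultdict(lambda: defaultdict(int))
    for pairs in mapped_pieces:
        for user, action in pairs:
            history[user][action] += 1
    return {user: dict(actions) for user, actions in history.items()}

pieces = split(log, pieces=3)
history = reduce_pairs(map_piece(p) for p in pieces)
print(history["john"])
```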

[Diagram: worker nodes, each taking an input file through map and reduce to an output]

How can we help John?

From a very large log (e.g. TBs) to actionable insight

Deploying a Hadoop Cluster is Hard

#3 ♥

○○●○○

Elastic On Demand

Pay as you go

Focus on business

[Chart: provisioned capacity; November highlighted]

On and Off

Fast Growth

Variable Peaks

Predictable Peaks

CUSTOMER DISSATISFACTION

#4 ○○○●○

EMR is Hadoop in the Cloud

Media/Advertising: Targeted Advertising; Image and Video Processing

Oil & Gas: Seismic Analysis

Retail: Recommendations; Transactions Analysis

Life Sciences: Genome Analysis

Financial Services: Monte Carlo Simulations; Risk Analysis

Security: Anti-virus; Fraud Detection; Image Recognition

Social Network/Gaming: User Demographics; Usage Analysis; In-game Metrics

Versions: 0.20.205

Distributions: Apache Hadoop

Job Flows: Custom JAR, Cascading, Streaming

Streaming: Ruby, Perl, Python, PHP, R, Bash, C++
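With Streaming, the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines, and the framework sorts by key between the two phases. A minimal word-count sketch in Python follows. The `simulate` helper is a hypothetical local stand-in for the framework, and in a real job the mapper and reducer would be separate scripts reading standard input.

```python
from itertools import groupby

def mapper(lines):
    """Emit "word<TAB>1" for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts for each key. The input must already be sorted by key,
    which the framework does between the map and reduce phases."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def simulate(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))

for out in simulate(["big data", "big cluster"]):
    print(out)
```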

Hive: data warehouse for Hadoop; SQL-like query language

Pig: high-level programming; ideal for data flow / ETL

HBase: near-real-time key/value store for structured data

Ganglia: distributed monitoring of cluster and nodes

R: statistical computing and graphics

Mahout: machine learning library

discover Value in Data

Unknown Unknowns

Undifferentiated Heavy Lifting

Focus on business

elastic-mapreduce --create \
  --key-pair micro \
  --region eu-west-1 \
  --name MyJobFlow \
  --num-instances 5 \
  --instance-type m2.4xlarge \
  --alive \
  --log-uri s3n://mybucket/EMR/log

Instance type/count

elastic-mapreduce --create \
  --key-pair micro \
  --region eu-west-1 \
  --name MyJobFlow \
  --num-instances 5 \
  --instance-type m2.4xlarge \
  --alive \
  --pig-interactive --pig-versions latest \
  --hive-interactive --hive-versions latest \
  --hbase \
  --log-uri s3n://mybucket/EMR/log

Adding Hive, Pig and HBase to the job flow
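When the instance type, instance count, or add-ons change from run to run, the command line can be assembled from parameters. The sketch below uses only the flags shown on the slides. `build_command` is a hypothetical helper, not part of the elastic-mapreduce tool, and the command is only printed, never executed.

```python
import shlex

def build_command(name, num_instances, instance_type, log_uri,
                  key_pair="micro", region="eu-west-1", extras=()):
    """Assemble the elastic-mapreduce argument list shown on the slides.
    Only flags that appear on the slides are used."""
    args = [
        "elastic-mapreduce", "--create",
        "--key-pair", key_pair,
        "--region", region,
        "--name", name,
        "--num-instances", str(num_instances),
        "--instance-type", instance_type,
        "--alive",
    ]
    args.extend(extras)                       # e.g. ["--hbase"]
    args.extend(["--log-uri", log_uri])
    return args

cmd = build_command("MyJobFlow", 5, "m2.4xlarge", "s3n://mybucket/EMR/log",
                    extras=["--hbase"])
print(shlex.join(cmd))  # printed only; nothing is launched
```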

1 instance for 1000 hours

1000 instances for 1 hour
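Under per-instance-hour pricing, the two options cost the same, so the only difference is how long you wait. A quick check follows. The $0.50 hourly rate is an assumption for illustration.

```python
HOURLY_RATE = 0.50  # assumed price per instance-hour, for illustration only

def cost(instances, hours, rate=HOURLY_RATE):
    """On-demand cost is instance-hours times the hourly rate."""
    return instances * hours * rate

print(cost(1, 1000))   # one instance for 1000 hours
print(cost(1000, 1))   # 1000 instances for one hour
```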

…to Thousands

Turn Off the Resources and Stop Paying

Source: IDC Whitepaper, sponsored by Amazon, “The Business Value of Amazon Web Services Accelerates Over Time.” July 2012

70% lower 5 year TCO per app

On-premises

$3.01M

$0.90M

50% reduction in analytics costs

Save more money by using Spot Instances

Without Spot: 4 instances * 14 hrs * $0.50 = $28

EMR with Spot Instances

With Spot: 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75

Total = $22.75

Time -50%, Cost -19%
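The Spot comparison can be reproduced from the slide's own figures. Rates, instance counts, and hours below are taken from the slide, and the percentages are computed directly from the two totals.

```python
ON_DEMAND_RATE = 0.50  # $ per instance-hour, from the slide
SPOT_RATE = 0.25       # $ per instance-hour, from the slide

# 4 on-demand instances for the full 14 hours.
without_spot = 4 * 14 * ON_DEMAND_RATE

# 4 on-demand plus 5 Spot instances finish in 7 hours.
with_spot = 4 * 7 * ON_DEMAND_RATE + 5 * 7 * SPOT_RATE

time_saving = 1 - 7 / 14
cost_saving = 1 - with_spot / without_spot

print(f"Without Spot: ${without_spot:.2f}")
print(f"With Spot:    ${with_spot:.2f}")
print(f"Time saved: {time_saving:.0%}, cost saved: {cost_saving:.2%}")
```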

#5 ○○○○●

“What kind of movies do people like?”

More than 25 Million Streaming Members

50 Billion Events Per Day

30 Million plays every day

2 billion hours of video in 3 months

4 million ratings per day

3 million searches

Device location, time, day, week, etc.

Social data

10 TB of streaming data per day

~1 PB of data stored in Amazon S3

Wide range of processing languages used

[Diagram: S3 and Prod Cluster (EMR)]

Data consumed in multiple ways: Recommendation Engine, Ad-hoc Analysis, Personalization

[Diagram: Prod Cluster (EMR) and Query Cluster (EMR)]

Durability

Versioning

Foursquare…

33 million users, 1.3 million businesses

…generates a lot of data: 3.5 billion check-ins, 15M+ venues, terabytes of log data

Uses EMR for:

Evaluation of new features

Machine learning

Exploratory analysis

Daily customer usage reporting

Long-term trend analysis

Benefits of EMR

Ease of use: “We have decreased the processing time for urgent data-analysis”

Flexibility: to deal with changing requirements and dynamically expand reporting clusters

Costs: “We have reduced our analytics costs by over 50%”

[Architecture diagram]

Application: Scala/Liftweb API machines, WWW machines, batch jobs (Scala application code)

Databases and logs: Mongo/Postgres/flat files

Export: mongoexport, postgres dump, Flume

Amazon S3: database dumps, log files

Hadoop (Elastic MapReduce): Hive/Ruby/Mahout MapReduce jobs, analytics dashboard

[Charts: venue data by gender (Female, Male) for Gorilla Coffee, Gray's Papaya, and Amorino; by day, Thursday to Sunday]

Python library: https://github.com/Yelp/mrjob

Log files

250 EMR clusters spun up and down every week

Common Crawl

1000 Genomes Project

Census Data

54 other datasets

http://aws.amazon.com/publicdatasets/

Challenge: Large amounts of computing resources needed for short periods of time; significant data storage costs

Solution: Clusters of hundreds of nodes on EMR, running 4-5 hours at a time. Leverages the 1000 Genomes public data set on AWS: free access to ~200 TB of genomes for over 2,600 people from 26 populations around the world.

Challenge: Volatile weather is deadly to crops like grapes

Solution: Built a predictive model based on freely available data: 60 years of crop data, 14 TB of soil data, and 1M government Doppler radar points. 50 EMR clusters process new data as it comes into S3 each day, continuously updating the model.

150B Soil Observations

3M Daily Weather Measurements

850K Precision Rainfall Grids Tracked

200 TB in Amazon S3

Big Data and AWS Cloud

Elastic and scalable + No upfront CapEx + Pay per use + On demand = Remove constraints

Remove constraints = More experimentation

More experimentation = More innovation

Focus on your business

Leave undifferentiated heavy lifting to us

THANK YOU!

slideshare.net/AmazonWebServicesLATAM

http://aws.amazon.com/es/big-data/

José Papo

AWS Tech Evangelist

@josepapo
