We will explore the strengths and limitations of Hadoop for analyzing large data sets and review the growing ecosystem of tools for augmenting, extending, or replacing Hadoop MapReduce. We will introduce the Amazon Elastic MapReduce (EMR) platform as the big data foundation for Hadoop and beyond, with specific examples of running machine learning (Mahout), graph analytics (Giraph), and statistical analysis (R) on EMR. We will also discuss big data analytics and visualization of results with Amazon Redshift plus third-party business intelligence tools, as well as a typical end-to-end Big Data workflow on AWS. We will conclude with real-world examples from ICAO of Big Data analytics for aviation safety data on AWS. The integrated Safety Trend Analysis and Reporting System (iSTARS) is a web-based system linking a collection of safety datasets and related web applications to perform online safety and risk analysis. It uses AWS EC2, S3, EMR, and related partner tools for continuous data aggregation and filtering.
• Scale to infinity – Big Data constraints
• Strengths and limitations – the Hadoop ecosystem
• Real-time analytics – Big Data partner solutions
• Workflow automation
Building the Square Kilometer Array (SKA) - the Biggest Radio Telescope
SKA will process as much data every day as the world currently produces in a year
Using AWS and crowd-sourced CPUs to analyze 400-500 galaxies simultaneously
Mobile / Cable Telecom
Oil & Gas Industrial
Manufacturing
Retail/Consumer Entertainment
Hospitality
Life Sciences Scientific
Exploration
Financial Services
Publishing Media
Advertising
Online Media Social Network
Gaming
Unstructured data growth is explosive, with estimates of compound annual growth rate (CAGR) at 62% (Source: IDC)
[Figure: the growing gap between generated data and data available for analysis, 1990–2020. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"]
Remove constraints: 100 instances × 1 hour, or 1,000 instances × 1 hour

No upfront capital + On-demand services + Elastic and scalable + Pay for what you use = AWS removes constraints
Remove Constraints
Big Data Constraints
• Volume: massive datasets
• Variety: requiring new tools
• Velocity: iterative, experimental data manipulation and analysis
• Time to results: more critical than absolute performance

AWS Cloud Computing
• Virtually unlimited resources
• Variety of compute solutions
• Iterative, experimental usage/deployment of infrastructure
• Get faster results with effective parallel autonomous projects
One tool to
rule them all
Foundation Services: Compute (VMs, auto-scaling, and load balancing), Storage (object, block, and archive), Networking, Security & Access Control
Infrastructure: Regions, Availability Zones, CDN and points of presence
Platform Services:
• Databases: relational, NoSQL, caching
• Analytics: Hadoop, real-time, data warehouse, data workflows
• App Services: queuing, orchestration, app streaming, transcoding, search
• Deployment & Management: containers, dev/ops tools, resource templates, usage tracking, monitoring and logs
• Mobile Services: identity, sync, mobile analytics, notifications
Enterprise Applications: virtual desktops, collaboration and sharing
Courtesy: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
[Diagram: EMR cluster – MapReduce on YARN over HDFS, with Amazon S3 for input and output]
Choose: Hadoop distribution, number of nodes, node types, custom configs, Hive/Pig/etc.
Getting Started: http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-overview.html
1. Put the data into Amazon S3
2. Launch the cluster using the console, CLI, SDK, or API
3. Get the output from Amazon S3
You can easily add and remove nodes. You can also store everything in HDFS.
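The three steps above can be sketched with the AWS CLI. The bucket and cluster names here are hypothetical, and a small wrapper echoes each command instead of executing it unless RUN=1 is set, so the sketch can be read (and dry-run) without an AWS account:

```shell
#!/bin/sh
# Dry-run sketch of the S3 -> EMR -> S3 workflow with the AWS CLI.
# BUCKET and the cluster settings are hypothetical placeholders.
BUCKET="s3://my-demo-bucket"

# Echo each command instead of executing it unless RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi
}

# 1. Put the data into Amazon S3
run aws s3 cp input.txt "$BUCKET/input/"

# 2. Launch the cluster: AMI version, node count/type, applications
run aws emr create-cluster --name "demo-cluster" --ami-version 3.3 \
  --instance-type m1.large --instance-count 3 \
  --applications Name=Hive Name=Pig

# 3. Get the output from Amazon S3
run aws s3 cp "$BUCKET/output/" ./results --recursive
```

Resizing an existing cluster (adding or removing nodes) maps to `aws emr modify-instance-groups` on the same cluster.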
References:
http://aws.amazon.com/elasticmapreduce/getting-started/
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html
• Hadoop 2.4.0
• Hive, Pig, HBase, Impala, Ganglia
• Encryption
• Consistent view on every cluster node
Basic statistics are suitable for Hadoop.
Some other Big Data problems are not suitable for Hadoop:
• dependencies between records
• splits that are interrelated
• access to data across splits
• iterative computations
Courtesy: http://www.amazon.com/Big-Data-Analytics-Beyond-Hadoop/dp/0133837947/
• Accumulo – cell-based access control NoSQL
• Avro – data serialization system
• Cascading – alternative language APIs on MR
• Cassandra – multi-master NoSQL DB
• Chukwa – data collection system at scale
• Flume – collecting, aggregating, moving logs
• Giraph – iterative graph processing system
• HBase – large table NoSQL DB
• HDFS – distributed file system
• Hive – SQL on MapReduce Data Warehouse
• Mahout – scalable machine learning library
• MapReduce – parallel processing on YARN
• Nutch – web crawler software
• Pig – high-level scripting on MapReduce
• R - statistical computing and graphics
• Spark – general compute engine on YARN
• Sqoop – transferring data to/from RDBMS
• Tez – data-flow programming on YARN
• Thrift – build scalable cross-language services
• ZooKeeper – high-performance coordination
Courtesy: http://www.apache.org/
R: scripting, statistical analysis, a mixture of paradigms – but single-machine, single-thread.
Hadoop offers a path to scale R computation to distributed systems.
Courtesy: http://www.r-project.org/
http://www.amazon.com/Learning-R-Richard-Cotton/dp/1449357105/
R on every node via Hadoop Streaming
Revolution Analytics RHadoop: rmr (mapreduce()), rhdfs, rhbase
RStudio
References:
http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UseCase_Streaming.html
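Hadoop Streaming, referenced above, runs any executable (R scripts included) as the mapper and reducer over stdin/stdout. A minimal local stand-in for a streaming word count, using shell pipelines with `sort` playing the role of the shuffle phase:

```shell
#!/bin/sh
# Local simulation of a Hadoop Streaming job: mapper and reducer are
# ordinary pipelines over stdin/stdout; sort stands in for the shuffle.
printf 'big data\nbig cloud\n' > input.txt

# Mapper: emit one "word<TAB>1" pair per word
tr ' ' '\n' < input.txt | awk '{print $1 "\t" 1}' > mapped.txt

# Shuffle + reduce: group identical keys and sum their counts
sort mapped.txt | uniq -c | awk '{print $2 "\t" $1}' > reduced.txt

cat reduced.txt   # big:2, cloud:1, data:1
```

On EMR the same mapper and reducer scripts would be passed to a streaming step, with S3 paths for input and output.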
Mahout: scalable machine learning library
• Collaborative filtering (recommender engines), e.g. for movies, books, etc., based on comparing user preferences
• Clustering (unsupervised learning), e.g. identifying groupings of related news stories based on input data properties
• Classification (supervised learning, or predictive analytics), e.g. spam filtering based on training spam data
Courtesy: http://mahout.apache.org
References:
http://mahout.apache.org/users/classification/twenty-newsgroups.html
http://blogs.aws.amazon.com/bigdata/post/Tx1TDK3HHBD4EZL/Building-a-Recommender-with-Apache-Mahout-on-Amazon-Elastic-MapReduce-EMR
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html
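As a concrete sketch of the collaborative-filtering case: Mahout's item-based recommender job consumes `userID,itemID,preference` triples. The ratings below are fabricated sample data, and the `mahout` command is echoed rather than executed, since it needs a Mahout install (e.g. on an EMR cluster):

```shell
#!/bin/sh
# Sketch of input preparation for Mahout's item-based recommender.
# The ratings are fabricated sample data: userID,itemID,preference.
cat > ratings.csv <<'EOF'
1,101,5.0
1,102,3.0
2,101,2.0
2,103,4.0
EOF

# On a cluster with Mahout installed, a run would look roughly like:
echo mahout recommenditembased --input ratings.csv --output recs \
     --similarityClassname SIMILARITY_COOCCURRENCE
```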
Giraph:
• Developed by Yahoo! based on Google Pregel (PageRank)
• Customized by Facebook to scale to the full friendship graph (~1B vertices, ~100B edges)
• Single vertex-centric API
• Bulk Synchronous Parallel (BSP) machine
• ZooKeeper-enforced atomic barrier
• Iterations performed in memory
• Runs in mappers, or natively on YARN
Courtesy: http://giraph.apache.org
http://giraph.apache.org/pagerank.html
https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
[Figure: Single Source Shortest Path example – values sent as messages (blue), BSP superstep vertex updates (red)]
Running Giraph on EMR:
• configure-hadoop bootstrap action
• Apache ZooKeeper
• Giraph source: http://git-wip-us.apache.org/repos/asf/giraph.git
• Build with Maven 3 into a JAR file (the Giraph jar)
References:
http://giraph.apache.org/apidocs/org/apache/giraph/examples/SimplePageRankComputation.html
http://giraph.apache.org/quick_start.html
http://giraph.apache.org/build.html
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
Getting Started: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html
Amazon Redshift architecture:
Leader node
• SQL clients/BI tools connect via a PostgreSQL endpoint (JDBC/ODBC)
• Stores metadata, coordinates queries
Compute nodes (128 GB RAM, 16 TB disk, 16 cores each; 10 GigE HPC interconnect)
• Local, columnar storage
• Execute queries in parallel
• Amazon S3 for ingestion, backup, and restore
• Integration with Amazon DynamoDB, EMR, Kinesis
JDBC/ODBC: connect using drivers from PostgreSQL.org
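Because the leader node speaks the PostgreSQL wire protocol, standard PostgreSQL clients work. A connection sketch with `psql`: the hostname, database, and user are placeholders, while 5439 is Redshift's default port. The command is printed rather than executed, since it needs a live cluster:

```shell
#!/bin/sh
# Connecting to the Redshift leader node's PostgreSQL endpoint.
# HOST, database name, and user are placeholders; 5439 is the default port.
HOST="examplecluster.abc123.us-east-1.redshift.amazonaws.com"
CMD="psql -h $HOST -p 5439 -d dev -U masteruser"
echo "$CMD"
```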
Amazon Redshift: 2 years old, used all over the world
1,900 products, 200 of which allow BYOL – BI tools
MongoDB on AWS (Architecture Whitepaper): Running MongoDB on Amazon EC2
• Can easily launch a multi-node replica set
• Keep JSON templates in source control
• AWS CloudFormation JSON templates: https://mongodb-documentation.readthedocs.org/en/latest/ecosystem/tutorial/automate-deployment-with-cloudformation.html
• AMI in AWS Marketplace, no extra cost
Running Cloudera EDH on Amazon EC2 (Cloudera on AWS Product Brief)
• Cloudera Enterprise Data Hub on AWS
• Deploy via Cloudera Director, manage via Cloudera Manager
• AWS CloudFormation JSON templates
• Cloudera Enterprise Reference Architecture on AWS
http://aws.amazon.com/about-aws/whats-new/2014/10/15/clouderas-enterprise-data-hub-edh-on-aws-quick-start/
[Diagram: moving data from the corporate data center to the AWS cloud – VPN Connection or AWS Direct Connect; logs/files via S3 Multipart Upload to Amazon S3; AWS Import/Export; source DBs to Amazon RDS; Amazon Glacier; Amazon Kinesis; Amazon DynamoDB; Amazon Redshift; remote loading over SSH into Amazon Elastic MapReduce from Amazon EC2 or on-premises hosts]
[Diagram: end-to-end Big Data workflow on AWS – the corporate data center DB sends data warehouse extracts to Amazon Redshift (PostgreSQL/ODBC/JDBC); social media arrives via the Gnip Data Collector and Amazon Kinesis; log files and unstructured data land in Amazon S3; Amazon EMR/Spark/R/Mahout/Giraph processes the data with Sqoop, Hive (Redshift COPY), and Amazon DynamoDB; AWS Data Pipeline and Amazon SWF orchestrate the workflow; visualization and analysis (Tableau, Jaspersoft, etc.) and presentation tools connect over ODBC/JDBC (Hive/Shark)]
[Diagram: the same workflow with Cloudera EDH on Amazon EC2/Spark/R/Mahout/Giraph in place of Amazon EMR, plus MongoDB on AWS – data warehouse extracts into Amazon Redshift (PostgreSQL/ODBC/JDBC), social media via the Gnip Data Collector and Amazon Kinesis, log files and unstructured data in Amazon S3, Sqoop and Hive/Shark over ODBC/JDBC, orchestration via AWS Data Pipeline and Amazon SWF, visualization and analysis (Tableau, Jaspersoft, etc.) and presentation tools]
[Diagram: workflow variant with MongoDB on AWS storing JSON documents alongside Amazon EMR/Spark/R/Mahout/Giraph – data warehouse extracts (PostgreSQL/ODBC/JDBC), social media via the Gnip Data Collector and Amazon Kinesis into Amazon S3, log files and unstructured data, Sqoop, Hive/Shark over ODBC/JDBC, Amazon SWF and AWS Data Pipeline orchestration, presentation tools]
Summary:
• Data grows exponentially
• AWS removes the Big Data constraints (the three "V"s: volume, variety, velocity)
• Hadoop in the cloud
• Real-time data analytics
• An agile Big Data platform
[Diagram: iSTARS deployment – ICAO Headquarters and ICAO Regional Offices, with in-house and cloud components kept in sync; a basic UI with create/read/update/delete access to the data, and a fancy read-only UI for metrics; workflow: Collect → Map/Reduce → Publish, ordered by key priority]
Runs on Amazon EC2; use Linux crontab to schedule. Make one XML element per line for Amazon EMR, then upload to Amazon S3:
tr -d "\n" | tr -d "\r" | sed "s#<Accident>#\n<Accident>#g"
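The one-element-per-line transform above can be tried locally on a small fabricated sample; GNU sed is assumed, since it interprets \n in the replacement text:

```shell
#!/bin/sh
# Demo of the one-XML-element-per-line transform on sample data
# (the <Accident> records are made up; GNU sed is assumed).
printf '<Accidents><Accident><Id>1</Id></Accident>\n<Accident><Id>2</Id></Accident></Accidents>\n' > sample.xml

# Strip all line breaks, then start a new line at each <Accident> record
tr -d "\n" < sample.xml | tr -d "\r" \
  | sed "s#<Accident>#\n<Accident>#g" > lines.xml

grep -c "^<Accident>" lines.xml   # one record per line: prints 2
```

With each record on its own line, EMR's default line-oriented input split can hand whole records to mappers.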
--put /home/ec2-user/key/newtest.pem --to /home/hadoop
Put the SSH key on the Hadoop master if you need to remote shell in.
s3://elasticmapreduce/libs/script-runner/script-runner.jar
Use script-runner to move the results from Amazon S3 to somewhere else.
[Diagram: the pipeline combines Amazon EC2, Amazon EMR, and Amazon S3]
Learn from AWS big data experts
blogs.aws.amazon.com/bigdata
BDT205: Your First Big Data Application on AWS
BDT403: Netflix's Next Generation Big Data Platform
BDT305: Lessons Learned and Best Practices for Running Hadoop on AWS
Please give us your feedback on this session.
Complete session evaluations and earn re:Invent swag.
http://bit.ly/awsevals