(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR and Analyzing Results with Amazon Redshift | AWS re:Invent 2014

DESCRIPTION

We will explore the strengths and limitations of Hadoop for analyzing large data sets and review the growing ecosystem of tools for augmenting, extending, or replacing Hadoop MapReduce. We will introduce the Amazon Elastic MapReduce (EMR) platform as the big data foundation for Hadoop and beyond, with specific examples of running Machine Learning (Mahout), Graph Analytics (Giraph), and Statistical Analysis (R) on EMR. We will also discuss big data analytics and visualization of results with Amazon Redshift plus third-party business intelligence tools, as well as a typical end-to-end Big Data workflow on AWS. We will conclude with real-world examples from ICAO of Big Data analytics for aviation safety data on AWS. The integrated Safety Trend Analysis and Reporting System (iSTARS) is a web-based system linking a collection of safety datasets and related web applications to perform online safety and risk analysis. It uses AWS EC2, S3, EMR, and related partner tools for continuous data aggregation and filtering.

Citation preview

Page 1:
Page 2:

scale to infinity

Big Data constraints

strengths or limitations

Hadoop ecosystem

real-time analytics

Big Data partner solutions

workflow automation

Page 3:
Page 4:
Page 5:

Building the Square Kilometer Array (SKA) - the Biggest Radio Telescope

SKA will process as much data every day as the world currently produces in a year

Using AWS and crowd-sourced CPUs to analyze 400-500 galaxies simultaneously

Page 6:

Mobile / Cable Telecom

Oil & Gas Industrial

Manufacturing

Retail/Consumer Entertainment

Hospitality

Life Sciences Scientific

Exploration

Financial Services

Publishing Media

Advertising

Online Media Social Network

Gaming

Page 7:

Unstructured data growth explosive, with estimates of compound annual growth (CAGR) at 62%

Source: IDC

[Chart: Data volume gap, 1990 to 2020: generated data vs. data available for analysis]

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011

IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Page 8:
Page 9:

Remove Constraints

100 instances x 1 hour

Page 10:

Remove Constraints

1000 instances x 1 hour

Page 11:

Remove Constraints

No upfront capital + On-demand services + Elastic and scalable + Pay for what you use = AWS removes constraints

Page 12:

Big Data Constraints

• Volume: massive datasets

• Variety: requiring new tools

• Velocity: iterative, experimental data manipulation and analysis

• Time to results: more critical than absolute performance

AWS Cloud Computing

• Virtually unlimited resources

• Variety of compute solutions

• Iterative, experimental usage/deployment of infrastructure

• Get faster results with effective parallel autonomous projects

Page 13:

One tool to rule them all

Page 14:

[Diagram: AWS service stack]

Infrastructure: Regions, Availability Zones, CDN and Points of Presence

Foundation Services: Compute (VMs, Auto-scaling and Load Balancing), Storage (Object, Block and Archive), Networking, Security & Access Control

Platform Services

Databases: Relational, NoSQL, Caching

Analytics: Hadoop, Real-time, Data warehouse, Data Workflows

App Services: Queuing, Orchestration, App streaming, Transcoding, Email, Search

Deployment & Management: Containers, Dev/ops Tools, Resource Templates, Usage Tracking, Monitoring and Logs

Mobile Services: Identity, Sync, Mobile Analytics, Notifications

Enterprise Applications: Virtual Desktops, Collaboration and Sharing

Page 15:
Page 16:

Courtesy: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html

HDFS

YARN

MapReduce

Page 17:

EMR Cluster

S3

Put the data into Amazon S3

Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.

Launch the cluster using the console, CLI, SDK, or API

You can easily add and remove nodes

You can also store everything in HDFS

Get the output from Amazon S3

Getting Started: http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-overview.html
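
The "launch the cluster" step on this slide might look roughly like the following through the SDK. This is a minimal sketch, assuming the Python SDK (boto3), which the slides do not name. The cluster name, log bucket, instance type, node count, and release label are hypothetical, and `ReleaseLabel` postdates the AMI versions referenced in this 2014 talk. The request is built as a plain dictionary so it can be inspected without AWS credentials.

```python
def build_cluster_request(name, log_uri, release_label,
                          instance_type="m3.xlarge", node_count=3):
    """Build keyword arguments for the EMR RunJobFlow API.

    All values passed in are illustrative assumptions, not taken from the talk.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return {
        "Name": name,
        "LogUri": log_uri,                    # S3 location for cluster logs
        "ReleaseLabel": release_label,        # selects the Hadoop distribution
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": node_count,      # total number of nodes
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EC2 instance profile
        "ServiceRole": "EMR_DefaultRole",      # default EMR service role
    }


def launch(request):
    """Submit the request; needs boto3 and AWS credentials (not run here)."""
    import boto3  # imported lazily so the builder works without the SDK
    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]


request = build_cluster_request(
    "example-cluster", "s3://example-bucket/logs/", "emr-6.15.0", node_count=5
)
```

Keeping the request separate from the call makes it easy to review the choices listed on the slide (distribution, number and types of nodes) before anything is provisioned.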

Page 18:

References:

http://aws.amazon.com/elasticmapreduce/getting-started/

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html

Hadoop 2.4.0

Hive, Pig, HBase, Impala, Ganglia

encryption

Consistent view

on every cluster node

Page 19:

Basic statistics are suitable for Hadoop

Some other Big Data problems are not suitable for Hadoop:

dependencies

splits are interrelated

access data across splits

iterative computations

Courtesy: http://www.amazon.com/Big-Data-Analytics-Beyond-Hadoop/dp/0133837947/
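
To illustrate why basic statistics suit Hadoop, here is a minimal sketch in plain Python (no Hadoop involved; the function names and sample splits are hypothetical). Each split is summarized independently into a small tuple, and the tuples are merged in any order, so no split needs data from another split. Iterative or interdependent computations lack this property.

```python
from functools import reduce


def map_split(values):
    """Summarize one split as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Merge two partial summaries; associative and commutative."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finalize(summary):
    """Turn the merged summary into mean and population variance."""
    count, total, total_sq = summary
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance


# Three splits that could live on three different nodes.
splits = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = finalize(reduce(combine, map(map_split, splits)))
```

The sum-of-squares formula is used here for brevity; it can lose precision on large values, so a production job would likely prefer a numerically stable merge.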

Page 20:
Page 21:

• Accumulo – cell-based access control NoSQL

• Avro – data serialization system

• Cascading – alternative language APIs on MR

• Cassandra – multi-master NoSQL DB

• Chukwa – data collection system at scale

• Flume – collecting, aggregating, moving logs

• Giraph – iterative graph processing system

• HBase – large table NoSQL DB

• HDFS – distributed file system

• Hive – SQL on MapReduce Data Warehouse

• Mahout – scalable machine learning library

• MapReduce – parallel processing on YARN

• Nutch – web crawler software

• Pig – high-level scripting on MapReduce

• R – statistical computing and graphics

• Spark – general compute engine on YARN

• Sqoop – transferring data to/from RDBMS

• Tez – data-flow programming on YARN

• Thrift – build scalable cross-language services

• ZooKeeper – high-performance coordination

Courtesy: http://www.apache.org/

Page 22:

Page 23:

scripting

statistical analysis

mixture of paradigms

single-machine, single-thread

Hadoop offers a path to scale R computation to distributed systems

Courtesy: http://www.r-project.org/

http://www.amazon.com/Learning-R-Richard-Cotton/dp/1449357105/

Page 24:

R on every node

Hadoop Streaming

Revolution Analytics

RHadoop

rmr mapreduce()

rhdfs

rhbase

RStudio

References:

http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UseCase_Streaming.html
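
Hadoop Streaming, listed on this slide, runs any executable that reads lines from standard input and writes tab-separated key/value lines to standard output, which is how an R script can act as mapper or reducer. The sketch below shows that contract in Python rather than R, purely as an illustration; the word-count task and the function names are assumptions, not something from the talk.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum counts per key; the input must already be sorted by key,
    which is what the framework guarantees between map and reduce."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__":
    # Hypothetical usage: `script.py map` or `script.py reduce`.
    stage = sys.argv[1] if len(sys.argv) > 1 else ""
    if stage == "map":
        sys.stdout.writelines(out + "\n" for out in mapper(sys.stdin))
    elif stage == "reduce":
        sys.stdout.writelines(out + "\n" for out in reducer(sys.stdin))
```

Locally, the shuffle can be imitated by sorting the mapper output before handing it to the reducer.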

Page 25:

scalable machine learning library

Collaborative filtering (recommender engines), e.g. for movies, books, etc., based on comparing user preferences

Clustering (unsupervised learning), e.g. identify groupings of related news stories based on input data properties

Classification (supervised learning or predictive analytics), e.g. spam filtering based on training spam data

Courtesy: http://mahout.apache.org
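
As a rough illustration of collaborative filtering "based on comparing user preferences", here is a minimal single-machine sketch in Python. It is not Mahout's implementation or API; the user-based cosine-similarity approach, the function names, and the sample ratings are all assumptions chosen for brevity.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two {item: rating} dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated, weighting each other user's
    rating by that user's similarity to the target user."""
    scores = {}
    for other, prefs in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], prefs)
        if similarity <= 0:
            continue  # ignore users with nothing in common
        for item, rating in prefs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


ratings = {
    "alice": {"Alien": 5, "Brazil": 4},
    "bob": {"Alien": 5, "Brazil": 4, "Casablanca": 5},
    "carol": {"Dune": 5},
}
suggestions = recommend(ratings, "alice")
```

Here "bob" shares tastes with "alice", so his extra title is suggested, while "carol" has no overlap and contributes nothing.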

Page 26:

Mahout

References:

http://mahout.apache.org/users/classification/twenty-newsgroups.html

http://blogs.aws.amazon.com/bigdata/post/Tx1TDK3HHBD4EZL/Building-a-Recommender-with-Apache-Mahout-on-Amazon-Elastic-MapReduce-EMR

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html

Page 27:

Developed by Yahoo! based on Google Pregel (page rank)

Customized by Facebook to scale on the full friendship graph (~1B vertices and ~100B edges)

Single-vertex-centric API

Bulk Synchronous Parallel machine

ZooKeeper enforced atomic barrier

Iterations performed in memory

Runs in mappers, or native YARN

Courtesy: http://giraph.apache.org

http://giraph.apache.org/pagerank.html

https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920

Single Source Shortest Path Example

values sent as messages (blue)

BSP superstep vertex updates (red)
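
The single source shortest path example can be imitated on one machine to show the vertex-centric, superstep-by-superstep flow. This is a minimal sketch in Python, not Giraph's Java API; the graph, the function name, and the dictionary-based message passing are assumptions. Each loop iteration is one superstep: vertices read their incoming messages, update their value if a shorter distance arrived, and send new candidate distances to neighbors.

```python
import math


def shortest_paths(edges, source):
    """edges maps each vertex to a list of (neighbor, weight) pairs."""
    vertices = set(edges) | {n for outs in edges.values() for n, _ in outs}
    distance = {v: math.inf for v in vertices}
    inbox = {source: [0.0]}          # messages delivered in this superstep
    supersteps = 0
    while inbox:                     # stop when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best            # vertex update
                for neighbor, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox               # barrier: messages arrive next superstep
        supersteps += 1
    return distance, supersteps


graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
distance, supersteps = shortest_paths(graph, "a")
```

In the sample graph, vertex "c" first receives 4 directly from "a" and is then improved to 3 via "b" one superstep later.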

Page 28:

configure-hadoop

Apache Zookeeper

Giraph http://git-wip-us.apache.org/repos/asf/giraph.git

Maven 3

JAR file

Giraph jar

References:

http://giraph.apache.org/apidocs/org/apache/giraph/examples/SimplePageRankComputation.html

http://giraph.apache.org/quick_start.html

http://giraph.apache.org/build.html

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html

Page 29:
Page 30:

Getting Started: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html

[Diagram: Amazon Redshift cluster. SQL clients/BI tools connect over JDBC/ODBC to the leader node; the leader node and three compute nodes (each 128GB RAM, 16TB disk, 16 cores) are linked by 10 GigE (HPC); compute nodes handle ingestion, backup, and restore with Amazon S3]

Leader node (SQL clients, BI tools access)

PostgreSQL endpoint

Stores metadata

Coordinates queries

Compute nodes

Local, columnar storage

Execute queries in parallel

Amazon S3 load, backup/restore

Integration with Amazon DynamoDB, EMR, Kinesis
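
The "Amazon S3 load" path is typically driven by a Redshift `COPY` statement issued through the leader node. Below is a minimal sketch in Python that only assembles such a statement as text. The table name, S3 prefix, and role ARN are hypothetical, and authorizing with `IAM_ROLE` is an assumption (the talk does not say how credentials were supplied). The inputs are treated as trusted; nothing here escapes untrusted values.

```python
def build_copy_statement(table, s3_prefix, iam_role_arn, delimiter="|"):
    """Return a Redshift COPY statement that loads delimited files from S3."""
    if not s3_prefix.startswith("s3://"):
        raise ValueError("COPY source must be an s3:// location")
    if len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"DELIMITER '{delimiter}';"
    )


statement = build_copy_statement(
    "emr_results",
    "s3://example-bucket/emr-output/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

Because all files under the prefix are loaded by the compute nodes in parallel, splitting EMR output into several files usually suits this step better than one large file.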

Page 31:

JDBC/ODBC

Connect using drivers from PostgreSQL.org

Amazon Redshift

Page 32:
Page 33:

2 years old all over the world

1900 products 200 allow BYOL

BI tools

Page 34:

Running MongoDB on Amazon EC2

MongoDB on AWS Architecture Whitepaper

AWS CloudFormation JSON Templates

Can easily launch a multi-node replica set

Keep JSON templates in source control

https://mongodb-documentation.readthedocs.org/en/latest/ecosystem/tutorial/automate-deployment-with-cloudformation.html

AMI in AWS Marketplace

No extra cost

Page 35:

Running Cloudera EDH on Amazon EC2

Cloudera on AWS Product Brief

Cloudera Enterprise Data Hub on AWS

Deploy via Cloudera Director

Manage via Cloudera Manager

http://aws.amazon.com/about-aws/whats-new/2014/10/15/clouderas-enterprise-data-hub-edh-on-aws-quick-start/

AWS CloudFormation JSON Templates

Cloudera Enterprise Reference Architecture on AWS

Page 36:
Page 37:

[Diagram: connecting the corporate data center to the AWS cloud]

Corporate Data center: logs / files, Source DBs

Connectivity and transfer: VPN Connection, AWS Direct Connect, S3 Multipart Upload, AWS Import/Export, Remote Loading using SSH

AWS Cloud: Amazon S3, Amazon RDS, Amazon Glacier, Amazon Kinesis, Amazon DynamoDB, Amazon Redshift, Amazon Elastic MapReduce

Amazon EC2 or On-Premise

Page 38:

Corporate data center

DB

Data Warehouse Extracts

Amazon Redshift

PostgreSQL/ODBC/JDBC

Social media

Amazon EMR/Spark/R/Mahout/Giraph

Sqoop

Hive/Shark - ODBC/JDBC

AWS cloud

Log files and unstructured data

Hive

Amazon DynamoDB

RS COPY

AWS Data Pipeline

Amazon SWF

Corporate data center

Visualization and analysis (Tableau, Jaspersoft, etc.)

ODBC/JDBC

AWS cloud

Visualization and analysis (Tableau, Jaspersoft, etc.)

Presentation tools

Amazon S3

Gnip Data Collector

Amazon Kinesis

Page 39:

Corporate data center

ODBC/JDBC

AWS cloud

Corporate data center

DB

Data Warehouse Extracts

Amazon Redshift

PostgreSQL/ODBC/JDBC

Social media

Hive/Shark - ODBC/JDBC

AWS cloud

Log files and unstructured data

AWS Data Pipeline

Amazon SWF

Amazon S3

Gnip Data Collector

Amazon Kinesis

Cloudera EDH on Amazon EC2/Spark/R/Mahout/Giraph

Sqoop

MongoDB on AWS

Visualization and analysis (Tableau, Jaspersoft, etc.)

Presentation tools

Page 40:

Corporate data center

DB

Data Warehouse Extracts

PostgreSQL/ODBC/JDBC

Social media

Sqoop

Hive/Shark - ODBC/JDBC

AWS cloud

Log files and unstructured data

Amazon SWF

Corporate data center

JSON

AWS cloud

Presentation tools

Amazon S3

Gnip Data Collector

Amazon Kinesis

MongoDB on AWS

Presentation tools

Amazon EMR/Spark/R/Mahout/Giraph

AWS Data Pipeline

Page 41:

exponentially

removes Big Data constraints (three ‘v’)

Hadoop in the cloud

real-time data analytics

agile Big Data platform

Page 42:
Page 43:

ICAO Headquarters ICAO Regional Office

Page 44:

cloud

cloud

in-house

in-house

synced cloud

Page 45:

Data

Basic UI: Create, Read, Update, Delete

Data

Fancy UI: Read, Metrics

Page 46:
Page 47:

Collect Map Reduce Publish

Key Priority

Page 48:
Page 49:

Amazon EC2

Use Linux crontab to schedule

tr -d "\n" | tr -d "\r" | sed "s#<Accident>#\n<Accident>#g"

Make one XML element per line for Amazon EMR

Amazon S3
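
For readers less familiar with `tr` and `sed`, the same transformation can be sketched in Python: remove every line break, then start a new line before each `<Accident>` opening tag so that each record sits on its own line and can be split safely across mappers. The function name and sample document are hypothetical; only the `<Accident>` tag comes from the slide.

```python
def one_element_per_line(xml_text, tag="Accident"):
    """Flatten the document, then break before each opening <tag>."""
    flat = xml_text.replace("\n", "").replace("\r", "")  # tr -d "\n" | tr -d "\r"
    return flat.replace(f"<{tag}>", f"\n<{tag}>")        # sed "s#<tag>#\n<tag>#g"


sample = (
    "<Root>\n"
    "<Accident>\n<Id>1</Id>\n</Accident>\r\n"
    "<Accident><Id>2</Id></Accident></Root>"
)
flattened = one_element_per_line(sample)
```

Like the shell pipeline, this matches only the exact opening tag, so elements written with attributes (for example `<Accident id="1">`) would not be split.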

Page 50:

--put /home/ec2-user/key/newtest.pem --to /home/hadoop

Put SSH key to hadoop if you need to remote sh

Page 51:

s3://elasticmapreduce/libs/script-runner/script-runner.jar

Move the results from Amazon S3 to somewhere else
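
The script-runner JAR on this slide runs an arbitrary script as a cluster step, which is one way to move results after a job finishes. The sketch below builds such a step definition in Python in the shape accepted by the EMR AddJobFlowSteps API. The step name, the script location, and its arguments are hypothetical; only the JAR path comes from the slide.

```python
SCRIPT_RUNNER = "s3://elasticmapreduce/libs/script-runner/script-runner.jar"


def script_step(name, script_uri, *script_args):
    """Describe a step that runs a script stored in S3 via script-runner."""
    if not script_uri.startswith("s3://"):
        raise ValueError("script-runner expects the script at an s3:// location")
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",   # keep the cluster running on failure
        "HadoopJarStep": {
            "Jar": SCRIPT_RUNNER,
            # First argument is the script; the rest are passed to it.
            "Args": [script_uri, *script_args],
        },
    }


step = script_step(
    "move-results",
    "s3://example-bucket/scripts/move_results.sh",
    "s3://example-bucket/emr-output/",
)
```

A step like this would be submitted alongside the processing steps, so it runs on the cluster once the earlier steps complete.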

Page 52:

Amazon Elastic MapReduce

Amazon S3

Page 53:

treat

Amazon Elastic MapReduce

Amazon S3

Page 54:

Amazon EMR Amazon S3 Amazon EC2

Page 55:

Learn from AWS big data experts

blogs.aws.amazon.com/bigdata

BDT205: Your First Big Data Application on AWS

BDT403: Netflix’s Next Generation Big Data Platform

BDT305: Lessons Learned and Best Practices for Running Hadoop on AWS

Page 56:

Please give us your feedback on this session.

Complete session evaluations and earn re:Invent swag.

http://bit.ly/awsevals