We will explore the strengths and limitations of Hadoop for analyzing large data sets and review the growing ecosystem of tools for augmenting, extending, or replacing Hadoop MapReduce. We will introduce the Amazon Elastic MapReduce (EMR) platform as the big data foundation for Hadoop and beyond, with specific examples of running machine learning (Mahout), graph analytics (Giraph), and statistical analysis (R) on EMR. We will also discuss big data analytics and visualization of results with Amazon Redshift plus third-party business intelligence tools, as well as a typical end-to-end Big Data workflow on AWS. We will conclude with real-world examples from ICAO of Big Data analytics for aviation safety data on AWS. The integrated Safety Trend Analysis and Reporting System (iSTARS) is a web-based system linking a collection of safety datasets and related web applications to perform online safety and risk analysis. It uses AWS EC2, S3, EMR, and related partner tools for continuous data aggregation and filtering.
• Scale to infinity – Big Data constraints
• Strengths and limitations – the Hadoop ecosystem
• Real-time analytics – Big Data partner solutions
• Workflow automation
Building the Square Kilometer Array (SKA) - the Biggest Radio Telescope
SKA will process as much data every day as the world currently produces in a year
Using AWS and crowd-sourced CPUs to analyze 400-500 galaxies simultaneously
Mobile / Cable Telecom
Oil & Gas Industrial
Manufacturing
Retail/Consumer Entertainment
Hospitality
Life Sciences Scientific
Exploration
Financial Services
Publishing Media
Advertising
Online Media Social Network
Gaming
Unstructured data growth is explosive, with estimates of compound annual growth rate (CAGR) at 62% (Source: IDC)
[Figure: the growing gap between generated data and data available for analysis, 1990–2020. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"]
Remove constraints: 100 instances × 1 hour, or 1,000 instances × 1 hour

No upfront capital + On-demand services + Elastic and scalable + Pay for what you use = AWS removes constraints
Remove Constraints
Big Data Constraints
• Volume: massive datasets
• Variety: requiring new tools
• Velocity: iterative, experimental data manipulation and analysis
• Time to results: more critical than absolute performance

AWS Cloud Computing
• Virtually unlimited resources
• Variety of compute solutions
• Iterative, experimental usage/deployment of infrastructure
• Get faster results with effective parallel autonomous projects
One tool to
rule them all
Foundation Services: Compute (VMs, auto-scaling, and load balancing), Storage (object, block, and archive), Networking, Security & Access Control
Infrastructure: Regions, Availability Zones, CDN and points of presence
Platform Services:
• Databases: relational, NoSQL, caching
• Analytics: Hadoop, real-time, data warehouse, data workflows
• App Services: queuing, orchestration, app streaming, transcoding, search
• Deployment & Management: containers, dev/ops tools, resource templates, usage tracking, monitoring and logs
• Mobile Services: identity, sync, mobile analytics, notifications
Enterprise Applications: virtual desktops, collaboration and sharing
Courtesy: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
[Diagram: EMR cluster – MapReduce on YARN over HDFS, with Amazon S3 for input and output]
Choose: Hadoop distribution, number of nodes, node types, custom configs, Hive/Pig/etc.
Getting Started: http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-overview.html
1. Put the data into Amazon S3
2. Launch the cluster using the console, CLI, SDK, or API
3. Get the output from Amazon S3
You can easily add and remove nodes. You can also store everything in HDFS.
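The three steps above can be sketched with the AWS CLI. The bucket and cluster names here are hypothetical, and a small wrapper echoes each command instead of executing it unless RUN=1 is set, so the sketch can be read (and dry-run) without an AWS account:

```shell
#!/bin/sh
# Dry-run sketch of the S3 -> EMR -> S3 workflow with the AWS CLI.
# BUCKET and the cluster settings are hypothetical placeholders.
BUCKET="s3://my-demo-bucket"

# Echo each command instead of executing it unless RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi
}

# 1. Put the data into Amazon S3
run aws s3 cp input.txt "$BUCKET/input/"

# 2. Launch the cluster: AMI version, node count/type, applications
run aws emr create-cluster --name "demo-cluster" --ami-version 3.3 \
  --instance-type m1.large --instance-count 3 \
  --applications Name=Hive Name=Pig

# 3. Get the output from Amazon S3
run aws s3 cp "$BUCKET/output/" ./results --recursive
```

Resizing an existing cluster (adding or removing nodes) maps to `aws emr modify-instance-groups` on the same cluster.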
References:
http://aws.amazon.com/elasticmapreduce/getting-started/
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html
• Hadoop 2.4.0
• Hive, Pig, HBase, Impala, Ganglia
• Encryption
• Consistent view on every cluster node
Basic statistics are suitable for Hadoop.
Some other Big Data problems are not suitable for Hadoop:
• dependencies between records
• splits that are interrelated
• access to data across splits
• iterative computations
Courtesy: http://www.amazon.com/Big-Data-Analytics-Beyond-Hadoop/dp/0133837947/
• Accumulo – cell-based access control NoSQL
• Avro – data serialization system
• Cascading – alternative language APIs on MR
• Cassandra – multi-master NoSQL DB
• Chukwa – data collection system at scale
• Flume – collecting, aggregating, moving logs
• Giraph – iterative graph processing system
• HBase – large table NoSQL DB
• HDFS – distributed file system
• Hive – SQL on MapReduce Data Warehouse
• Mahout – scalable machine learning library
• MapReduce – parallel processing on YARN
• Nutch – web crawler software
• Pig – high-level scripting on MapReduce
• R - statistical computing and graphics
• Spark – general compute engine on YARN
• Sqoop – transferring data to/from RDBMS
• Tez – data-flow programming on YARN
• Thrift – build scalable cross-language services
• ZooKeeper – high-performance coordination
Courtesy: http://www.apache.org/
R: scripting, statistical analysis, a mixture of paradigms – but single-machine, single-thread.
Hadoop offers a path to scale R computation to distributed systems.
Courtesy: http://www.r-project.org/
http://www.amazon.com/Learning-R-Richard-Cotton/dp/1449357105/
R on every node via Hadoop Streaming
Revolution Analytics RHadoop: rmr (mapreduce()), rhdfs, rhbase
RStudio
References:
http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UseCase_Streaming.html
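Hadoop Streaming, referenced above, runs any executable (R scripts included) as the mapper and reducer over stdin/stdout. A minimal local stand-in for a streaming word count, using shell pipelines with `sort` playing the role of the shuffle phase:

```shell
#!/bin/sh
# Local simulation of a Hadoop Streaming job: mapper and reducer are
# ordinary pipelines over stdin/stdout; sort stands in for the shuffle.
printf 'big data\nbig cloud\n' > input.txt

# Mapper: emit one "word<TAB>1" pair per word
tr ' ' '\n' < input.txt | awk '{print $1 "\t" 1}' > mapped.txt

# Shuffle + reduce: group identical keys and sum their counts
sort mapped.txt | uniq -c | awk '{print $2 "\t" $1}' > reduced.txt

cat reduced.txt   # big:2, cloud:1, data:1
```

On EMR the same mapper and reducer scripts would be passed to a streaming step, with S3 paths for input and output.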
Mahout: scalable machine learning library
• Collaborative filtering (recommender engines), e.g. for movies, books, etc., based on comparing user preferences
• Clustering (unsupervised learning), e.g. identifying groupings of related news stories based on input data properties
• Classification (supervised learning, or predictive analytics), e.g. spam filtering based on training spam data
Courtesy: http://mahout.apache.org
References:
http://mahout.apache.org/users/classification/twenty-newsgroups.html
http://blogs.aws.amazon.com/bigdata/post/Tx1TDK3HHBD4EZL/Building-a-Recommender-with-Apache-Mahout-on-Amazon-Elastic-MapReduce-EMR
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html
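As a concrete sketch of the collaborative-filtering case: Mahout's item-based recommender job consumes `userID,itemID,preference` triples. The ratings below are fabricated sample data, and the `mahout` command is echoed rather than executed, since it needs a Mahout install (e.g. on an EMR cluster):

```shell
#!/bin/sh
# Sketch of input preparation for Mahout's item-based recommender.
# The ratings are fabricated sample data: userID,itemID,preference.
cat > ratings.csv <<'EOF'
1,101,5.0
1,102,3.0
2,101,2.0
2,103,4.0
EOF

# On a cluster with Mahout installed, a run would look roughly like:
echo mahout recommenditembased --input ratings.csv --output recs \
     --similarityClassname SIMILARITY_COOCCURRENCE
```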
Giraph:
• Developed by Yahoo! based on Google Pregel (PageRank)
• Customized by Facebook to scale to the full friendship graph (~1B vertices, ~100B edges)
• Single vertex-centric API
• Bulk Synchronous Parallel (BSP) machine
• ZooKeeper-enforced atomic barrier
• Iterations performed in memory
• Runs in mappers, or natively on YARN
Courtesy: http://giraph.apache.org
http://giraph.apache.org/pagerank.html
https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
[Figure: Single Source Shortest Path example – values sent as messages (blue), BSP superstep vertex updates (red)]
Running Giraph on EMR:
• configure-hadoop bootstrap action
• Apache ZooKeeper
• Giraph source: http://git-wip-us.apache.org/repos/asf/giraph.git
• Build with Maven 3 into a JAR file (the Giraph jar)
References:
http://giraph.apache.org/apidocs/org/apache/giraph/examples/SimplePageRankComputation.html
http://giraph.apache.org/quick_start.html
http://giraph.apache.org/build.html
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
Getting Started: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html
Amazon Redshift architecture:
Leader node
• SQL clients/BI tools connect via a PostgreSQL endpoint (JDBC/ODBC)
• Stores metadata, coordinates queries
Compute nodes (128 GB RAM, 16 TB disk, 16 cores each; 10 GigE HPC interconnect)
• Local, columnar storage
• Execute queries in parallel
• Amazon S3 for ingestion, backup, and restore
• Integration with Amazon DynamoDB, EMR, Kinesis
JDBC/ODBC: connect using drivers from PostgreSQL.org
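Because the leader node speaks the PostgreSQL wire protocol, standard PostgreSQL clients work. A connection sketch with `psql`: the hostname, database, and user are placeholders, while 5439 is Redshift's default port. The command is printed rather than executed, since it needs a live cluster:

```shell
#!/bin/sh
# Connecting to the Redshift leader node's PostgreSQL endpoint.
# HOST, database name, and user are placeholders; 5439 is the default port.
HOST="examplecluster.abc123.us-east-1.redshift.amazonaws.com"
CMD="psql -h $HOST -p 5439 -d dev -U masteruser"
echo "$CMD"
```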
Amazon Redshift: 2 years old, used all over the world
1,900 products, 200 of which allow BYOL – BI tools
MongoDB on AWS (Architecture Whitepaper): Running MongoDB on Amazon EC2
• Can easily launch a multi-node replica set
• Keep JSON templates in source control
• AWS CloudFormation JSON templates: https://mongodb-documentation.readthedocs.org/en/latest/ecosystem/tutorial/automate-deployment-with-cloudformation.html
• AMI in AWS Marketplace, no extra cost
Running Cloudera EDH on Amazon EC2 (Cloudera on AWS Product Brief)
• Cloudera Enterprise Data Hub on AWS
• Deploy via Cloudera Director, manage via Cloudera Manager
• AWS CloudFormation JSON templates
• Cloudera Enterprise Reference Architecture on AWS
http://aws.amazon.com/about-aws/whats-new/2014/10/15/clouderas-enterprise-data-hub-edh-on-aws-quick-start/
[Diagram: moving data from the corporate data center to the AWS cloud – VPN Connection or AWS Direct Connect; logs/files via S3 Multipart Upload to Amazon S3; AWS Import/Export; source DBs to Amazon RDS; Amazon Glacier; Amazon Kinesis; Amazon DynamoDB; Amazon Redshift; remote loading over SSH into Amazon Elastic MapReduce from Amazon EC2 or on-premises hosts]
[Diagram: end-to-end Big Data workflow on AWS – the corporate data center DB sends data warehouse extracts to Amazon Redshift (PostgreSQL/ODBC/JDBC); social media arrives via the Gnip Data Collector and Amazon Kinesis; log files and unstructured data land in Amazon S3; Amazon EMR/Spark/R/Mahout/Giraph processes the data with Sqoop, Hive (Redshift COPY), and Amazon DynamoDB; AWS Data Pipeline and Amazon SWF orchestrate the workflow; visualization and analysis (Tableau, Jaspersoft, etc.) and presentation tools connect over ODBC/JDBC (Hive/Shark)]
[Diagram: the same workflow with Cloudera EDH on Amazon EC2/Spark/R/Mahout/Giraph in place of Amazon EMR, plus MongoDB on AWS – data warehouse extracts into Amazon Redshift (PostgreSQL/ODBC/JDBC), social media via the Gnip Data Collector and Amazon Kinesis, log files and unstructured data in Amazon S3, Sqoop and Hive/Shark over ODBC/JDBC, orchestration via AWS Data Pipeline and Amazon SWF, visualization and analysis (Tableau, Jaspersoft, etc.) and presentation tools]
[Diagram: workflow variant with MongoDB on AWS storing JSON documents alongside Amazon EMR/Spark/R/Mahout/Giraph – data warehouse extracts (PostgreSQL/ODBC/JDBC), social media via the Gnip Data Collector and Amazon Kinesis into Amazon S3, log files and unstructured data, Sqoop, Hive/Shark over ODBC/JDBC, Amazon SWF and AWS Data Pipeline orchestration, presentation tools]
Summary:
• Data grows exponentially
• AWS removes the Big Data constraints (the three "V"s: volume, variety, velocity)
• Hadoop in the cloud
• Real-time data analytics
• An agile Big Data platform
[Diagram: iSTARS deployment – ICAO Headquarters and ICAO Regional Offices, with in-house and cloud components kept in sync; a basic UI with create/read/update/delete access to the data, and a fancy read-only UI for metrics; workflow: Collect → Map/Reduce → Publish, ordered by key priority]
Runs on Amazon EC2; use Linux crontab to schedule. Make one XML element per line for Amazon EMR, then upload to Amazon S3:
tr -d "\n" | tr -d "\r" | sed "s#<Accident>#\n<Accident>#g"
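The one-element-per-line transform above can be tried locally on a small fabricated sample; GNU sed is assumed, since it interprets \n in the replacement text:

```shell
#!/bin/sh
# Demo of the one-XML-element-per-line transform on sample data
# (the <Accident> records are made up; GNU sed is assumed).
printf '<Accidents><Accident><Id>1</Id></Accident>\n<Accident><Id>2</Id></Accident></Accidents>\n' > sample.xml

# Strip all line breaks, then start a new line at each <Accident> record
tr -d "\n" < sample.xml | tr -d "\r" \
  | sed "s#<Accident>#\n<Accident>#g" > lines.xml

grep -c "^<Accident>" lines.xml   # one record per line: prints 2
```

With each record on its own line, EMR's default line-oriented input split can hand whole records to mappers.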
--put /home/ec2-user/key/newtest.pem --to /home/hadoop
Put the SSH key on the Hadoop master if you need to remote shell in.
s3://elasticmapreduce/libs/script-runner/script-runner.jar
Use script-runner to move the results from Amazon S3 to somewhere else.
[Diagram: the pipeline combines Amazon EC2, Amazon EMR, and Amazon S3]
Learn from AWS big data experts
blogs.aws.amazon.com/bigdata
BDT205: Your First Big Data Application on AWS
BDT403: Netflix's Next Generation Big Data Platform
BDT305: Lessons Learned and Best Practices for Running Hadoop on AWS
Please give us your feedback on this session.
Complete session evaluations and earn re:Invent swag.
http://bit.ly/awsevals