Introducing big data and analytics with Hadoop, HBase and Amazon Elastic MapReduce.
Thank you.
Introducing Hadoop
HBase on AWS
Cost optimization
Data for competitive advantage.
Customer segmentation, financial modeling, system analysis, line-of-sight, business intelligence...
Using data
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Cost of data generation is falling.
lower cost, increased throughput
HIGHLY CONSTRAINED
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Very high barrier to turning data into information.
Move from a data generation challenge to an analytics challenge.
Enter the AWS Cloud.
Remove the constraints.
Enable data-driven innovation.
Move to a distributed data approach.
Maturation of two things.
Software for distributed storage and analysis
Infrastructure for distributed storage and analysis
Frameworks for data-intensive workloads.
Software
Distributed by design.
Platform for data-intensive workloads.
Infrastructure
Distributed by design.
Support the data life cycle.
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Lower the barrier to entry.
Accelerate time to market and increase agility.
Enable new business opportunities.
Washington Post
NASA
“AWS enables Pfizer to explore difficult or deep scientific questions in a timely, scalable manner and helps us make better decisions more quickly.”
Michael Miller, Pfizer
Introducing Hadoop
Maturation of two things.
Software for distributed storage and analysis
Infrastructure for distributed storage and analysis
Apache Hadoop
Software for distributed storage and analysis
Implements the map/reduce pattern
Focus on your data
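The map/reduce pattern named above can be sketched in a few lines. This is a single-process illustration of the idea (map, group by key, reduce), not Hadoop's implementation; the function names are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (a framework does this step)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    return reduce_phase(shuffle(map_phase(lines)))
```

In a real cluster the map and reduce functions run on many nodes in parallel and the framework handles the grouping and data movement, which is why the user can focus on the data.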
Built for uncertainty
Hadoop provides tools to navigate data
Allows discovery
Query flexibility at scale
Built for flexibility
Java native
Executes code in any language
Just a distribution mechanism
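"Executes code in any language" usually refers to Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines, and the reducer receives its input sorted by key. A minimal sketch of that contract follows; the input layout (a cookie ID in the first tab-separated field) is a hypothetical example, not taken from the slides.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'cookie_id<TAB>1' for each input record.

    Assumes (hypothetically) that the first tab-separated field of each
    record is a cookie ID. Blank records are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            yield f"{fields[0]}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts per key. Input must already be sorted by key,
    which is what the framework guarantees between map and reduce."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"
```

Run as streaming steps, each function would be wrapped in a small script that iterates over standard input and prints each result line; locally, `reducer(sorted(mapper(records)))` simulates the same flow.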
Rich ecosystem
Diverse tools
Machine learning, recommendations, predictive analytics, segmentation, real time analysis
Lots of innovation
But...
A very big project
500k+ lines of code
Challenging to configure and optimize
Undifferentiated heavy lifting
Amazon Elastic MapReduce
Web service for data processing
Hosted Hadoop
Configured and optimized
Amazon Elastic MapReduce
Job flows
Elastic platform
Maintain clusters or run once and terminate
Debugging tools
Input data: S3
Elastic MapReduce
Code
Name node
Elastic cluster
HDFS
Queries + BI via JDBC, Pig, Hive
Output: S3 + SimpleDB
Hadoop all the way down
Amazon Hadoop distribution
HDFS
Streaming interface
Hive, Pig, Mahout, Spark, Shark
Data integration
Optimized and integrated into AWS environment
Reads and writes to S3
Analytics on DynamoDB data
Can process data from any source: Cassandra, MongoDB, CouchDB, Amazon RDS
Data movement
Multi-part upload
Import/Export
AWS Direct Connect
Aspera
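Multi-part upload moves a large object to S3 as independent parts that can be sent in parallel and retried individually. The sketch below only plans the byte ranges and uploads nothing. The 5 MiB minimum part size and 10,000-part ceiling are S3's published limits; the function name and the 8 MiB default are assumptions for illustration.

```python
import math
from typing import List, Tuple

MIN_PART_BYTES = 5 * 1024 ** 2   # S3 minimum size for every part except the last
MAX_PARTS = 10_000               # S3 maximum number of parts per upload


def plan_parts(total_bytes: int, part_bytes: int = 8 * 1024 ** 2) -> List[Tuple[int, int, int]]:
    """Return (part_number, offset, length) tuples covering the whole object."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Grow the part size if needed so the plan stays within the part limit.
    part_bytes = max(part_bytes, MIN_PART_BYTES, math.ceil(total_bytes / MAX_PARTS))
    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < total_bytes:
        length = min(part_bytes, total_bytes - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts
```

Each planned range could then be handed to a worker that uploads that slice, so a failed part is resent without restarting the whole transfer.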
Cluster scalability
Resize running job flows
Add capacity for shorter runs
Remove capacity during off-peak hours
Balance scale and cost
Cluster scalability
Time remaining as capacity is added: 14 hours, 7 hours, 3 hours
Steady state, large batch task, steady state
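The remaining-time figures on the cluster scalability slides (14, 7 and 3 hours) can be reasoned about with simple arithmetic. The sketch below assumes perfectly linear scaling, which real jobs rarely achieve, and a hypothetical starting cluster of 10 nodes; neither assumption comes from the slides.

```python
import math


def remaining_hours(node_hours_left: float, nodes: int) -> float:
    """Hours left if the remaining work divides evenly across all nodes."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return node_hours_left / nodes


def nodes_needed(current_nodes: int, current_hours: float, target_hours: float) -> int:
    """Nodes required to finish the same remaining work within target_hours."""
    if target_hours <= 0:
        raise ValueError("target_hours must be positive")
    work = current_nodes * current_hours  # remaining work in node-hours
    return math.ceil(work / target_hours)


# Hypothetical cluster: 10 nodes with 14 hours remaining.
to_reach_7h = nodes_needed(10, 14, 7)
to_reach_3h = nodes_needed(10, 14, 3)
```

Under these assumptions, halving the run time means doubling the cluster, and the total node-hours (and so the approximate cost) stay the same; that is the scale-versus-cost balance the slides point to.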
Cluster availability
Canonical source of data
Anyone in the engineering team
IAM integration
Monitoring
Clickstream analysis for retail
3.5 billion records
71 million unique cookies
1.7 million targeted ads
13 TB of clickstream logs
Each day
Clickstream analysis for retail
Workflow time from 2 days to 8 hours
Procurement time from 2 months to 5 minutes
$13k per month
500% increase in return on advertising spend
Months of user click-through data
Search terms
Ads displayed
Premium listing inventory
Log data stored in Amazon S3
Elastic MapReduce spins up a 200-instance Hadoop cluster
Find patterns across logs. Write results to S3.
Hadoop in the AWS Cloud
Elastic MapReduce for hosted Hadoop
Optimized, configured, ready to roll
Focus on the business benefit of data
Hadoop all the way down
Maturation of two things.
Software for distributed storage and analysis
Infrastructure for distributed storage and analysis
HBase on AWS
Vibrant ecosystem
Mahout for machine learning
Mesos for cluster management
Spark for fast analytics
HBase for unstructured data
HBase
NoSQL data store
Runs on top of HDFS
Scalable
Rapid retrieval across large datasets
Architecture
Huge, distributed map/hash
Distributed
Implements Bloom filters
Sortable
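A Bloom filter answers "might this key be present?" using a small bit array, with no false negatives and a tunable false-positive rate. A store can use it to skip files that cannot contain a requested row. The sketch below shows the general technique only; it is not HBase's implementation, and the sizes are arbitrary.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter: k hash positions per key over an m-bit array."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive k positions by salting one hash function with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )
```

A lookup that returns False can skip the underlying read entirely, which is where the speed-up on large datasets comes from.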
Column based
Columns are similar to fields
Rows are records
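The "huge, sorted map" description can be made concrete with a toy model: row keys map to columns, columns map to values, and rows are kept in key order so that range scans are cheap. This is a single-process sketch with made-up names, not the HBase client API; the `family:qualifier` column names follow HBase's naming convention but are otherwise illustrative.

```python
from bisect import bisect_left, insort
from typing import Dict, Iterator, List, Tuple


class TinyTable:
    """Toy sorted map of row key -> {column -> value}."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}
        self._keys: List[str] = []  # row keys kept in sorted order

    def put(self, row: str, column: str, value: str) -> None:
        if row not in self._rows:
            insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row: str) -> Dict[str, str]:
        """Return a copy of the row's columns, or an empty dict if absent."""
        return dict(self._rows.get(row, {}))

    def scan(self, start: str, stop: str) -> Iterator[Tuple[str, Dict[str, str]]]:
        """Yield rows with start <= key < stop, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            key = self._keys[i]
            yield key, dict(self._rows[key])
            i += 1
```

Because rows need not share the same columns, a record only stores the fields it actually has, which suits sparse or loosely structured data.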
Built for data
Built to scale across billions of rows
The more data, the better the relative performance
But...
Large, complex project
Running in production can be challenging
Distributed system
Undifferentiated heavy lifting
HBase for Elastic MapReduce
Using HBase
Social media firehose
Customer information
Usage and application logs
Hadoop analytics
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon DynamoDB
NoSQL database service
Provisioned throughput
Unlimited storage
Very easy to use
DynamoDB & Amazon EMR
SQL like queries
Query flexibility at scale
Integrate queries across datasets
Hive
NoSQL on the AWS Marketplace
CouchDB
Cassandra
MongoDB
aws.amazon.com/marketplace
Cost optimization
Lowered prices 19 times in the past six years.
On-demand
Reserved capacity
Spot market
(Chart, 100% scale: reserved capacity, then on-demand, then spot market)
$0.08 vs $0.007 (yesterday evening)
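To see what the two prices above mean for a whole cluster, a rough calculation helps. The sketch assumes both figures are prices per instance-hour (the slide does not state the unit) and borrows the 200-instance cluster and 8-hour workflow from the clickstream example purely for illustration; it is not a pricing quote.

```python
def cluster_cost(instances: int, hours: float, price_per_instance_hour: float) -> float:
    """Total cost of running a fixed-size cluster for a given number of hours."""
    return instances * hours * price_per_instance_hour


# Assumed inputs: see the caveats in the paragraph above.
INSTANCES = 200
HOURS = 8
HIGHER_PRICE = 0.08   # first figure on the slide
SPOT_PRICE = 0.007    # second figure on the slide

higher_total = cluster_cost(INSTANCES, HOURS, HIGHER_PRICE)
spot_total = cluster_cost(INSTANCES, HOURS, SPOT_PRICE)
price_ratio = HIGHER_PRICE / SPOT_PRICE

print(f"higher price: ${higher_total:.2f}, spot: ${spot_total:.2f}, ratio: {price_ratio:.1f}x")
```

Spot capacity can be reclaimed when the market price rises, so this kind of saving fits best with work that tolerates interruption.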
Reserved Instance Marketplace
Cost optimization
HBase on AWS
Introducing Hadoop
aws.amazon.com/elasticmapreduce