Making Hadoop Enterprise ready with Amazon Elastic Map/Reduce
Simone Brunozzi
Technology Evangelist, Amazon Web Services, APAC
Twitter: @simon
Blog: www.brunozzi.com

Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Amazon Elastic Map/Reduce" by Simone Brunozzi

AGENDA

• What is Elastic MapReduce
• Use Cases
• Service Features
• New Feature Announcements
• Elastic MapReduce Ecosystem

WHAT IS AMAZON ELASTIC MAPREDUCE

• Enables customers to easily, securely and cost-effectively process vast amounts of data.
  – Spin up 10s, 100s or even 1000s of instances
  – Process 10s or 100s of terabytes of data
• Hosted Hadoop framework running on the web-scale infrastructure of Amazon.

• Launch and monitor job flows
• AWS Management Console
• Command line interface
• REST API

WHY USE AMAZON ELASTIC MAPREDUCE

• Elastic MapReduce removes MUCK from Big Data processing
  – Hard to manage compute clusters
  – Hard to tune Hadoop
  – Hard to monitor running Job Flows
  – Hard to debug Hadoop jobs
  – Hadoop issues prevent smooth operation in the cloud

PROBLEMS CUSTOMERS SOLVE WITH ELASTIC MAPREDUCE

• Data mining and BI
  – Log processing, click stream analysis, similarities, advertising
• Data warehousing applications
• Bio-informatics (genome analysis)
• Financial simulation (Monte Carlo simulation)
• File processing (resize JPEGs)
• Web indexing

WEB-SCALE DATA WAREHOUSING

ELASTIC MAPREDUCE – SUPPORTED CONFIGURATIONS

• Hadoop 0.20
• Pig 0.6
• Hive 0.5
• Cascading 1.1

• Hadoop 0.18
• Pig 0.3
• Hive 0.4
• Cascading 1.1

ELASTIC MAPREDUCE – HIVE FEATURES

• Apache Hive
  – Batch and interactive mode
  – Support Hive Steps
  – Integration with Elastic MapReduce Client and Management Console
  – Load table partitions automatically to/from Amazon S3
  – Optimized data writes to Amazon S3
  – Reference resources such as streaming scripts located on Amazon S3
  – Specify an off-instance metadata store
  – Support variables defined directly in Hive script
  – Supports JDBC and ODBC connections

ELASTIC MAPREDUCE – PIG FEATURES

• Apache Pig
  – Batch and interactive mode
  – Support Pig Steps
  – Integration with Elastic MapReduce Client and Management Console
  – Concurrent access to multiple file systems (HDFS, Amazon S3)
  – Reference resources in Amazon S3 directly from Pig script
  – Several User Defined Functions in Piggy Bank

AMAZON ELASTIC MAPREDUCE FOR ENTERPRISE

• Enterprise customers need more flexibility
  – Configuring clusters
  – Running clusters
  – Paying for clusters
• Enterprise customers need more tools
  – Application development
  – Data analytics
• Enterprise customers need support options
  – Forum support is not enough

Amazon Elastic MapReduce features

• Bootstrap actions
  – Run arbitrary scripts before job flow begins
  – Run on all nodes before data processing begins
  – Used for
    • Hadoop configuration (site-conf, Hadoop-conf, etc.)
    • Cluster configuration (memory, swap, etc.)
    • Application/package installation (apt-get install r-base)
  – Several pre-defined bootstrap actions available
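
As a rough illustration of how bootstrap actions like these might be declared today, the sketch below builds a `BootstrapActions` fragment in the shape used by the present-day boto3 `run_job_flow` request (boto3 postdates this talk, so this is not the interface shown in the deck). The bucket, script paths and arguments are hypothetical placeholders, and nothing is sent to AWS.

```python
def bootstrap_action(name, script_path, args=()):
    """Build one bootstrap-action entry (Name + ScriptBootstrapAction)."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,  # script stored in Amazon S3
            "Args": list(args),   # arguments passed to the script
        },
    }


# Hypothetical scripts: the bucket and file names are placeholders.
BOOTSTRAP_ACTIONS = [
    # e.g. a script that runs "sudo apt-get install -y r-base" on every node
    bootstrap_action("Install R", "s3://example-bucket/bootstrap/install-r.sh"),
    # e.g. a script that adjusts swap space before Hadoop starts
    bootstrap_action(
        "Configure swap",
        "s3://example-bucket/bootstrap/configure-swap.sh",
        ["--size-mb", "1024"],
    ),
]

for action in BOOTSTRAP_ACTIONS:
    print(action["Name"], "->", action["ScriptBootstrapAction"]["Path"])
```

A list built this way would be passed as the `BootstrapActions` argument when the job flow is started.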

Amazon Elastic MapReduce - new features

• Preannounce: Expand running clusters
  – Increase number of nodes in a running cluster
    • Increase processing speed
    • Increase HDFS size

PREANNOUNCE – EXPAND/SHRINK CLUSTERS

• Use Case: Increase speed of running job flows
  – Speed up job flow execution in response to changing requirements
  – Dynamically balance cost versus performance without restarting a job

[Figure: one job flow at three cluster sizes. Allocate 4 instances: time remaining 14 hours. Expand to 9 instances: time remaining 7 hours. Expand to 25 instances: time remaining 3 hours.]
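
The cost-versus-performance trade-off in this slide's figure can be made concrete with a little arithmetic. The sketch below assumes the figure pairs 4 instances with 14 hours remaining, 9 with 7 hours and 25 with 3 hours (the pairing is inferred from the order of the labels), and compares time saved against total instance-hours consumed.

```python
# (instances, hours remaining) as read off the slide's figure; pairing assumed.
SCENARIOS = [(4, 14), (9, 7), (25, 3)]


def instance_hours(instances, hours):
    """Total capacity consumed: what an hourly per-instance bill scales with."""
    return instances * hours


def speedup(baseline_hours, hours):
    """How many times faster than the baseline the job flow finishes."""
    return baseline_hours / hours


baseline_hours = SCENARIOS[0][1]
for instances, hours in SCENARIOS:
    print(
        f"{instances:>2} instances: {hours:>2} h remaining, "
        f"{instance_hours(instances, hours)} instance-hours, "
        f"{speedup(baseline_hours, hours):.1f}x faster"
    )
```

Under this reading, larger clusters finish sooner but consume more instance-hours in total (56, 63 and 75), which is the balance the use case refers to.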

Amazon Elastic MapReduce - new features

• Shrink running clusters
  – Decrease number of nodes in a running job flow
    • Different capacity requirements from step to step
    • Automatically regulate capacity between steps

EXPAND/SHRINK CLUSTERS

• Use Case: Agile Data Warehouse Cluster
  – Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight)
  – Leverage flexibility to reduce costs and increase cluster utilization

[Figure: a data warehouse cluster over time. Allocate 9 instances (steady state), expand to 25 instances (batch processing), shrink to 9 instances (steady state).]

AMAZON ELASTIC MAPREDUCE PRICE

WHAT IS A SPOT INSTANCE?

• Way to purchase & consume EC2 instances based on compute value
• Reduce your computing costs
  – Bid for unused EC2 capacity
  – Control your costs
• Differences from On-Demand Instances:
  – Request: maximum price bid
  – Spot Price: what you pay
  – Termination
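
The three differences can be illustrated with a toy simulation. This is a simplified sketch of the bidding model, not Amazon's billing logic: it assumes an instance keeps running while the hourly Spot Price is at or below the bid, is charged the Spot Price rather than the bid, and is terminated once the price rises above the bid. The price history is made up for illustration.

```python
def simulate_spot(bid, hourly_spot_prices):
    """Return (hours_run, total_cost) for one Spot instance.

    Simplified model: whole hours only, charged at that hour's Spot Price,
    terminated at the first hour whose Spot Price exceeds the bid.
    """
    hours_run = 0
    total_cost = 0.0
    for price in hourly_spot_prices:
        if price > bid:
            break  # termination: the Spot Price exceeded the maximum bid
        hours_run += 1
        total_cost += price  # you pay the Spot Price, not your bid
    return hours_run, total_cost


# Hypothetical hourly Spot Prices in $/hour (not real market data).
HYPOTHETICAL_PRICES = [0.25, 0.25, 0.30, 0.55, 0.25]

hours, cost = simulate_spot(bid=0.40, hourly_spot_prices=HYPOTHETICAL_PRICES)
print(f"ran {hours} h, paid ${cost:.2f}")
```

With a $0.40 bid, the instance in this made-up history runs for three hours and is terminated in the fourth, when the price reaches $0.55.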

M2.XLARGE INSTANCE PRICING HISTORY

Amazon EC2 On-Demand price for the same instance is $0.50

Amazon Elastic MapReduce – new feature

• Spot pricing support for Elastic MapReduce job flows
  – Specify the price you want to pay for instances
  – Elastic MapReduce takes care of
    • Provisioning
    • Node addition and removal to/from the cluster
  – Can mix On-Demand and Spot instances in the same cluster
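
To show what mixing On-Demand and Spot capacity could look like in a request, the sketch below builds two instance-group entries in the shape used by the present-day boto3 `run_job_flow` call (which postdates this talk). The instance counts, the m2.xlarge type and the $0.25 figure are borrowed from the deck's Spot use case; the group names, the CORE/TASK role split and the rule that a Spot group must carry an explicit bid are assumptions of this sketch, and the master group is omitted.

```python
def instance_group(name, role, market, instance_type, count, bid_price=None):
    """Build one instance-group entry for an Elastic MapReduce cluster request."""
    group = {
        "Name": name,
        "InstanceRole": role,          # "MASTER", "CORE" or "TASK"
        "Market": market,              # "ON_DEMAND" or "SPOT"
        "InstanceType": instance_type,
        "InstanceCount": count,
    }
    if market == "SPOT":
        if bid_price is None:
            # Sketch-level rule: insist on an explicit maximum price.
            raise ValueError("Spot groups need a bid price in this sketch")
        group["BidPrice"] = f"{bid_price:.2f}"  # the bid is sent as a string
    return group


INSTANCE_GROUPS = [
    instance_group("On-Demand nodes", "CORE", "ON_DEMAND", "m2.xlarge", 4),
    instance_group("Spot nodes", "TASK", "SPOT", "m2.xlarge", 5, bid_price=0.25),
]

for group in INSTANCE_GROUPS:
    print(group["Name"], group["Market"], group["InstanceCount"])
```

Putting Spot capacity in a separate group keeps the On-Demand nodes unaffected if the Spot nodes are reclaimed.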

Integration with EC2 Spot

• Use Case: Manage cost of running job flows
  – Start with 4 On-Demand instances of type m2.xlarge
  – Expand the cluster with 5 Spot nodes

[Figure: a job flow allocated 4 instances has 14 hours remaining; expanded to 9 instances, it has 7 hours remaining.]

Cost without Spot: 4 instances * 14 hrs * $0.50 = $28

Cost with Spot:
4 instances * 7 hrs * $0.50 = $14
+ 5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75

Savings: ~19%
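
The comparison can be recomputed from the slide's stated inputs: 4 On-Demand instances at $0.50 per instance-hour, 5 Spot instances at $0.25, and a run time of 14 hours without Spot versus 7 hours with it. The sketch below assumes the Spot Price stays constant at $0.25 for the whole run, which a real Spot market does not guarantee. Working from these inputs, the On-Demand portion of the mixed cluster comes to 4 * 7 * $0.50 = $14.

```python
ON_DEMAND_PRICE = 0.50  # $ per instance-hour, from the slide
SPOT_PRICE = 0.25       # $ per instance-hour, from the slide; assumed constant


def cluster_cost(groups):
    """Sum instances * hours * hourly price over all instance groups."""
    return sum(instances * hours * price for instances, hours, price in groups)


without_spot = cluster_cost([(4, 14, ON_DEMAND_PRICE)])
with_spot = cluster_cost([(4, 7, ON_DEMAND_PRICE), (5, 7, SPOT_PRICE)])
savings = 1 - with_spot / without_spot

print(f"without Spot: ${without_spot:.2f}")
print(f"with Spot:    ${with_spot:.2f}")
print(f"savings:      {savings:.1%}")
```

The job flow also finishes in half the time, so the saving comes on top of the faster result.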

ELASTIC MAPREDUCE ECOSYSTEM

• Ecosystem is growing
  – Integrated development environments for Hadoop
  – Tools designed for data analytics
• Broad support for Amazon Elastic MapReduce

Karmasphere

• Big Data Intelligence software
• For developers and analysts to work faster and more easily
• Purpose built for all popular Hadoop distros and versions
• Tightly integrated with Elastic MapReduce (since 2009)
• Built on Karmasphere Application Framework™
  – Native Hadoop client-side platform

Karmasphere Studio

Professional Edition

• Rich graphical environment
  – Develop, debug and deploy easily
• Visualize, manipulate & diagnose
  – Jobs, clusters & file systems
• Broad and deep Elastic MapReduce support
  – Rapid development
  – Comprehensive profiling
  – Rich debugging

Analyst Edition

• SQL interface for ad hoc analysis
• Robust Hive implementation
  – Syntax checking, diagnostics, schema browser, JDBC4 compliance, multi-threaded and concurrent
• No cluster changes
  – Works over proxies and firewalls
• Integrated Hadoop monitoring

Free version from www.karmasphere.com

DATAMEER ANALYTICS SOLUTION

• Big data analytics leveraging native Hadoop
• Extreme scale and performance
• Seamless elastic scale on Amazon Elastic MapReduce
• Empowering business users
• UI driven
  – No programming, no modeling, no schema, no ETL

DATAMEER ANALYTICS SOLUTION

[Figure: diagram labeled with data sources (CRM, Web Logs, Sales, Customer Data, Excel Files, Social Media) and Amazon Elastic MapReduce.]

MICROSTRATEGY IS A GLOBAL LEADER IN BUSINESS INTELLIGENCE

• Corporate Overview
  – Founded in 1989
  – Largest independent public BI vendor (NASDAQ: MSTR)
  – Positioned in the Gartner “Leader Quadrant” for BI Platforms
  – Over 1 million business users at over 3,000 organizations
• The MicroStrategy 9 business intelligence platform enables mobile apps, dashboards, reporting and analytics with your business data
• Build once, deliver instantly and securely any time, to any device

WHAT CAN YOU DO WITH MICROSTRATEGY AND AMAZON ELASTIC MAPREDUCE?

• Deliver insights to a broader range of users.
  – End users interact with a point-and-click interface to query data without writing HiveQL or MapReduce jobs
• Use cases:
  – Mobile Apps: Floor manager accesses order details stored in Amazon Elastic MapReduce through a custom iPhone app
  – Dashboards: End user starts with a Dynamic Dashboard populated from a data mart or data warehouse. The user then drills to a detail report that executes in Amazon Elastic MapReduce.
  – Reporting: Application developer builds a parameterized HiveQL report, then schedules it to execute. Jobs execute against Amazon Elastic MapReduce and MicroStrategy sends out exception-based alerts via email to end users.
  – Analysis: Application developer populates a multidimensional cache in MicroStrategy with results of a HiveQL query. End user uses MicroStrategy’s web interface to slice-and-dice through results without going back to Hadoop.

HOW CAN I LEARN MORE?

• Try it!
  – Free MicroStrategy software is available at: http://www.microstrategy.com/freereportingsoftware
• Get more information about MicroStrategy solutions for Amazon Elastic MapReduce: http://aws.amazon.com/solutions/solution-providers/microstrategy

ELASTIC MAPREDUCE - SUPPORT
