Big Data Analytics

Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.

Peter Sirota

General Manager, Amazon Elastic MapReduce

Overview

1. Introducing Big Data

2. From data to actionable information

3. Analytics and Cloud Computing

4. The Big Data ecosystem

1. Introducing Big Data

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

The cost of data generation is falling

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Lower cost,

higher throughput

Highly

constrained

[Chart: data volume, comparing generated data with data available for analysis]

Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"

Elastic and highly scalable

+ No upfront capital expense

+ Only pay for what you use

+ Available on-demand

= Remove constraints

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Accelerated

Close the gap.

Big Data: technologies and techniques for working productively with data, at any scale.

2. From data to actionable information

“Who buys video games?”

Per day:

3.5 billion records

13 TB of click stream logs

71 million unique cookies

Results:

500% return on ad spend

17,000% reduction in procurement time

“Who is using our service?”

Identified early mobile usage

Invested heavily in mobile development

Finding signal in the noise of logs

In January 2013, 9,432,061 unique mobile devices used the Yelp mobile app.

4 million+ calls. 5 million+ directions.

Open web index.

3.4 billion records.

Available to all.

Full parse for impact of social networks

300 lines of Ruby code.

14 hours.

$100.

You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011

Tweeting about Flu

Tweeting about Food

[Chart: tweets about the price of rice compared with official food price inflation]

3. Analytics and Cloud Computing

Generation

Collection & storage: S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase

Analytics & computation: EC2 & Elastic MapReduce

Collaboration & sharing: EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift

AWS Data Pipeline

Elastic MapReduce

Managed Hadoop analytics

[Diagram: Elastic MapReduce workflow. Labels shown:]

Input data (S3, DynamoDB, Redshift)

Code

Elastic MapReduce

Name node

Elastic cluster

S3/HDFS

Queries + BI (via JDBC, Pig, Hive)

Output
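The workflow above (input data and code go in, a cluster processes them, output comes out) follows the MapReduce programming model that Hadoop implements. As a rough illustration only, here is a single-process Python sketch of the map, shuffle and reduce steps. The log format, the sample lines and all function names are assumptions made for this example; nothing here calls Hadoop or Elastic MapReduce.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and yield (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key and collect the results."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def run_job(records, mapper, reducer):
    """Run map, shuffle and reduce in sequence on one machine."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def count_status_mapper(line):
    """Hypothetical log format: '<ip> <path> <status>'. Emit (status, 1)."""
    parts = line.split()
    if len(parts) == 3:  # silently skip malformed lines
        yield parts[2], 1


def sum_reducer(key, values):
    """Add up all counts emitted for one key."""
    return sum(values)


# Invented sample input, for illustration only.
SAMPLE_LOG = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.1 /about.html 200",
    "malformed line",
]

if __name__ == "__main__":
    print(run_job(SAMPLE_LOG, count_status_mapper, sum_reducer))
```

On a real cluster the map and reduce functions run in parallel across many nodes and the shuffle moves data over the network; the sketch only shows the shape of the computation.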

1. Elastic clusters

10 hours

6 hours

Peak capacity

2. Rapid, tuned provisioning

Tedious.

Remove undifferentiated heavy lifting.

3. Hadoop all the way down

Robust ecosystem. Databases, machine learning, segmentation, clustering, analytics, metadata stores, exchange formats, and so on...

4. Agility for experimentation

Instance choice. Stay flexible on instance type & number.

5. Cost optimizations

Built for Spot. Name-your-price supercomputing.
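To make "name-your-price" concrete, here is a small Python simulation of the bidding idea: you state the most you will pay per hour, the job runs in hours when the market price is at or below that figure, and it pauses otherwise. This is a sketch under stated assumptions, not a model of actual billing: the hourly prices are invented, each running hour is charged at that hour's market price, and the work is assumed to be checkpointed so an interruption loses no progress.

```python
def simulate_spot(bid, hourly_prices, hours_needed):
    """Simulate a checkpointed job on a spot-style market.

    bid           -- maximum price per hour we are willing to pay
    hourly_prices -- hypothetical market price for each successive hour
    hours_needed  -- hours of compute the job requires

    Returns (hours_completed, total_cost, interruptions).
    """
    done = 0
    cost = 0.0
    interruptions = 0
    running = False
    for price in hourly_prices:
        if done >= hours_needed:
            break
        if price <= bid:
            # Market price is within our bid: run for this hour and pay
            # the market price (an assumption of this sketch).
            done += 1
            cost += price
            running = True
        else:
            # Outbid: count an interruption only if we were running.
            if running:
                interruptions += 1
            running = False
    return done, round(cost, 4), interruptions


if __name__ == "__main__":
    prices = [0.03, 0.04, 0.12, 0.05, 0.03]  # invented prices, USD per hour
    print(simulate_spot(bid=0.06, hourly_prices=prices, hours_needed=4))
```

A higher bid means fewer interruptions but a higher worst-case hourly cost, which is the trade-off the bidder controls.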

1. Elastic clusters

2. Rapid, tuned provisioning

3. Hadoop all the way down

4. Agility for experimentation

5. Cost optimizations

Vin Sharma vin.sharma@intel.com

Director, Product Strategy & Marketing

Big Data Software, Intel Corporation

Analysis of Data Can Transform Society

Create new business models and improve organizational processes.

Enhance scientific understanding, drive innovation, and accelerate medical cures.

Increase public safety and improve energy efficiency with smart grids.

Intel’s Vision to Democratize Big Data

Unlock Value in Silicon

Support Open Platforms

Deliver Software Value

Intel at the Intersection of Big Data

HPC: Enabling exascale computing on massive data sets

Cloud: Helping enterprises build open interoperable clouds

Open Source: Contributing code and fostering ecosystem

Intel® Technology at the Heart of the Cloud

Server

Storage

Network

Scale-Out Big Data

Compute Platform Optimization

Cost-effective performance

• Intel® Advanced Vector Extensions Technology

• Intel® Turbo Boost Technology 2.0

• Intel® Advanced Encryption Standard New Instructions Technology

Intel® Advanced Vector Extensions Technology

• Newest in a long line of processor instruction innovations

• Increases floating point operations per clock, up to 2X¹ performance

¹ Performance comparison using Linpack benchmark. See backup for configuration details.

For more legal information on performance forecasts go to http://www.intel.com/performance

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel® Turbo Boost Technology 2.0

More Performance

Higher turbo speeds maximize performance for single and multi-threaded applications

Intel® Advanced Encryption Standard New Instructions

• Processor assistance for performing AES encryption

• 7 new instructions

• Makes enabled encryption software faster and stronger

The Power of Intel® Platform Solutions: Richer user experiences

[Chart: TeraSort for 1 TB sort, from 4 HRS down to 10 MIN. Starting point: previous Intel® Xeon® Processor. Components shown: Intel® Xeon® Processor E5 2600, Solid-State Drive, 10G Ethernet, Intel® Apache Hadoop. Reductions shown: 50%, 80%, 50%, 40%.]

The Virtuous Cycle of User Experience: Cloud, Intelligent Systems, Clients

4. The Big Data Ecosystem

Data, data, everywhere... Data is stored in silos.

S3, DynamoDB, EMR, HBase on EMR, RDS, Redshift, On-premises

“How do I get my data to the cloud?”

Data mobility

Generated and stored in AWS

Inbound data transfer is free

Multipart upload to S3

Physical media

AWS Direct Connect

Regional replication of AMIs and snapshots
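For the multipart upload option in the list above, the sketch below shows how a large object might be divided into parts before upload. It relies on the documented S3 limits (parts of at least 5 MiB except the last, at most 5 GiB per part, at most 10,000 parts per upload). The function name, the default part size and the tuple layout are assumptions for illustration, and the code only plans byte ranges; it does not call AWS.

```python
MIN_PART_SIZE = 5 * 1024 * 1024          # 5 MiB minimum, except the last part
MAX_PART_SIZE = 5 * 1024 * 1024 * 1024   # 5 GiB maximum per part
MAX_PARTS = 10_000                       # maximum parts per upload


def plan_parts(total_size, part_size=8 * 1024 * 1024):
    """Return a list of (part_number, offset, length) tuples covering the object.

    The requested part size is increased when necessary so that the number
    of parts stays within MAX_PARTS.
    """
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")

    # Smallest part size that keeps the part count at or below the limit
    # (integer ceiling division).
    needed = (total_size + MAX_PARTS - 1) // MAX_PARTS
    part_size = max(part_size, needed)
    if part_size > MAX_PART_SIZE:
        raise ValueError("object is too large for a single multipart upload")

    parts = []
    offset = 0
    number = 1  # part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


if __name__ == "__main__":
    for part in plan_parts(20 * 1024 * 1024):
        print(part)
```

Because each part is an independent byte range, parts can be uploaded in parallel and a failed part can be retried without resending the whole object.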

“How do I integrate my data for

maximum impact?”

S3, DynamoDB, EMR, HBase on EMR, RDS, Redshift, On-premises

AWS Data Pipeline

Announced in November, available now.

Orchestration for data-intensive workloads.

AWS Data Pipeline

Data-intensive orchestration and automation

Reliable and scheduled

Easy to use, drag and drop

Execution and retry logic

Map data dependencies

Create and manage temporary compute resources
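Two of the ideas listed above, mapping data dependencies and execution with retry logic, can be sketched in a few lines of Python. This is not the AWS Data Pipeline API or its pipeline definition format; it is a minimal local illustration that uses the standard library's graphlib module (Python 3.9+). The task names, the `run_pipeline` function and the simulated transient failure are all hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies, max_attempts=3):
    """Run tasks in dependency order, retrying each one on failure.

    tasks        -- dict mapping task name to a zero-argument callable
    dependencies -- dict mapping task name to the set of names it depends on
    max_attempts -- attempts per task before the error is re-raised

    Returns (execution_order, attempts_used_per_task).
    """
    for name, prereqs in dependencies.items():
        unknown = ({name} | set(prereqs)) - set(tasks)
        if unknown:
            raise ValueError(f"unknown task(s): {sorted(unknown)}")

    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())

    attempts = {}
    for name in order:
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                tasks[name]()
                break  # success: move on to the next task
            except Exception:
                if attempt == max_attempts:
                    raise  # out of retries: surface the failure
    return order, attempts


def demo():
    """Hypothetical three-step pipeline whose middle step fails once."""
    log = []
    state = {"job_calls": 0}

    def copy_logs():
        log.append("copy_logs")

    def run_job():
        state["job_calls"] += 1
        if state["job_calls"] < 2:
            raise RuntimeError("simulated transient failure")
        log.append("run_job")

    def load_results():
        log.append("load_results")

    tasks = {"load_results": load_results, "run_job": run_job, "copy_logs": copy_logs}
    deps = {"run_job": {"copy_logs"}, "load_results": {"run_job"}}
    order, attempts = run_pipeline(tasks, deps)
    return order, attempts, log


if __name__ == "__main__":
    print(demo())
```

A managed service adds scheduling, notifications and provisioning of temporary compute on top of this basic ordering and retry behaviour.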

Anatomy of a pipeline

Additional checks and notifications

Arbitrarily complex pipelines

aws.amazon.com/datapipeline

aws.amazon.com/big-data

Summary

1. Introducing Big Data

2. From data to actionable information

3. Analytics and Cloud Computing

4. The Big Data ecosystem

Get 600 hours of free supercomputing time!

www.powerof60.com

Thank you!

sirota@amazon.com
