BDT201 AWS Data Pipeline - AWS re:Invent 2012

DESCRIPTION

In this session, we'll review the features and architecture of the new AWS Data Pipeline service and explain how you can use it to better manage your data-driven workloads. We'll then go over a few examples of setting up and provisioning a pipeline in the system.

Page 6:

Diagram labels: Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, On Premise, HDFS (Amazon EMR)

Page 7:

Diagram labels: Amazon DynamoDB, Amazon S3

Page 16:

Input Datanode -> Activity -> [Output Datanode]

Page 17:

Input Datanode (with precondition check) -> Activity (with failure and delay notifications) -> Output Datanode
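
The flow on this slide (a precondition guarding the input, then an activity that reports failures and delays) can be sketched in Python. This is an illustrative toy rather than the service's API: `run_activity`, `notify`, and the `max_seconds` threshold are hypothetical names introduced here, not taken from the presentation.

```python
import time
from typing import Callable, List


def run_activity(
    precondition: Callable[[], bool],
    activity: Callable[[], str],
    notify: Callable[[str], None],
    max_seconds: float,
) -> str:
    """Run `activity` only if `precondition` holds; report failure or delay via `notify`."""
    if not precondition():
        notify("precondition not met")
        return "skipped"
    started = time.monotonic()
    try:
        output = activity()
    except Exception as exc:  # any error raised by the activity counts as a failure
        notify(f"failure: {exc}")
        return "failed"
    if time.monotonic() - started > max_seconds:
        notify("delay: activity ran longer than expected")
    return output


messages: List[str] = []
result = run_activity(
    precondition=lambda: True,          # stands in for a check such as "input files exist"
    activity=lambda: "output written",  # stands in for the real work
    notify=messages.append,             # a real system might send an e-mail or alert instead
    max_seconds=60.0,
)
print(result, messages)
```

Passing the precondition, activity, and notifier as plain callables keeps the sketch independent of any particular data store or compute resource.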

Page 20:

Diagram labels: Compute Resources; Data (twice); Data Stores (twice)

Page 22:

Start, Interval, [End]

Page 23:

Start: Noon Today
Interval: 1 hour

Page 24:

Diagram labels: ….., 12-1pm, 1-2pm, 2-3pm, X
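
A schedule defined by a start, an interval, and an optional end yields consecutive periods such as 12-1pm, 1-2pm, and 2-3pm. The following is a minimal sketch of that idea only; `schedule_windows` and its `limit` parameter are hypothetical helpers, and the calendar date is an arbitrary stand-in for "noon today".

```python
from datetime import datetime, timedelta
from typing import Iterator, Optional, Tuple


def schedule_windows(
    start: datetime,
    interval: timedelta,
    end: Optional[datetime] = None,
    limit: int = 3,
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (window_start, window_end) periods beginning at `start`.

    `limit` caps the number of windows so the generator stays finite
    when no `end` is given.
    """
    window_start = start
    produced = 0
    while produced < limit and (end is None or window_start + interval <= end):
        yield window_start, window_start + interval
        window_start += interval
        produced += 1


noon = datetime(2012, 11, 28, 12, 0)  # arbitrary date standing in for "noon today"
for begin, finish in schedule_windows(noon, timedelta(hours=1)):
    print(f"{begin:%H:%M}-{finish:%H:%M}")
# prints 12:00-13:00, 13:00-14:00, 14:00-15:00 on separate lines
```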

Page 25:

Diagram labels: ….., 12-1pm, 1-2pm, 2-3pm, 1 day X, X

Page 26:

Hourly, Daily, Weekly, Monthly, Yearly, Quarterly

Page 29:

Inputs: S3 logs (hourly); Geolocation data
Activity: Per-geography usage computation (hourly)
Output: Redshift results

Page 30:

Inputs: S3 logs (hourly), Precondition: files exist; Geolocation data, Precondition: ./geo_available
Activity: Per-geography usage computation (hourly)
Output: Redshift results

Page 32:

Inputs: Dynamo event data; RDS demographics
Activity: Hive-based analysis (hourly)
Output: Redshift results

Page 34:

Diagram labels: Hourly click updates; Hourly event analysis; Daily reporting SQL

Page 36:

Diagram labels: Amazon S3 logs; Custom Precondition; EMR usage-by-geo job; Amazon EC2 report generation; Amazon DynamoDB event data; Amazon RDS demographics; Amazon Redshift DW table (twice); Hive script

Page 39:

We Manage: EC2 Instances, EMR Clusters
You Manage: On Premise Resources, EC2 Instances, EMR Clusters

Page 45:

{
  "objects" : [
    {
      "name" : "My Copy",
      "type" : "Copy Action",
      "input" : { "ref" : "My RDS Data" },
      "output" : { "ref" : "My S3 Data" },
      "runsOn" : { "ref" : "My Instance" },
      "schedule" : { "ref" : "My Schedule" }
    },
    {
      "name" : "My Instance",
      "type" : "EC2Instance",
      "instanceType" : "m1.small",
      "schedule" : { "ref" : "My Schedule" }
    },
    …..
  ]
}
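
Objects in the definition point at one another by name through `{"ref": ...}` values. As a rough illustration of how such references could be checked, the sketch below lists names that are referenced but not defined among the objects shown. Only the two objects visible on the slide are included; the elided ones are left out, so their names are expected to come back as unresolved. `unresolved_refs` is a hypothetical helper, not part of any AWS tool.

```python
from typing import Any, Dict, List, Set

# The two objects shown on the slide; the rest of the definition is elided there.
objects: List[Dict[str, Any]] = [
    {
        "name": "My Copy",
        "type": "Copy Action",
        "input": {"ref": "My RDS Data"},
        "output": {"ref": "My S3 Data"},
        "runsOn": {"ref": "My Instance"},
        "schedule": {"ref": "My Schedule"},
    },
    {
        "name": "My Instance",
        "type": "EC2Instance",
        "instanceType": "m1.small",
        "schedule": {"ref": "My Schedule"},
    },
]


def unresolved_refs(objs: List[Dict[str, Any]]) -> Set[str]:
    """Return names referenced via {"ref": ...} that no object in `objs` defines."""
    defined = {obj["name"] for obj in objs}
    referenced = {
        value["ref"]
        for obj in objs
        for value in obj.values()
        if isinstance(value, dict) and "ref" in value
    }
    return referenced - defined


print(sorted(unresolved_refs(objects)))
# prints ['My RDS Data', 'My S3 Data', 'My Schedule']
```

"My Instance" resolves because it is defined in the excerpt; the other three names belong to the part of the definition the slide omits.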

Page 47:

                  On AWS        On Premise
High Frequency    $1/month      $2.50/month
Low Frequency     $.60/month    $1.50/month
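
The table lends itself to a quick cost estimate. The slide text does not say what unit each price applies to, so the sketch below ASSUMES the price is charged per billed item (for example, per activity) per month; that assumption, the `monthly_cost` helper, and the sample workload are all hypothetical.

```python
from decimal import Decimal
from typing import Dict, Tuple

# Monthly prices from the slide, keyed by (frequency, location).
PRICES: Dict[Tuple[str, str], Decimal] = {
    ("high", "aws"): Decimal("1.00"),
    ("high", "on_premise"): Decimal("2.50"),
    ("low", "aws"): Decimal("0.60"),
    ("low", "on_premise"): Decimal("1.50"),
}


def monthly_cost(counts: Dict[Tuple[str, str], int]) -> Decimal:
    """Sum price * count, ASSUMING each price applies per billed item per month."""
    return sum((PRICES[key] * n for key, n in counts.items()), Decimal("0"))


# Hypothetical workload: 3 high-frequency items on AWS, 2 low-frequency items on premise.
print(monthly_cost({("high", "aws"): 3, ("low", "on_premise"): 2}))
```

`Decimal` is used instead of `float` so that amounts such as 0.60 are represented exactly.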

Page 50:

We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.