2013 AWS Worldwide Public Sector Summit, Washington, D.C. Big Data in the Cloud: Accelerating Innovation in the Public Sector. Jamie Kinney, Principal Solutions Architect. [email protected] │ @jamiekinney


2013 AWS Worldwide Public Sector Summit Washington, D.C.

Big Data in the Cloud: Accelerating Innovation in the Public Sector

Jamie Kinney│Principal Solutions Architect

[email protected] │ @jamiekinney

BIG DATA

Technologies and techniques for working productively with data, at any scale

Bigger is Better!

The more data you collect, the more VALUE you can derive from it.

YOU DON’T HAVE THE CHOICE…

27 TB per day Large Hadron Collider – CERN
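
To put the 27 TB per day figure in networking terms, the following minimal sketch converts a daily data volume into the average sustained link rate needed to move it. It assumes decimal terabytes (10^12 bytes) and a perfectly even flow over 24 hours; the function name is hypothetical and not part of the original deck.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_gbps(terabytes_per_day: float) -> float:
    """Average rate in Gbit/s needed to move the given daily volume.

    Assumes decimal units (1 TB = 10**12 bytes) and a constant flow.
    """
    bytes_per_day = terabytes_per_day * 10**12
    bits_per_second = bytes_per_day * 8 / SECONDS_PER_DAY
    return bits_per_second / 10**9


if __name__ == "__main__":
    # Daily volume quoted on the slide for the Large Hadron Collider.
    print(f"{sustained_rate_gbps(27):.2f} Gbit/s sustained")
```

Under those assumptions, 27 TB per day works out to an average of 2.5 Gbit/s, before any protocol overhead or bursts.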

Unconstrained data growth

[Chart: data volume growing from GB through TB, PB, and EB to ZB, spanning compute, storage, and big data]

95% of the 1.2 zettabytes of data in the digital universe is unstructured.

70% of this is user-generated content.

Unstructured data growth is explosive, with an estimated compound annual growth rate (CAGR) of 62% from 2008 to 2012.

Source: IDC
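
As a rough illustration of what a 62% CAGR implies, the sketch below compounds the rate over the period. Treating 2008 to 2012 as four compounding years is an assumption; the IDC source may count the period differently. The function name is hypothetical.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Total growth factor after compounding `cagr` for `years` years."""
    return (1.0 + cagr) ** years


if __name__ == "__main__":
    # 62% CAGR from the slide; four compounding years is an assumption.
    multiple = growth_multiple(0.62, 4)
    print(f"Data volume grows about {multiple:.1f}x over the period")
```

Under that assumption the volume of unstructured data grows roughly sevenfold over the period.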

Big Data Verticals

Media/Advertising: Targeted Advertising; Image and Video Processing

Oil & Gas: Seismic Analysis

Retail: Recommendations; Transaction Analysis

Life Sciences: Genome Analysis

Financial Services: Monte Carlo Simulations; Risk Analysis

Security: Anti-virus; Fraud Detection; Image Recognition

Social Network/Gaming: User Demographics; Usage Analysis; In-game Metrics

VOLUME

VELOCITY

VARIETY

COLLECT │ STORE │ ANALYZE │ SHARE

AWS IMPORT/EXPORT

AWS Direct Connect

COLLECT │ STORE │ ANALYZE │ SHARE

AMAZON S3

Objects in S3

[Chart: number of objects stored in S3, Q4 2006 through Q2 2013]

2 trillion objects

1.1 M peak transactions per second

AMAZON DYNAMODB

AMAZON REDSHIFT

AMAZON RDS

HBase on AMAZON EMR

COLLECT │ STORE │ ANALYZE │ SHARE

AMAZON EC2

Instance Types

[Chart: instance types plotted by memory (GB, 1 to 256) against EC2 Compute Units (1 to 128)]

Instance families: Standard, 2nd Gen Standard, Micro, High-Memory, High-CPU, Cluster Compute, Cluster GPU, High I/O, High-Storage, Cluster High-Mem

hi1.4xlarge: 60.5 GB of memory, 35 EC2 Compute Units, 2 x 1024 GB SSD instance storage, 64-bit platform

cc1.4xlarge: 23 GB of memory, 33.5 EC2 Compute Units, 1690 GB of instance storage, 64-bit platform

c1.xlarge: 7 GB of memory, 20 EC2 Compute Units, 1690 GB of instance storage, 64-bit platform

m1.small: 1.7 GB of memory, 1 EC2 Compute Unit, 160 GB of instance storage, 32-bit or 64-bit platform

m1.medium: 3.75 GB of memory, 2 EC2 Compute Units, 410 GB of instance storage, 32-bit or 64-bit platform

m1.large (EBS Optimizable): 7.5 GB of memory, 4 EC2 Compute Units, 850 GB of instance storage, 64-bit platform

m1.xlarge (EBS Optimizable): 15 GB of memory, 8 EC2 Compute Units, 1690 GB of instance storage, 64-bit platform

m2.xlarge: 17.1 GB of memory, 6.5 EC2 Compute Units, 420 GB of instance storage, 64-bit platform

m2.2xlarge: 34.2 GB of memory, 13 EC2 Compute Units, 850 GB of instance storage, 64-bit platform

m2.4xlarge (EBS Optimizable): 68.4 GB of memory, 26 EC2 Compute Units, 1690 GB of instance storage, 64-bit platform

t1.micro: 613 MB of memory, up to 2 EC2 Compute Units, EBS storage only, 32-bit or 64-bit platform

c1.medium: 1.7 GB of memory, 5 EC2 Compute Units, 350 GB of instance storage, 32-bit or 64-bit platform

cg1.4xlarge: 22 GB of memory, 33.5 EC2 Compute Units, 2 x NVIDIA Tesla “Fermi” M2050 GPUs, 1690 GB of instance storage, 64-bit platform

cc2.8xlarge: 60.5 GB of memory, 88 EC2 Compute Units, 3370 GB of instance storage, 64-bit platform

m3.xlarge: 15 GB of memory, 13 EC2 Compute Units

m3.2xlarge (EBS Optimizable): 30 GB of memory, 26 EC2 Compute Units

hs1.8xlarge: 117 GB of memory, 35 EC2 Compute Units, 24 x 2 TB instance storage, 64-bit platform

cr1.8xlarge: 244 GB of memory, 88 EC2 Compute Units, 2 x 120 GB SSD instance storage, 64-bit platform
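
One way to read the instance type list is as a lookup table for sizing a workload. The sketch below encodes a few of the listed types and picks the smallest one that meets a memory and compute requirement. The selection rule (fewest EC2 Compute Units first, then least memory) and all names in the code are illustrative assumptions; a real choice would also weigh price and storage, which this slide does not give.

```python
from typing import NamedTuple, Optional


class InstanceType(NamedTuple):
    name: str
    memory_gb: float
    compute_units: float


# A subset of the instance types listed on the slide.
CATALOG = [
    InstanceType("m1.small", 1.7, 1),
    InstanceType("m1.large", 7.5, 4),
    InstanceType("m1.xlarge", 15, 8),
    InstanceType("c1.xlarge", 7, 20),
    InstanceType("m2.4xlarge", 68.4, 26),
    InstanceType("cc2.8xlarge", 60.5, 88),
    InstanceType("cr1.8xlarge", 244, 88),
]


def smallest_fit(min_memory_gb: float, min_compute_units: float) -> Optional[InstanceType]:
    """Return the smallest catalog entry meeting both minimums, or None."""
    candidates = [
        inst
        for inst in CATALOG
        if inst.memory_gb >= min_memory_gb and inst.compute_units >= min_compute_units
    ]
    # Illustrative ordering: prefer fewer compute units, then less memory.
    return min(candidates, key=lambda inst: (inst.compute_units, inst.memory_gb), default=None)


if __name__ == "__main__":
    print(smallest_fit(min_memory_gb=10, min_compute_units=4))
```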

GPU: GRAPHICS PROCESSING UNIT

CLUSTER GPU QUADRUPLE EXTRA LARGE

2x Intel Xeon X5570, quad-core, Nehalem architecture

2x NVIDIA Tesla Fermi M2050 GPUs

22 GB of memory, 1.7 TB of storage

$0.35 / hour (Amazon EC2 Spot)

PARALLELIZATION

ON A SINGLE INSTANCE

COST: 4h x $2.10 = $8.40

RENDERING TIME: 4h

ON MULTIPLE INSTANCES

COST: 2 x 2h x $2.10 = $8.40

RENDERING TIME: 2h

What are Spot Instances?

[Diagram: a Region containing two Availability Zones; blocks of unused capacity are sold at discounts of 50%, 56%, 66%, 59%, 54%, and 63%]

ON MULTIPLE SPOT INSTANCES

COST: 4 x 1h x $0.35 = $1.40

RENDERING TIME: 1h
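
The three rendering scenarios follow the same arithmetic: cost is instances times hours per instance times hourly price, and wall-clock time shrinks as instances are added. The sketch below reproduces the slide figures. It assumes the job parallelizes perfectly, and it ignores per-hour billing granularity and Spot interruptions; the function name is hypothetical.

```python
from typing import Tuple


def render_job(total_instance_hours: float, instances: int, price_per_hour: float) -> Tuple[float, float]:
    """Return (wall_clock_hours, total_cost) for a perfectly parallel job."""
    wall_clock_hours = total_instance_hours / instances
    total_cost = instances * wall_clock_hours * price_per_hour
    return wall_clock_hours, total_cost


if __name__ == "__main__":
    scenarios = [
        ("single instance", 1, 2.10),
        ("multiple instances", 2, 2.10),
        ("multiple Spot instances", 4, 0.35),
    ]
    for label, count, price in scenarios:
        hours, cost = render_job(4, count, price)
        print(f"{label}: {hours:g}h, ${cost:.2f}")
```

Adding On-Demand instances shortens the job without changing its cost, while the Spot price from the slide cuts both time and cost.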

"Hadoop is a reliable storage and data analysis system"

HDFS MapReduce
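
To make the MapReduce half of that pairing concrete, here is a minimal single-process sketch of the programming model using a word count. It is plain Python, not the Hadoop API: a real cluster distributes the map and reduce phases across nodes and reads its input from HDFS. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big compute"]))
```

Because each map call sees only one line and each reduce call sees only one key, both phases can run on many machines at once.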

Deploying a Hadoop cluster is hard

AMAZON EMR HADOOP + AWS

Massive Scale

Amazon EMR: 5.5M clusters launched by customers since May 2010

[Chart: Amazon Elastic MapReduce clusters launched by customers, 0 to 6,000,000, from 5/22/2010 through 3/16/2013]

USE THE RIGHT TOOL FOR THE RIGHT JOB

RDBMS (Amazon RDS): Interactive Reporting (<1 sec); Multistep Transactions; Lots of Updates/Deletes

Hadoop (Amazon EMR): Affordable Storage/Compute; Structured or Not (Agility); Resilient Auto Scalability

Data Warehouse (Steady State)

Expand to 25 instances

Data Warehouse (Batch Processing)

Shrink to 9 instances

Data Warehouse (Steady State)

COLLECT │ STORE │ ANALYZE │ SHARE

PUBLIC DATA SETS

http://aws.amazon.com/publicdatasets

COLLECT │ STORE │ ANALYZE │ SHARE

INNOVATE

“Want to increase innovation? Lower the cost of failure.”

Joi Ito

AWS LOWERS THE COST OF INNOVATION

Testing a new idea is cheap

Georgetown University

Next-generation sequencing and whole genomics analysis to identify causation for premature birth

Solution Overview

Alignment, mapping, variant-calling

Downstream variant analytic pipelines

Hosted data portal including MongoDB

Genomic data storage (raw and processed)

Accessing the 1000 Genomes public data set

SEC MIDAS & Tradeworx

Real-time analysis of 20 billion messages/day

Reconstruct any market, any day in history

Solution Overview

Data Servers

Analytic Servers

Market reconstruction processing

Store historical stock ‘tick’ information

The Results

“For the growing team of quant types now employed at the SEC, MIDAS is becoming the world’s greatest data sandbox. And the staff is planning to use it to make the SEC a leader in its use of market data.”

Elisse B. Walter, Chairman of the SEC

“This basically propels the SEC from zero to 60 in one fell swoop, going from being way behind even the most basic market participant to being on par if not ahead of the vast majority of market participants, in terms of their system and analytical capabilities.”

Gregg E. Berman, Associate Director of the Office of Analytics and Research

Thank You