© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Rahul Pathak, Amazon EMR March 30, 2016 Building Big Data Solutions with Amazon EMR & Amazon Redshift

AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Page 1: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Rahul Pathak, Amazon EMR

March 30, 2016

Building Big Data Solutions with Amazon EMR & Amazon Redshift

Page 2: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Agenda

• AWS Big Data Platform Overview

• Amazon EMR & Amazon Redshift

• Building a Big Data Application

• Customer Use Cases

Page 3: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

AWS Big Data Platform

• Collect: Import/Export, Kinesis, Direct Connect

• Store: S3, Glacier, DynamoDB

• Analyze: EMR, EC2, Redshift, Machine Learning, Amazon QuickSight (New!)

Page 4: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Amazon EMR – Managed Hadoop Clusters in the Cloud

Scalable Hadoop clusters as a service

Hadoop, Hive, Spark, Presto, HBase, etc.

Easy to use; fully managed

On demand, reserved, spot pricing

HDFS, Amazon EBS, and S3 filesystems

End to end security

Amazon EMR

Page 5: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Easy to deploy

AWS Management Console

or use the EMR API with your favorite SDK

Command Line
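As a rough illustration of the SDK route, the sketch below uses the AWS SDK for Python (boto3) and its `run_job_flow` call. The cluster name, release label, instance type, and default role names are assumptions chosen for the example, not values from the slides.

```python
def build_cluster_request(name, core_count, instance_type="m3.xlarge"):
    """Return keyword arguments for the EMR RunJobFlow API call."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.4.0",  # assumed release; pick the one you need
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_count},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed default role names; your account may use different ones.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request with boto3 (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the request builder runs without boto3
    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


# Hypothetical cluster name, for illustration only.
request = build_cluster_request("webinar-demo", core_count=2)
print(request["Instances"]["InstanceGroups"][1]["InstanceCount"])
```

Calling `launch_cluster(request)` would actually start (and bill for) a cluster, so the example only builds and inspects the request.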

Page 6: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Choose your instance types

• General: M3/M4 family (batch process)

• CPU: C3/C4 family (machine learning)

• Memory: R3 family (Spark and interactive)

• Disk/IO: D2/I2 family (large HDFS)

Customize your storage type and size using Amazon EBS

Try different configurations to find your optimal architecture

Page 7: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

The Hadoop ecosystem can run in Amazon EMR

Page 8: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Integrated with the AWS Platform

• Amazon DynamoDB: EMR-DynamoDB connector

• Amazon RDS: JDBC data source with Spark SQL

• Amazon Kinesis: streaming data connectors

• Elasticsearch: Elasticsearch connector

• Amazon Redshift: Amazon Redshift COPY from HDFS

• Amazon S3: EMR File System (EMRFS)

Page 9: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Amazon S3 as your persistent data store

Amazon S3

Designed for 99.999999999% durability

Separate compute and storage

Resize and shut down Amazon EMR clusters with no data loss

Point multiple Amazon EMR clusters at same data in Amazon S3 using the EMR File System (EMRFS)

Page 10: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

EMRFS makes it easier to leverage Amazon S3

Better performance and error handling options

Transparent to applications – just read/write to “s3://”

Support for Amazon S3 server-side and client-side encryption

Faster listing using EMRFS metadata

HDFS is still available via local instance storage or Amazon EBS

Page 11: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Amazon Redshift

Relational data warehouse

Massively parallel; Petabyte scale

Fully managed

HDD and SSD Platforms

$1,000/TB/Year; start at $0.25/hour

End to end security; built in global DR

Amazon Redshift

Page 12: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Amazon Redshift dramatically reduces I/O

Column storage

ID Age State Amount

123 20 CA 500

345 25 WA 250

678 40 FL 125

957 37 WA 375

• With row storage you do unnecessary I/O

• To get total amount, you have to read everything

Page 13: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Amazon Redshift dramatically reduces I/O

Column storage

• With column storage, you only read the data you need

ID Age State Amount

123 20 CA 500

345 25 WA 250

678 40 FL 125

957 37 WA 375
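The difference between the two layouts can be sketched with the four-row table from the slide. This is a toy model that counts values touched rather than disk blocks, and the function names are made up for illustration.

```python
# The table from the slide, stored row by row.
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]
names = ("ID", "Age", "State", "Amount")

# The same data stored column by column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}


def total_from_rows(rows):
    """Sum Amount from row storage; every value in each row is read."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)  # the whole row comes off disk
        total += row[3]
    return total, values_read


def total_from_columns(columns):
    """Sum Amount from column storage; only the Amount column is read."""
    amounts = columns["Amount"]
    return sum(amounts), len(amounts)


print(total_from_rows(rows))        # (1250, 16)
print(total_from_columns(columns))  # (1250, 4)
```

Both layouts give the same total, but the row layout touches all 16 values while the column layout touches only the 4 it needs.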

Page 14: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Amazon Redshift dramatically reduces I/O

Column storage

Data compression

• Columnar compression saves space & reduces I/O

• Amazon Redshift analyzes and compresses your data

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw
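To give a feel for why an encoding such as delta can shrink a column, here is a minimal sketch of delta encoding in general. It is not Redshift's on-disk format, and the sample IDs are invented for the example.

```python
def delta_encode(values):
    """Keep the first value, then store each later value as a difference."""
    if not values:
        return []
    encoded = [values[0]]
    for previous, current in zip(values, values[1:]):
        encoded.append(current - previous)
    return encoded


def delta_decode(encoded):
    """Rebuild the original values by accumulating the differences."""
    decoded = []
    running = 0
    for index, delta in enumerate(encoded):
        running = delta if index == 0 else running + delta
        decoded.append(running)
    return decoded


# Hypothetical, mostly increasing IDs such as a listid column might hold.
list_ids = [1000, 1001, 1003, 1004, 1010]
encoded = delta_encode(list_ids)
print(encoded)  # [1000, 1, 2, 1, 6]
```

The small differences fit in fewer bytes than the full values, which is where the space and I/O savings come from when neighboring values are close together.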

Page 15: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Amazon Redshift dramatically reduces I/O

Column storage

Data compression

Zone maps

• Track the minimum and maximum values for each block

• Skip over blocks that don’t contain the data needed for a given query

• Minimize unnecessary I/O
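The block-skipping idea can be sketched in a few lines. The block contents and the `scan_equal` helper are assumptions for illustration; real blocks are far larger and the real implementation is internal to Redshift.

```python
# Four small blocks of a sorted column (toy sizes).
blocks = [
    [1, 5, 9],
    [12, 15, 20],
    [21, 30, 33],
    [40, 41, 48],
]

# The zone map: one (minimum, maximum) pair per block.
zone_map = [(min(block), max(block)) for block in blocks]


def scan_equal(target):
    """Return matching values and how many blocks had to be read."""
    matches, blocks_read = [], 0
    for (low, high), block in zip(zone_map, blocks):
        if target < low or target > high:
            continue  # the zone map proves this block cannot match
        blocks_read += 1
        matches.extend(value for value in block if value == target)
    return matches, blocks_read


print(scan_equal(30))  # ([30], 1)
print(scan_equal(10))  # ([], 0)
```

A lookup for 30 reads one block out of four, and a lookup for 10 reads none, because no block's range contains it.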

Page 16: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Amazon Redshift dramatically reduces I/O

Column storage

Data compression

Zone maps

Direct-attached storage

Large data block sizes

• Use direct-attached storage to maximize throughput

• Hardware optimized for high performance data processing

• Large block sizes to make the most of each read

• Amazon Redshift manages durability for you

Page 17: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Amazon Redshift Has Security Built In

SSL to secure data in transit

Encryption to secure data at rest

• AES-256; hardware accelerated

• All blocks on disks and in Amazon S3 encrypted

• HSM support

No direct access to compute nodes

Audit logging, AWS CloudTrail, AWS KMS integration

Amazon VPC support

SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA

[Diagram: Amazon Redshift cluster architecture. Labels: SQL Clients/BI Tools, JDBC/ODBC, Customer VPC, Leader Node, Internal VPC, 10 GigE (HPC), three Compute Nodes, Ingestion, Backup, Restore, Amazon S3 / DynamoDB / EMR. Nodes are labeled 128 GB RAM, 16 TB disk, 16 cores.]

Page 18: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Building a Big Data Application

Page 19: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Building a Big Data Application

Getting Started

[Diagram labels: web clients, mobile clients, DBMS, corporate data center]

Page 20: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Building a Big Data Application

Adding a data warehouse

[Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS cloud, Amazon Redshift, Amazon QuickSight]

Page 21: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Building a Big Data Application

Bringing in Log Data

[Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS cloud, Raw data, Staging Data, Amazon Redshift, Amazon QuickSight]
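One common way to move staged files from Amazon S3 into Amazon Redshift is the `COPY` command. The slides do not show the load step itself, so the sketch below only assembles a plausible statement; the table name, bucket path, and IAM role ARN are hypothetical.

```python
def build_copy_statement(table, s3_prefix, iam_role_arn, gzip=True):
    """Assemble a Redshift COPY statement for pipe-delimited files in S3."""
    parts = [
        "COPY {}".format(table),
        "FROM '{}'".format(s3_prefix),
        "IAM_ROLE '{}'".format(iam_role_arn),
        "DELIMITER '|'",
    ]
    if gzip:
        parts.append("GZIP")  # the staged files are gzip-compressed
    return "\n".join(parts) + ";"


# Hypothetical table, bucket, and role, for illustration only.
statement = build_copy_statement(
    table="weblogs",
    s3_prefix="s3://example-staging-bucket/weblogs/2016/03/30/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

The resulting text would be run through any SQL client connected to the cluster. Because `COPY` reads every object under the prefix, writing staged data to date-based prefixes keeps each load small.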

Page 22: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Building a Big Data Application

Extending your DW to S3

[Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS cloud, Raw data, Staging Data, ORC/Parquet (query optimized), Amazon Redshift, Amazon QuickSight]

Page 23: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Building a Big Data Application

Adding a real-time layer

[Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS cloud, Kinesis Streams, Raw data, Staging Data, ORC/Parquet (query optimized), Amazon Redshift, Amazon QuickSight]

Page 24: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Building a Big Data Application

Adding predictive analytics

[Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS cloud, Kinesis Streams, Raw data, Staging Data, ORC/Parquet (query optimized), Amazon Redshift, Amazon QuickSight]

Page 25: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Building a Big Data Application

Adding encryption at rest with AWS KMS

[Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS cloud, AWS KMS, Kinesis Streams, Raw data, Staging Data, ORC/Parquet (query optimized), Amazon Redshift, Amazon QuickSight]

Page 26: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Building a Big Data Application

Protecting Data in Transit & Adding Network Isolation

[Diagram labels: web clients, mobile clients, DBMS, corporate data center, SSL/TLS, AWS cloud, VPC subnet, AWS KMS, Kinesis Streams, Raw data, Staging Data, ORC/Parquet (query optimized), Amazon Redshift, Amazon QuickSight]

Page 27: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Security

• Encryption at rest with choice of key management
  • Service managed, AWS KMS, CloudHSM, on-premises HSM

• Encryption in transit
  • Require SSL; all internal communication over SSL/TLS

• Network isolation using Amazon VPC

• Fine-grained permissions and auditing using AWS IAM and AWS CloudTrail

Page 28: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Compliance

ISO 9001

SOC 3

SOC 2

ISO 27001

ISO 27017

PCI DSS Level 1

ISO 27018

SOC 1 / ISAE 3402

GxP

HIPAA

ITAR

FERPA

FISMA, RMF, and DIACAP

FedRAMP

Section 508 / VPAT

DoD SRG Levels 2 & 4

FIPS 140-2

CJIS

Cloud Security Alliance

MPAA

NIST

MLPS Level 3

G-Cloud

IT-Grundschutz

MTCS Tier 3

IRAP

Cyber Essentials Plus

Page 29: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Disaster Recovery

• Amazon EMR & Amazon Redshift clusters are resilient: failed nodes and hardware are replaced automatically

• Data in S3 is available in all Availability Zones in a Region

• S3 data can be synced across regions

• Amazon Redshift clusters are continuously backed up to S3 and snapshots can be synced to a second region
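Syncing snapshots to a second region can be switched on through the Redshift API. The sketch below uses boto3's `enable_snapshot_copy` call; the cluster identifier, the two regions, and the seven-day retention period are assumptions for illustration.

```python
def build_snapshot_copy_request(cluster_id, destination_region, retention_days=7):
    """Return keyword arguments for Redshift's EnableSnapshotCopy call."""
    return {
        "ClusterIdentifier": cluster_id,
        "DestinationRegion": destination_region,
        "RetentionPeriod": retention_days,  # days to keep copied snapshots
    }


def enable_cross_region_copy(request, source_region):
    """Apply the setting with boto3 (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the request builder runs without boto3
    redshift = boto3.client("redshift", region_name=source_region)
    return redshift.enable_snapshot_copy(**request)


# Hypothetical cluster name and destination region.
copy_request = build_snapshot_copy_request("example-dw", "us-west-2")
print(copy_request)
```

Only the request is built here; `enable_cross_region_copy(copy_request, "us-east-1")` would change a live cluster's configuration.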

Page 30: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Customer Use Cases

Page 31: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

NTT DOCOMO

[Diagram labels: Data Source ET, Direct Connect, Client, Forwarder, Loader, State Management, Sandbox, Redshift, S3]

Petabytes of data generated on premises and brought to Redshift in the cloud for analysis

High-speed connectivity over a redundant pair of Direct Connect leased lines

Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing

Page 32: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Nasdaq – Legacy Warehouse

Expensive ($1.16M annually)

Limited capacity (1 year of data online)

4-8 billion rows inserted per trading day, storing:

• Orders

• Trades

• Quotes

• Market Data

• Security Master

• Membership

The DW is used to analyze market share and client activity, support surveillance, power billing, and more…

Page 33: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Nasdaq Architecture

[Diagram: three zones labeled On premise, AWS Regional (Multi-AZ) Scope, and AWS (US-East, primary AZ/VPC). Component labels: S3, SNS, Redshift Database Cluster, HSM Key Appliance Cluster, MySQL, Redshift Load files/Manifests, Redshift Snapshots/Backups, Data Loaded Topic, RMS Input Sources (multiple systems), Data Ingest Process.]

Page 34: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

SmartNews

Page 36: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Summary

• AWS enables you to build sophisticated big data applications
  • Retrospective, Real-time, Predictive

• You can build incrementally, adding use cases and increasing scale as you go

• AWS provides a broad range of security and auditing features to enable you to meet your security requirements

• AWS makes it easy to build hybrid applications that span across your datacenters and the AWS Cloud

Page 37: AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift

Thank you!

@rahulpathak