AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR and Amazon Redshift


© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Rahul Pathak, Amazon EMR

March 30, 2016

Building Big Data Solutions with Amazon EMR & Amazon Redshift

Agenda

• AWS Big Data Platform Overview

• Amazon EMR & Amazon Redshift

• Building a Big Data Application

• Customer Use Cases

AWS Big Data Platform

Collect: Import/Export, Kinesis, Direct Connect

Store: S3, Glacier, DynamoDB

Analyze: EMR, EC2, Machine Learning, Redshift, Amazon QuickSight (New!)

Amazon EMR – Managed Hadoop Clusters in the Cloud

Scalable Hadoop clusters as a service

Hadoop, Hive, Spark, Presto, HBase, etc.

Easy to use; fully managed

On demand, reserved, spot pricing

HDFS, Amazon EBS, and S3 filesystems

End to end security

Amazon EMR

Easy to deploy

AWS Management Console

or use the EMR API with your favorite SDK

Command Line
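
As a rough illustration of the SDK route, the sketch below assembles a cluster request for the boto3 EMR client's `run_job_flow` call. The cluster name, release label, instance types, node count, and region are assumptions chosen for illustration, not values from the talk. The network call happens only inside `launch`, which needs boto3 and AWS credentials.

```python
def build_cluster_request(name="example-cluster", core_nodes=2):
    """Assemble keyword arguments for EMR run_job_flow (all values hypothetical)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.4.0",  # assumed release label
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m3.xlarge",   # assumed instance types
            "SlaveInstanceType": "r3.xlarge",
            "InstanceCount": core_nodes + 1,     # one master plus the core nodes
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",    # default EMR role names
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request, region="us-east-1"):
    """Submit the request; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request can be built without it

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


request = build_cluster_request(core_nodes=4)
print(request["Instances"]["InstanceCount"])
```

Keeping the request as plain data makes it easy to try different instance families before launching anything.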

Choose your instance types

CPU (C3/C4 family): machine learning

Memory (R3 family): Spark and interactive

Disk/IO (D2/I2 family): large HDFS

General (M3/M4 family): batch processing

Customize your storage type and size using Amazon EBS

Try different configurations to find your optimal architecture

The Hadoop ecosystem can run in Amazon EMR

Integrated with the AWS Platform

Amazon DynamoDB: EMR-DynamoDB connector

Amazon RDS: JDBC Data Source w/ Spark SQL

Amazon Kinesis: Streaming data connectors

Elasticsearch: Elasticsearch connector

Amazon Redshift: Amazon Redshift Copy From HDFS

Amazon S3: EMR File System (EMRFS)

Amazon EMR

Amazon S3 as your persistent data store

Amazon S3

Designed for 99.999999999% durability

Separate compute and storage

Resize and shut down Amazon EMR clusters with no data loss

Point multiple Amazon EMR clusters at same data in Amazon S3 using the EMR File System (EMRFS)

EMRFS makes it easier to leverage Amazon S3

Better performance and error handling options

Transparent to applications – just read/write to “s3://”

Support for Amazon S3 server-side and client-side encryption

Faster listing using EMRFS metadata

HDFS is still available via local instance storage or Amazon EBS

Amazon Redshift

Relational data warehouse

Massively parallel; Petabyte scale

Fully managed

HDD and SSD Platforms

$1,000/TB/Year; start at $0.25/hour

End to end security; built in global DR
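
The hourly entry price can be converted to a yearly figure with simple arithmetic. This is a sketch that assumes a single node running around the clock for a 365-day year at the quoted hourly rate; the $1,000/TB/year figure is a separate headline number and is not derived here.

```python
HOURS_PER_YEAR = 24 * 365  # assumes a non-leap year


def yearly_cost(hourly_rate, nodes=1):
    """Cost of running `nodes` nodes continuously for one year."""
    return hourly_rate * HOURS_PER_YEAR * nodes


entry_cost = yearly_cost(0.25)
print(f"${entry_cost:,.2f} per year")
```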

Amazon Redshift

Amazon Redshift dramatically reduces I/O

Column storage

ID  | Age | State | Amount
----+-----+-------+-------
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375

• With row storage you do unnecessary I/O

• To get total amount, you have to read everything

• With column storage, you only read the data you need
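
The difference can be sketched with the four-row table from the slide. This is a toy model that counts values touched rather than real disk I/O; the function names and the in-memory layout are assumptions for illustration only.

```python
# The four rows from the slide: (ID, Age, State, Amount)
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]
COLUMNS = ("id", "age", "state", "amount")


def sum_amount_row_store(rows):
    """Row layout: every value of every row is read to reach Amount."""
    total = 0
    values_read = 0
    for row in rows:
        values_read += len(row)
        total += row[3]
    return total, values_read


# Column layout: each column is stored as its own contiguous list.
columns = {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def sum_amount_column_store(columns):
    """Column layout: only the Amount column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


print(sum_amount_row_store(rows))        # (1250, 16)
print(sum_amount_column_store(columns))  # (1250, 4)
```

Both layouts give the same total, but the column layout touches 4 values instead of 16.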

Data compression

• Columnar compression saves space & reduces I/O

• Amazon Redshift analyzes and compresses your data

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw
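
Two of the encodings named in that output, delta and byte dictionary, can be illustrated with a simplified sketch. It shows only the general idea behind each technique; it is not Amazon Redshift's on-disk format, and the sample values are made up.

```python
def delta_encode(values):
    """Store the first value, then each value's difference from its predecessor."""
    if not values:
        return []
    encoded = [values[0]]
    for previous, current in zip(values, values[1:]):
        encoded.append(current - previous)
    return encoded


def delta_decode(encoded):
    """Rebuild the original values by accumulating the differences."""
    values = []
    for i, item in enumerate(encoded):
        values.append(item if i == 0 else values[-1] + item)
    return values


def dictionary_encode(values):
    """Replace each value with a small index into a table of distinct values."""
    dictionary = []
    index = {}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


list_ids = [1000, 1001, 1002, 1005, 1006]  # hypothetical, mostly sequential IDs
print(delta_encode(list_ids))              # [1000, 1, 1, 3, 1]

date_ids = [1827, 1827, 1828, 1827]        # hypothetical, few distinct values
print(dictionary_encode(date_ids))         # ([1827, 1828], [0, 0, 1, 0])
```

Small differences and small dictionary indexes need fewer bytes than the original values, which is where the space and I/O savings come from.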

Zone maps

• Track the minimum and maximum value for each block

• Skip over blocks that don’t contain the data needed for a given query

• Minimize unnecessary I/O
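
The block-skipping idea above can be sketched as follows. This is a minimal in-memory model: the block size, the sample values, and the function names are assumptions, and real blocks live on disk rather than in Python lists.

```python
def build_zone_map(values, block_size):
    """Split values into blocks and record (min, max) for each block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    zone_map = [(min(block), max(block)) for block in blocks]
    return blocks, zone_map


def scan_range(blocks, zone_map, low, high):
    """Return values in [low, high], reading only blocks whose range overlaps."""
    matches = []
    blocks_read = 0
    for block, (block_min, block_max) in zip(blocks, zone_map):
        if block_max < low or block_min > high:
            continue  # the zone map proves this block holds nothing we need
        blocks_read += 1
        matches.extend(v for v in block if low <= v <= high)
    return matches, blocks_read


values = list(range(1, 13))                    # sorted sample data: 1..12
blocks, zone_map = build_zone_map(values, 4)   # three blocks of four values
matches, blocks_read = scan_range(blocks, zone_map, 6, 7)
print(zone_map)              # [(1, 4), (5, 8), (9, 12)]
print(matches, blocks_read)  # [6, 7] 1
```

Skipping works best when the data is sorted on the filtered column, because then each block covers a narrow, non-overlapping range.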

Direct-attached storage

Large data block sizes

• Use direct-attached storage to maximize throughput

• Hardware optimized for high performance data processing

• Large block sizes to make the most of each read

• Amazon Redshift manages durability for you

Amazon Redshift Has Security Built In

SSL to secure data in transit

Encryption to secure data at rest

• AES-256; hardware accelerated

• All blocks on disks and in Amazon S3 encrypted

• HSM support

No direct access to compute nodes

Audit logging, AWS CloudTrail, AWS KMS integration

Amazon VPC support

SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA

[Architecture diagram. Labels: SQL Clients/BI Tools; JDBC/ODBC; Customer VPC; Internal VPC; Leader Node; three Compute Nodes; node specification shown as 128GB RAM, 16TB disk, 16 cores; 10 GigE (HPC) interconnect; Ingestion, Backup, and Restore with Amazon S3 / DynamoDB / EMR]
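
The division of work between a leader node and compute nodes can be sketched with a toy aggregation. This is only an assumption-laden illustration of massively parallel processing: the modulo on the key stands in for a hash distribution, the rows are made up, and nothing here models Amazon Redshift's actual distribution styles or node slices.

```python
def distribute(rows, key_index, node_count):
    """Assign each row to a compute node based on its key (modulo as a stand-in hash)."""
    slices = [[] for _ in range(node_count)]
    for row in rows:
        slices[row[key_index] % node_count].append(row)
    return slices


def compute_node_sum(slice_rows, value_index):
    """Work done independently on one compute node."""
    return sum(row[value_index] for row in slice_rows)


def leader_sum(rows, key_index, value_index, node_count=3):
    """The leader hands out the work, then combines the partial results."""
    slices = distribute(rows, key_index, node_count)
    partials = [compute_node_sum(s, value_index) for s in slices]
    return sum(partials), partials


# Hypothetical (order_id, amount) rows
orders = [(1, 500), (2, 250), (3, 125), (4, 375), (5, 100), (6, 50)]
total, partials = leader_sum(orders, key_index=0, value_index=1)
print(total, partials)  # 1400 [175, 875, 350]
```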

Building a Big Data Application

Getting Started

[Diagram labels: web clients, mobile clients, DBMS, corporate data center]

Adding a data warehouse

[Diagram: adds Amazon Redshift and Amazon QuickSight in the AWS cloud, alongside the corporate data center]

Bringing in Log Data

[Diagram: adds Raw data and Staging Data in the AWS cloud]
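
The slide does not say how staged data reaches Amazon Redshift; one common route is the Redshift `COPY` command reading from Amazon S3. The sketch below only builds the SQL text. The table name, bucket, prefix, and IAM role ARN are hypothetical, and the delimiter and gzip settings are assumptions about the staged files.

```python
def build_copy_statement(table, s3_prefix, iam_role_arn, delimiter="|", gzip=True):
    """Build a Redshift COPY statement for delimited files under an S3 prefix.

    Values are interpolated directly, so use only trusted inputs.
    """
    parts = [
        f"COPY {table}",
        f"FROM '{s3_prefix}'",
        f"IAM_ROLE '{iam_role_arn}'",
        f"DELIMITER '{delimiter}'",
    ]
    if gzip:
        parts.append("GZIP")
    return "\n".join(parts) + ";"


statement = build_copy_statement(
    table="weblogs_staging",                                        # hypothetical table
    s3_prefix="s3://example-bucket/staging/weblogs/",               # hypothetical bucket
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleCopyRole",  # hypothetical role
)
print(statement)
```

The resulting statement would be run through any SQL client connected to the cluster.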

Extending your DW to S3

[Diagram: adds ORC/Parquet (query optimized) data]

Adding a real-time layer

[Diagram: adds Kinesis Streams]
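
Feeding the real-time layer could look like the following sketch, which prepares a record for the boto3 Kinesis client's `put_record` call. The stream name, the event fields, and the choice of `user_id` as partition key are assumptions for illustration. The network call happens only inside `send`, which needs boto3 and AWS credentials.

```python
import json


def build_put_record(stream_name, event, partition_field="user_id"):
    """Assemble keyword arguments for Kinesis put_record from a dict event."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event, sort_keys=True).encode("utf-8"),
        # Records with the same partition key go to the same shard.
        "PartitionKey": str(event[partition_field]),
    }


def send(record, region="us-east-1"):
    """Send one record; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so records can be built without it

    kinesis = boto3.client("kinesis", region_name=region)
    return kinesis.put_record(**record)["SequenceNumber"]


record = build_put_record("example-clickstream", {"user_id": 42, "action": "view"})
print(record["PartitionKey"], record["Data"])
```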

Adding predictive analytics

[Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS cloud, Raw data, Staging Data, ORC/Parquet (query optimized), Amazon Redshift, Amazon QuickSight, Kinesis Streams]

Adding encryption at rest with AWS KMS

[Diagram: adds AWS KMS]

Protecting Data in Transit & Adding Network Isolation

[Diagram: adds a VPC subnet and SSL/TLS connections]

Security

• Encryption at rest with choice of key management
  • Service managed, AWS KMS, CloudHSM, on-premises HSM

• Encryption in transit
  • Require SSL, all internal communication over SSL/TLS

• Network isolation using Amazon VPC

• Fine-grained permissions and auditing using AWS IAM and AWS CloudTrail
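
For Amazon Redshift, some of these controls map onto cluster settings. The sketch below builds two fragments: keyword arguments that could be merged into a boto3 Redshift `create_cluster` call, and a `require_ssl` parameter for `modify_cluster_parameter_group`. The key ID, subnet group, and parameter group names are hypothetical, and the fragments are not a complete cluster definition.

```python
def secure_cluster_settings(kms_key_id, subnet_group):
    """Encryption-at-rest and network-isolation settings for create_cluster."""
    return {
        "Encrypted": True,                       # encrypt data at rest
        "KmsKeyId": kms_key_id,                  # AWS KMS key used for encryption
        "ClusterSubnetGroupName": subnet_group,  # place the cluster in a VPC subnet group
        "PubliclyAccessible": False,             # no public endpoint
    }


def require_ssl_parameter(parameter_group):
    """Arguments for modify_cluster_parameter_group that force SSL connections."""
    return {
        "ParameterGroupName": parameter_group,
        "Parameters": [{"ParameterName": "require_ssl", "ParameterValue": "true"}],
    }


settings = secure_cluster_settings("example-key-id", "example-subnet-group")
ssl_change = require_ssl_parameter("example-parameter-group")
print(settings["Encrypted"], ssl_change["Parameters"][0]["ParameterName"])
```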

Compliance

ISO 9001

SOC 3

SOC 2

ISO 27001

ISO 27017

PCI DSS Level 1

ISO 27018

SOC 1 / ISAE 3402

GxP

HIPAA

ITAR

FERPA

FISMA, RMF, and DIACAP

FedRAMP

Section 508 / VPAT

DoD SRG Levels 2 & 4

FIPS 140-2

CJIS

Cloud Security Alliance

MPAA

NIST

MLPS Level 3

G-Cloud

IT-Grundschutz

MTCS Tier 3

IRAP

Cyber Essentials Plus

Disaster Recovery

• Amazon EMR & Amazon Redshift clusters are resilient; failed nodes and hardware are replaced automatically

• Data in S3 is available across all Availability Zones in a Region

• S3 data can be synced across regions

• Amazon Redshift clusters are continuously backed up to S3 and snapshots can be synced to a second region
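
Copying snapshots to a second region can be switched on through the boto3 Redshift client's `enable_snapshot_copy` call. The sketch below builds the arguments and makes the call only inside `enable`. The cluster identifier, regions, and retention period are assumptions for illustration.

```python
def snapshot_copy_request(cluster_id, destination_region, retention_days=7):
    """Assemble keyword arguments for Redshift enable_snapshot_copy."""
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "ClusterIdentifier": cluster_id,
        "DestinationRegion": destination_region,
        "RetentionPeriod": retention_days,  # days to keep copied automated snapshots
    }


def enable(request, source_region="us-east-1"):
    """Apply the setting; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request can be built without it

    redshift = boto3.client("redshift", region_name=source_region)
    return redshift.enable_snapshot_copy(**request)


copy_request = snapshot_copy_request("example-cluster", "us-west-2")
print(copy_request)
```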

Customer Use Cases

NTT DOCOMO

[Diagram labels: Data Source, ET, Direct Connect, Client, Forwarder, Loader, State Management, Sandbox, Redshift, S3]

• Petabytes of data generated on premises and brought to Redshift in the cloud for analysis

• High-speed connectivity over a redundant pair of Direct Connect leased lines

• Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing

Nasdaq – Legacy Warehouse

Expensive ($1.16M annually)

Limited capacity (1 year of data online)

4-8 billion rows inserted per trading day, storing:

• Orders

• Trades

• Quotes

• Market Data

• Security Master

• Membership

The DW is used to analyze market share, client activity, and surveillance, to power billing, and more…

Nasdaq Architecture

[Architecture diagram. Zones: On premise; AWS Regional (Multi-AZ) Scope; AWS (US-East, primary AZ/VPC). Labels: RMS Input Sources (multiple systems), Data Ingest Process, MySQL, S3, Redshift Load files/Manifests, Redshift Snapshots/Backups, SNS, Data Loaded Topic, Redshift Database Cluster, HSM Key Appliance Cluster]

SmartNews

Summary

• AWS enables you to build sophisticated big data applications
  • Retrospective, Real-time, Predictive

• You can build incrementally, adding use cases and increasing scale as you go

• AWS provides a broad range of security and auditing features to enable you to meet your security requirements

• AWS makes it easy to build hybrid applications that span across your datacenters and the AWS Cloud

Thank you!

@rahulpathak
