Building Serverless Data Infrastructure in the AWS Cloud

Ryan Plant (@ryan_plant)

November 10, 2017

WHAT WE’LL COVER

The New Data Economy

Reference Architecture

Using the AWS Cloud

The world’s most valuable resource is no longer oil, but data…

May 6th, 2017

Data => Revenue (but extraction, refinement, packaging, and distribution needed)

Traditional Data Warehousing (DW)

Volume, variety, and velocity…

Advanced analytics…

Artificial intelligence…

“What got us here won’t (entirely) get us there…”

Mostly proprietary…

Costly and complex to scale…

Next Generation Data Infrastructure

(i.e. the “data lake”)

James “Data Lake” Dixon

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state…

From Data Warehouses to Lakes

A data pond, lake, or ocean is not a product; it’s an architecture… (and architecture is a principled and pattern-oriented approach to building systems)

Any and all data…

Any source and format…

Any time…

Reference Architecture

APPS & SOURCES

STORAGE AND PROCESSING LAYER
• Ingestion
• Storage
• Catalog
• Processing
• Analytics & Artificial Intelligence

SERVING LAYER
• Models & Marts
• API
• Search

DATA OPS
• Security
• Config
• Telemetry
• Cost Mgmt

Data Ingestion Pipelines

Sources: SERVICES and MONOLITHS

Ingestion channels:
• Change Data Capture (CDC)
• Streams
• Messaging
• File extracts

STORAGE
• Source data aggregated, stored indefinitely
• Many supported formats
• Written by append or PUT
• Security: segregation & encryption

Storage and Catalog

[Diagram: data flows from Ingestion into STORAGE, divided into RAW and REFINED, alongside the Catalog]

Catalog
• Register source and schema
• Data attribute inventory
• Relationships and dependencies
• Etc…
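The catalog responsibilities listed on this slide (registering source and schema, keeping an attribute inventory, tracking relationships and dependencies) can be sketched as a small in-memory structure. This is only an illustration: `CatalogEntry`, `Catalog`, and the dataset names are hypothetical and not taken from the talk or from any AWS API.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One registered dataset: its source, schema, and upstream dependencies."""
    name: str
    source: str
    schema: dict[str, str]  # attribute name -> type name
    depends_on: list[str] = field(default_factory=list)


class Catalog:
    """Hypothetical in-memory catalog, for illustration only."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Refuse to register a dataset whose dependencies are unknown.
        missing = [d for d in entry.depends_on if d not in self._entries]
        if missing:
            raise KeyError(f"unregistered dependencies: {missing}")
        self._entries[entry.name] = entry

    def attribute_inventory(self) -> dict[str, list[str]]:
        """Map each attribute name to the datasets that carry it."""
        inventory: dict[str, list[str]] = {}
        for entry in self._entries.values():
            for attribute in entry.schema:
                inventory.setdefault(attribute, []).append(entry.name)
        return inventory

    def lineage(self, name: str) -> list[str]:
        """Return every upstream dataset of `name`, nearest first."""
        seen: list[str] = []
        queue = list(self._entries[name].depends_on)
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self._entries[current].depends_on)
        return seen
```

Registering a raw dataset first and a refined dataset that depends on it second gives the catalog enough information to answer both "which datasets carry this attribute?" and "where did this dataset come from?".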

Raw to Refined Processing Pipelines

[Diagram: data flows from Ingestion into STORAGE (RAW, REFINED); Processing Pipelines and the Catalog sit alongside the datasets ALL DATA, C1, C2, C3, C..n, X, Y, Z]

• Preserve RAW data; enrich only
• Apply transforms to create new, REFINED datasets (e.g. customer partitioned views)
• Catalog new datasets
• Enable new use cases:
  • Reporting/Analytical views
  • Machine/Deep Learning
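A minimal sketch of the "preserve RAW, create REFINED" rule, assuming raw records are plain dictionaries with hypothetical `customer_id`, `quantity`, and `unit_price` fields (none of these names come from the talk). The transform builds a customer-partitioned view and never mutates its input.

```python
from collections import defaultdict
from typing import Iterable


def refine_by_customer(raw_records: Iterable[dict]) -> dict[str, list[dict]]:
    """Build a customer-partitioned REFINED view without touching RAW."""
    refined: dict[str, list[dict]] = defaultdict(list)
    for record in raw_records:
        # Copy and enrich; the RAW record itself is never modified.
        enriched = {**record, "total": record["quantity"] * record["unit_price"]}
        refined[record["customer_id"]].append(enriched)
    return dict(refined)
```

Because the raw records are left intact, the same input can later feed a different transform (for example a reporting view) without re-ingesting anything.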

Analytics and AI

[Diagram: data flows from Ingestion into STORAGE (RAW, REFINED); Analytics and Artificial Intelligence draw on the datasets ALL DATA, C1, C2, C3, C..n, X, Y, Z]

Curation and Serving

[Diagram: the serving layer is added on top of Ingestion, STORAGE (RAW, REFINED), Processing Pipelines, Catalog, and Analytics and Artificial Intelligence in two steps: first Models and Marts and Search, then the API]

Using the AWS Cloud

Lots of software, hardware, etc.

TRADITIONAL INVESTMENT IN NEXT GENERATION DATA

CAPITAL AND RISK BARRIERS

acquire/write and maintain software

procure, install, and maintain hardware

get commercial real estate license

PUBLIC CLOUD ECONOMIES OF SCALE

CLOUD OPTIMIZATION

Infrastructure as a Service
• Someone else’s hardware and real estate
• Your software, your (virtual) servers

Platform as a Service
• Someone else’s software, servers, hardware and real estate
• Your custom application software

Software as a Service
• Someone else’s application software, you provide the data
• (everything else doesn’t matter)

Cycle Time, Capital Optimization, Differentiation Focus:
• Infrastructure as a Service: High
• Platform as a Service: Higher
• Software as a Service: Highest

Go Serverless! (as much as possible)

Principles for event-driven, reactive data infrastructure primed for serverless architectures:
• everything is an event: messages, log entries, file I/Os, clock alarms, etc.
• listen for events: trigger a handler with an event
• stateless event handling: avoid state, persist as event source, hand off as soon as possible
• automation through orchestration and coordination
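These principles can be sketched in a few lines of plain Python. The registry, the `listen` decorator, and the event type names below are hypothetical and stand in for whatever trigger mechanism the platform provides; the sketch only shows handlers that are triggered by an event, hold no state, and hand off their result as a new event.

```python
from typing import Callable

Event = dict
Handler = Callable[[Event], Event]

# Hypothetical registry: event type -> handlers listening for it.
_handlers: dict[str, list[Handler]] = {}


def listen(event_type: str) -> Callable[[Handler], Handler]:
    """Register a handler to be triggered by events of one type."""
    def decorator(handler: Handler) -> Handler:
        _handlers.setdefault(event_type, []).append(handler)
        return handler
    return decorator


def dispatch(event: Event) -> list[Event]:
    """Trigger every handler listening for this event; collect what they emit."""
    return [handler(event) for handler in _handlers.get(event["type"], [])]


@listen("file.created")
def on_file_created(event: Event) -> Event:
    # Stateless: everything needed arrives in the event, and the result
    # is handed off as a new event rather than kept in memory.
    return {"type": "file.registered", "key": event["key"], "size": event["size"]}
```

Emitted events can be dispatched again, which is one simple way to chain handlers into a pipeline; a coordinating service would take over once the chain needs retries or branching.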

Ingestion and Storage

Ingestion
• Event triggers: SQS, SNS, Kinesis, DynamoDB/RDS
• Event handlers: AWS Lambda, of the form y = f(x), y = f(x, y), y = f([x, y])
• AWS Step Functions (coordinated state)
• Or to S3 direct

Storage
• AWS S3 (ready), laid out as /{source}-raw/{key}/YYYY-MM-DD and /{source}-refined/{key}/YYYY-MM-DD
• AWS Glacier (archival), via lifecycle policies
• KMS (encryption)
• IAM + Directory (access control)
• CloudWatch/CloudTrail
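A minimal sketch of the storage layout and of a Lambda-style handler triggered by an S3 event. It assumes the layout /{source}-raw/{key}/YYYY-MM-DD is read literally as a path; the function names and the `zone` parameter are hypothetical. The event fields used (`Records`, `s3.bucket.name`, `s3.object.key`) follow the S3 event notification format, and no AWS SDK call is made, so the sketch runs locally.

```python
from datetime import date
from urllib.parse import unquote_plus


def storage_path(source: str, zone: str, key: str, day: date) -> str:
    """Build /{source}-{zone}/{key}/YYYY-MM-DD for zone 'raw' or 'refined'."""
    if zone not in ("raw", "refined"):
        raise ValueError(f"unknown zone: {zone}")
    return f"/{source}-{zone}/{key}/{day.isoformat()}"


def handler(event: dict, context: object = None) -> list[str]:
    """Lambda-style handler: list the objects named in an S3 event notification."""
    created = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        created.append(f"/{bucket}/{key}")
    return created
```

For example, `storage_path("orders", "raw", "eu", date(2017, 11, 10))` yields `/orders-raw/eu/2017-11-10`.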

Catalog

[Diagram context: Sources, Ingestion, Storage, Processing Pipelines]

AWS Glue (serverless ETL/ELT)
• Source crawlers
• Classifiers
• Metadata
• Triggers
• Jobs and job runner: doSomething(…) {…}
• To Targets
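To make the crawler and classifier idea concrete, here is a rough sketch of what a classifier does conceptually: inspect a sample of a source and report its format and column names as metadata. This is not the AWS Glue classifier API; the `classify` function and its return shape are assumptions made for illustration.

```python
import csv
import io
import json


def classify(sample: str) -> dict:
    """Guess the format and column names of a small text sample."""
    try:
        # Treat the sample as JSON Lines if its first line is a JSON object.
        first_line = sample.splitlines()[0]
        parsed = json.loads(first_line)
        if isinstance(parsed, dict):
            return {"format": "json", "columns": sorted(parsed)}
    except (IndexError, ValueError):
        pass
    # Otherwise treat a multi-column first row as a CSV header.
    rows = list(csv.reader(io.StringIO(sample)))
    if rows and len(rows[0]) > 1:
        return {"format": "csv", "columns": rows[0]}
    return {"format": "unknown", "columns": []}
```

The returned dictionary is the kind of metadata a catalog would store for the source.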

Processing Pipelines

[Diagram context: Sources & Targets, Ingestion, Storage, Catalog]

• AWS Glue (serverless ETL/ELT)
• AWS EMR (Managed Hadoop)
• Streaming: Kinesis
• Batch: AWS Batch

Serving Layer

[Diagram context: Sources, Ingestion, Storage, Catalog, Processing Pipelines with AWS Glue (serverless ETL/ELT)]

• AWS ElasticSearch (managed ES)
• AWS Redshift Spectrum (Parallel DW)
• AWS Athena (Ad-hoc Query)

Serving Layer

[Diagram context: Sources, Ingestion, Storage, Catalog, Processing Pipelines]

• AWS API Gateway (serverless APIs)
• AWS QuickSight (visualization)
• AWS Cognito (Web/Mobile Identity and SSO)
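A serverless API in this layer is typically a function behind API Gateway. The sketch below assumes a Lambda proxy integration, where the function receives `pathParameters` in the event and returns a dictionary with `statusCode`, `headers`, and `body`. The `REFINED_VIEW` data and the `customer_id` path parameter are hypothetical placeholders for a real refined dataset.

```python
import json

# Hypothetical refined view; in practice this would be read from storage.
REFINED_VIEW = {"C1": {"orders": 2, "total": 14}}


def handler(event: dict, context: object = None) -> dict:
    """Serve one customer's summary in the Lambda proxy response shape."""
    # pathParameters is null when the route has no path parameters.
    customer_id = (event.get("pathParameters") or {}).get("customer_id")
    summary = REFINED_VIEW.get(customer_id)
    if summary is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(summary),
    }
```

The handler stays stateless: each request carries everything it needs, and the response body is a JSON string, as the proxy integration expects.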

CLOUD OPTIMIZATION

[Diagram: the Infrastructure as a Service, Platform as a Service, and Software as a Service comparison, annotated with “You are likely here…”, “Aim here…”, “TBD”, “Opportunity!”, and “Public Cloud R&D Investment”]

SERVERLESS: USE CAUTION

The floor is wet (and is constantly getting mopped!)

The edges are sharp:
• Development, Test, Debug tools and experience
• Configuration and Deployment challenges
• Variable, non-deterministic performance

Extremely new (but inevitable) paradigm…
